
1 On artificial neural networks

As [Fausett, 1994] states, an artificial neural network is an information-processing system with certain characteristics similar to those of biological neural networks. Artificial neural networks have thus been developed as generalizations of mathematical models from neural biology, based on the following hypotheses:

1. The information is being processed by simple elements, i.e., neurons.

2. The information flows between neurons over connection links.

3. Each connection link has an associated weight that multiplies the signal flowing in the link.

4. The neuron feeds its net input, i.e., the sum of weighted input signals, through an activation function (see Section 1.1).

Furthermore, [Fausett, 1994] proposes that the neural network should be characterized by its
architecture, its learning algorithm and its activation function. The architecture of the neural
network represents its pattern of connections between the neurons.

§1.1 Activation function


The activation function of a neuron defines its output value; it can vary, depending on the
application of the neural network, from a simple binary function to the hyperbolic tangent.
In the following, the activation functions commonly used in the field of artificial neural networks are discussed.

The Identity function has the mathematical formulation:

f(x) = x;   (1.1)

it has the range (−∞, ∞), is monotonic and its order of continuity is C∞. Figure 1.1 illustrates the function's graph.

Figure 1.1: Identity activation function.

CHAPTER 1. ON ARTIFICIAL NEURAL NETWORKS

The Step function (or the threshold function) is one of the first implemented activation functions, mainly because of its simplicity. Its drawback, however, is the discontinuity of its derivative, which renders the function unsuitable for gradient-based training. The equation is:

f(x) = { 0, x < 0,
         1, x ≥ 0.   (1.2)

It’s easy to notice that this function has the range {0, 1}. Figure 1.2 depicts the function’s graph.

Figure 1.2: Step activation function.
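As a quick Python sketch (the function names are illustrative, not from the source), the two activations above can be implemented directly:

```python
def identity(x):
    # Equation (1.1): f(x) = x, range (-inf, inf)
    return x

def step(x):
    # Equation (1.2): threshold at zero, range {0, 1}
    return 1 if x >= 0 else 0

# The step function maps negatives to 0 and non-negatives to 1
assert [step(x) for x in (-2.0, -0.5, 0.0, 1.5)] == [0, 0, 1, 1]
assert identity(3.7) == 3.7
```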

The Logistic function is a sigmoid curve, i.e., a function having a characteristic S-shaped curve. This activation function is widely used in classification problems. Its disadvantage is the high probability that a neuron reaches saturation, at which point the convergence of the learning algorithm becomes slow.

Mathematically, it can be described as follows:

f(x) = L / (1 + e^(−k(x − x0))),   (1.3)

where L is the curve's maximum value, k represents the steepness of the curve, x0 is the x-value of the sigmoid's inflection point and e is Euler's number.

The standard logistic function has the parameters k = 1, x0 = 0, and L = 1; this yields the following expression, illustrated in Figure 1.3:

f(x) = σ(x) = 1 / (1 + e^(−x)) = e^x / (1 + e^x).   (1.4)

An advantage of this function is its easily calculated derivative, which is also the probability density function of the logistic distribution:

d/dx f(x) = e^x / (1 + e^x)² = f(x) (1 − f(x)).   (1.5)

The Hyperbolic tangent function is similar to the logistic function, but its inflection point is centered at the origin; it thus inherits the same disadvantages. The analytical expression of this function is:

f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))   (1.6)


Figure 1.3: Standard Logistic activation function.

and its graph is shown in Figure 1.4. The range of this monotonic function is (−1, 1) and its
order of continuity is C ∞ . The derivative of Equation 1.6 is:
d/dx f(x) = 1 − f(x)².   (1.7)

Figure 1.4: Hyperbolic tangent activation function.
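Equation (1.7) can be verified in the same way as for the logistic function; a small sketch using the standard library:

```python
import math

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, Equation (1.7)
    return 1.0 - math.tanh(x) ** 2

# Compare against a central finite difference at a few points
h = 1e-6
for x in (-1.0, 0.0, 0.5):
    numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    assert abs(numeric - tanh_derivative(x)) < 1e-6
```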

The Rectified linear unit (ReLU, for short) is a commonly used function in convolutional neural network applications. As [Hahnloser et al., 2000] states, ReLU has a behaviour similar to that of a biological neuron and has computational advantages, i.e., only the addition, multiplication and comparison operators are involved.

The function is described by:

f(x) = x⁺ = { 0, x ≤ 0,
              x, x > 0.   (1.8)

The range of f shown in Equation 1.8 is [0, ∞). Both the function and its derivative are mono-
tonic, and the order of continuity of f is C 0 . It’s also easy to see that the function is scale
invariant, i.e., f (a · x) = a · f (x), ∀a ≥ 0. Figure 1.5 depicts the graph of f .
d/dx f(x) = { 0, x ≤ 0,
              1, x > 0.   (1.9)


Figure 1.5: Rectified linear unit activation function.
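The scale-invariance property f(a·x) = a·f(x) for a ≥ 0 can be exercised directly (a minimal sketch, names illustrative):

```python
def relu(x):
    # Equation (1.8): only a comparison is needed
    return x if x > 0 else 0.0

# Scale invariance: relu(a * x) == a * relu(x) for all a >= 0
for a in (0.0, 0.5, 3.0):
    for x in (-2.0, 0.0, 1.7):
        assert relu(a * x) == a * relu(x)
```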

A potential problem of ReLU is the dying ReLU problem, in which neurons reach a state where they become unresponsive to all input signals. This can lead to a scenario where a large number of neurons in a network become unresponsive, decreasing the model capacity.

§1.2 Single-layer feed-forward network


The Perceptron was introduced by [Rosenblatt, 1958] as a neural network unit. As depicted in Figure 1.6, this unit has only one layer of neurons with adaptable weights and biases. The mathematical description of a perceptron can be stated as follows:

f(x) = { 1, wᵀx + b > 0,
         0, otherwise,   (1.10)

where w ∈ Rⁿ is the weighting vector, i.e., w = (w1, ..., wn)ᵀ, x ∈ Rⁿ is the input vector and wᵀx represents the dot product, i.e., wᵀx = Σ_{i=1}^{n} wi·xi. The scalar b ∈ R is also known as the bias.

Figure 1.6: Perceptron.

Equation 1.10 states that the Perceptron performs binary classification, because the range of the function f is {0, 1}. The bias b shifts the decision boundary, altering the mapping of the input to an instance of the output.

For the learning algorithm to converge, the data set used for the Perceptron's training has to be linearly separable, i.e., there has to exist at least one hyperplane that separates the two classes. If the vectors from the data set are not linearly separable, the learning cannot reach a point where all the vectors x are classified correctly.

The training algorithm begins with the initialization of the weights and bias with pseudo-
random numbers. It then iterates through the training data set T that contains the input
vectors xi and known, or target, outputs ti , where i denotes a sample from T . Two steps are to
be performed:

1. Compute the function's output (here the ˆ symbol denotes an estimate):

   ŷ_j = f(w(t) · x_j) = f( Σ_{i=0}^{n} w_i(t) · x_{j,i} ).   (1.11)

2. Update the weighting vector according to the learning rate r:

   w_i(t + 1) = w_i(t) + r · (t_j − ŷ_j) · x_{j,i}.   (1.12)

Here r is a real, a priori chosen parameter.
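The two training steps above can be sketched in Python. The AND data set, learning rate and epoch count below are illustrative choices, not from the source; the bias is folded into the weight vector as w[0] with a constant input of 1:

```python
import random

def train_perceptron(samples, epochs=20, r=0.1, seed=0):
    """Train on (x, t) pairs via Equations (1.11)-(1.12)."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Weights (and bias as w[0]) initialized pseudo-randomly
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in samples:
            xa = [1.0] + list(x)                 # prepend constant bias input
            net = sum(wi * xi for wi, xi in zip(w, xa))
            y = 1 if net > 0 else 0              # output, Equation (1.10)
            # Update rule, Equation (1.12)
            w = [wi + r * (t - y) * xi for wi, xi in zip(w, xa)]
    return w

def predict(w, x):
    xa = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0

# Logical AND is linearly separable, so the algorithm converges
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
assert all(predict(w, x) == t for x, t in data)
```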

The multiclass Perceptron is a generalization to multiclass classification, that is, the classification into one of more than two classes [Aly, 2005]. A mathematical representation is given by a function f(x, y) that maps a pair of input and output to a feature vector ϕ ∈ R^p. The estimated output is chosen as:

ŷ = arg max_y f(x, y) · w.   (1.13)

This formulation reduces to the classical Perceptron problem when y is bound to {0, 1} and f(x, y) = yx.
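A common special case of Equation (1.13) keeps one weight vector per class and predicts the class with the highest score; this is an interpretation chosen for illustration, not the source's exact formulation:

```python
def multiclass_predict(W, x):
    # W holds one weight vector per class; f(x, y) . w reduces to
    # the dot product of class y's weight vector with the input x.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda c: scores[c])

# Hypothetical trained weights for three classes in R^2
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
assert multiclass_predict(W, [2.0, 0.5]) == 0
```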

§1.3 Multilayer feed-forward network


By contrast, a multilayer feed-forward network has one or more hidden layers between the input and the output layers. These layers may be fully or partially connected and typically have a high degree of connectivity. Figure 1.7 gives an intuitive representation of a fully connected neural network.

Generally, a multilayer network's weights are adjusted using a backpropagation algorithm, first described in [Werbos, 1974, pp. II-28–36]. The essence of this algorithm is computing the gradient of the loss function with respect to each weight using the chain rule. The gradient is computed one layer at a time, iterating backward from the last layer in order to eliminate redundant calculations of the intermediate terms that arise in the chain rule.
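As a minimal sketch of this chain-rule mechanics (not the full algorithm of [Werbos, 1974]), consider a toy network with one tanh hidden neuron and a linear output, trained on a squared-error loss; the analytic gradient is checked against a finite difference:

```python
import math

def forward(w, x):
    # Tiny 1-1-1 network: hidden tanh neuron, linear output.
    # w = [w_hidden, w_output]
    h = math.tanh(w[0] * x)
    return h, w[1] * h

def grads(w, x, t):
    # Backward pass: chain rule from the loss e = (t - y)^2
    h, y = forward(w, x)
    de_dy = -2.0 * (t - y)            # d e / d y
    dw1 = de_dy * h                   # d e / d w_output
    dh = de_dy * w[1]                 # d e / d h
    dw0 = dh * (1.0 - h * h) * x      # tanh'(z) = 1 - tanh(z)^2
    return [dw0, dw1]

# Check against a central-difference numerical gradient
w, x, t = [0.3, -0.8], 0.9, 0.5
analytic = grads(w, x, t)
eps = 1e-6
for i in range(2):
    wp, wm = list(w), list(w)
    wp[i] += eps
    wm[i] -= eps
    ep = (t - forward(wp, x)[1]) ** 2
    em = (t - forward(wm, x)[1]) ** 2
    assert abs((ep - em) / (2 * eps) - analytic[i]) < 1e-6
```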

The aforementioned loss function can be particularized for each application; a common choice is the mean squared error (MSE). The literature shows that the MSE can be used to assess the quality of a predictor, i.e., a function that maps arbitrary inputs to a set of values. [Wang and Bovik, 2009] argues that "MSE is an excellent metric in the context of optimization" because it has desirable properties such as convexity, symmetry and differentiability; as such, it becomes feasible to formulate iterative numerical optimization procedures, since the gradient and the Hessian of the MSE can be easily computed.


Figure 1.7: Multilayer feed-forward network (fully connected).

Using an MSE approach, the error for each neuron in the output layer can be computed with the formula

e = (1/N) · Σ_{k=1}^{N} (t[k] − ŷ[k])²,   (1.14)

where N is the dimension of the training set, t is the target output and ŷ is the predicted output.
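Equation (1.14) translates directly into code (a minimal sketch, function name illustrative):

```python
def mse(targets, predictions):
    # Equation (1.14): mean of the squared residuals over N samples
    assert len(targets) == len(predictions)
    n = len(targets)
    return sum((t - y) ** 2 for t, y in zip(targets, predictions)) / n

# Residuals 0, 0.5, 1 give (0 + 0.25 + 1) / 3 ~ 0.4167
assert abs(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]) - 1.25 / 3) < 1e-12
```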

Various algorithms can be used to update the weights of the model, such as Levenberg-Marquardt [Yu and Wilamowski, 2011] or gradient descent [Johansson et al., 1991]. Most notably, the Levenberg-Marquardt algorithm provides a numerical solution to the problem of minimizing a non-linear function, with fast and stable convergence. For the training of neural network models, this algorithm is best suited for small and medium-sized problems. The algorithm blends two popular methods, i.e., the steepest descent method and the Gauss-Newton algorithm; as such, it inherits the convergence speed of the Gauss-Newton algorithm and the convergence stability of the steepest descent method.

Bibliography
M. Aly. Survey on multiclass classification methods. Neural Networks, 19:1–9, 2005.

L. Fausett. Fundamentals of neural networks: architectures, algorithms, and applications. Prentice-Hall, Inc., 1994.

R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.

E. M. Johansson, F. U. Dowla, and D. M. Goodman. Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient method. International Journal of Neural Systems, 2(04):291–301, 1991.

F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological review, 65(6):386, 1958.

Z. Wang and A. C. Bovik. Mean squared error: Love it or leave it? a new look at signal fidelity
measures. IEEE signal processing magazine, 26(1):98–117, 2009.

P. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.

H. Yu and B. M. Wilamowski. Levenberg-Marquardt training. Industrial Electronics Handbook, 5(12):1, 2011.
