
Optimization

Refresher

Machine Learning / Neural Networks and Fuzzy Logic

BITS F464/F312

Optimization

An optimization problem deals with maximizing or minimizing a real function by systematically choosing input values from an allowed domain and computing the value of the function.
The final goal of solving an optimization problem could be to (i) find the optimal value of the function, or (ii) find the input(s) at which the function attains its optimum.
Depending on the application, an optimization problem can be solved using either (a) gradient-based optimization techniques or (b) gradient-free (i.e., derivative-free) optimization techniques.
Examples: (a) Gradient-based optimization: gradient descent,
Newton’s method; (b) Derivative-free optimization: Genetic
algorithm, Simulated annealing
Our focus shall be on the former techniques in this course.

Descent methods – General idea I

Let an objective function (a.k.a. cost function, error function, risk function) J(·) be defined on an n-dimensional input space θ = [θ1 , θ2 , . . . , θn ]^T. Our primary concern is to find a (possibly local) minimum point θ = θ* that minimizes J(θ).
In general, the function J(·) may be nonlinear in the adjustable parameter vector θ. Due to the complexity of J, an iterative algorithm is often adopted to explore the input space (the space of all θs) efficiently.
In iterative descent methods, the next point θnext is determined by a
step down from the current point θnow in a direction vector d :

θnext = θnow + η d , (1)

where η is some positive step size regulating how far to proceed in that direction. In ML, η is usually known as the learning rate.

Descent methods – General idea II

For algorithmic convenience, we write Eq. (1) as

θk+1 = θk + ηk dk (k = 1, 2, 3, . . .), (2)

where k denotes the current iteration number. The sequence θk is intended to converge to a (local) minimum θ*.
Iterative descent methods compute the kth step ηk dk through two procedures: first determining the direction dk, and then calculating the step size ηk. The next point should satisfy the following inequality:

J(θnext ) = J(θnow + η d ) < J(θnow ) (3)

We will not focus on determination of the parameter η.
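
As an illustration, here is a minimal sketch (in Python with NumPy; the quadratic test function and the step size are assumptions chosen only for demonstration) of one descent step of Eq. (2) with the direction taken as the negative gradient, including the check of Eq. (3):

```python
import numpy as np

def descent_step(J, grad_J, theta_now, eta):
    """One step of a generic descent method (Eq. (2)) with d = -grad J(theta_now)."""
    d = -grad_J(theta_now)              # steepest-descent direction
    theta_next = theta_now + eta * d
    # Eq. (3): a valid descent step must reduce the objective
    assert J(theta_next) < J(theta_now), "step did not decrease J; try a smaller eta"
    return theta_next

# Example: one step on J(theta) = theta_1^2 + 2*theta_2^2
J = lambda th: th[0]**2 + 2.0 * th[1]**2
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
theta = descent_step(J, grad_J, np.array([1.0, 1.0]), eta=0.1)
print(theta)   # [0.8, 0.6], with J reduced from 3.0 to 1.36
```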

Slope and Gradient
To find a local minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient (or of the approximate
gradient) of the function at the current point.
So, in one dimension the update is x ← x − η · (gradient at x): if the gradient is positive, x decreases; if the gradient is negative, x increases. In both cases x moves downhill toward the minimum.

Figure 1: From: Google image plus edits (only for demonstration purpose)
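
A minimal one-dimensional sketch of this behaviour (Python; the quadratic J(x) = (x − 3)^2 and the starting point are assumptions chosen only for illustration):

```python
def J(x):          # example objective: minimum at x = 3
    return (x - 3.0) ** 2

def dJ(x):         # its derivative (slope)
    return 2.0 * (x - 3.0)

x, eta = 0.0, 0.1  # start left of the minimum, where the slope is negative
for _ in range(50):
    x = x - eta * dJ(x)   # negative slope => x increases; positive slope => x decreases
print(x)           # approaches 3.0
```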
Gradient-based methods I

When the straight downhill direction d is determined on the basis of the gradient g of an objective function J(·), such descent methods are called gradient-based descent methods. Specifically, when d = −g, d is the steepest descent direction at the present point (say θnow).
The gradient of a differentiable function J : R^n → R at θ is the vector of first partial derivatives of J, denoted as g. That is,
g(θ) = ∇J(θ) = [ ∂J(θ)/∂θ1 , ∂J(θ)/∂θ2 , . . . , ∂J(θ)/∂θn ]^T    (4)
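
When an analytic expression for the gradient is inconvenient, Eq. (4) can be approximated numerically. A small sketch (Python/NumPy; the finite-difference step h and the test function are assumptions):

```python
import numpy as np

def numerical_gradient(J, theta, h=1e-6):
    """Central-difference approximation of g(theta) = grad J(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (J(theta + e) - J(theta - e)) / (2.0 * h)  # dJ / d theta_i
    return g

# Check against the analytic gradient of J(theta) = theta_1^2 + 2*theta_2^2
J = lambda th: th[0]**2 + 2.0 * th[1]**2
print(numerical_gradient(J, np.array([1.0, 1.0])))   # ~ [2.0, 4.0]
```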

Gradient-based methods II

Figure 2: feasible descent directions. Directions from the starting point θnow in
the shaded area are possible descent vector candidates. When d = −g , d is the
steepest descent direction at a local point θnow (Figure from the book:
Neuro-fuzzy and Soft Computing, J. -S. R. Jang et al.)

Gradient-based methods III
Since J is defined over a multi-dimensional input space, its gradient ∇J is a vector. For a given gradient g = ∇J, the downhill direction d must adhere to the following condition to be a feasible descent direction:
φ'(0) = dJ(θnow + η d)/dη |η=0 = g^T d = ||g|| ||d|| cos(ψ(θnow)) < 0,    (5)

where ψ signifies the angle between g and d , and so ψ(θnow ) denotes the
angle between gnow and d . This can be verified by the Taylor series
expansion of J:

J(θnow + η d) = J(θnow) + η g^T d + O(η^2)    (6)

Here O(η^2) collects the second- and higher-order terms (H.O.T.) in η. These terms are dominated by the linear term η g^T d as η → 0, so for a sufficiently small step size the sign of J(θnow + η d) − J(θnow) is determined by g^T d, which must therefore be negative for descent.
It should be noted that a descent direction alone does not guarantee convergence of the algorithm.
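
A small illustrative check of the feasibility condition g^T d < 0 in Eq. (5) (Python/NumPy; the vectors are made up for demonstration):

```python
import numpy as np

def is_descent_direction(g, d):
    """A direction d is a feasible descent direction iff g^T d < 0 (Eq. (5))."""
    return float(np.dot(g, d)) < 0.0

g = np.array([2.0, 4.0])                                # gradient at theta_now
print(is_descent_direction(g, -g))                      # True: steepest descent, d = -g
print(is_descent_direction(g, np.array([-1.0, 0.3])))   # True:  g.d = -0.8 < 0
print(is_descent_direction(g, np.array([1.0, -0.4])))   # False: g.d =  0.4 > 0 (uphill)
```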
Gradient-based methods IV
A class of gradient-based descent methods has the following fundamental
form, in which feasible descent directions can be determined by deflecting
the gradients through the multiplication by G (i.e. deflected gradients):

θnext = θnow − η Gg , (7)

Clearly, when d = −Gg with G positive definite, the descent condition (Eq. (5)) holds, since g^T d = −g^T Gg < 0 for g ≠ 0. There are many other variants, one of which is the Levenberg-Marquardt method (a discussion of it is beyond the scope of this course).
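
A minimal sketch (Python/NumPy; the diagonal matrix G used here is an arbitrary illustrative choice of positive definite deflection matrix) of the deflected-gradient update in Eq. (7):

```python
import numpy as np

def deflected_gradient_step(theta_now, g, G, eta):
    """theta_next = theta_now - eta * G @ g (Eq. (7)); G should be positive definite."""
    return theta_now - eta * (G @ g)

theta = np.array([1.0, 1.0])
g = np.array([2.0, 4.0])        # gradient at theta
G = np.diag([1.0, 0.5])         # illustrative positive definite deflection matrix
print(deflected_gradient_step(theta, g, G, eta=0.1))
# With G = I this reduces to plain steepest descent;
# with G = H^{-1} and eta = 1 it becomes Newton's method (discussed later).
```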

As we are interested in finding a local minimum of the function J, the iterates gradually approach some turning point, which is possibly a local minimum. At the turning point (which could be a minimum point, a maximum point, or a saddle point), the slope is 0 and hence the gradient
Gradient-based methods V

is 0. This means, we wish to find a value of θnow that satisfies the following condition:

g(θnow) = ∂J(θ)/∂θ |θ=θnow = 0    (8)

In practice, however, it is often difficult to solve the above equation analytically. For minimizing the objective function, the descent steps are typically repeated until some stopping condition is met.
The stopping condition could be one of the following:
The objective function value is sufficiently small.
The length of the gradient vector g is smaller than a specified value.
The specified computing time is exceeded.
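
Putting the pieces together, here is a minimal gradient-descent sketch (Python/NumPy; the tolerance values and the iteration budget standing in for computing time are illustrative assumptions) that uses the stopping conditions listed above:

```python
import numpy as np

def gradient_descent(J, grad_J, theta0, eta=0.1, tol_J=1e-8, tol_g=1e-6, max_iters=10000):
    """Steepest descent, stopped when J is small enough, ||g|| is small enough,
    or the iteration budget is exhausted."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iters):
        g = grad_J(theta)
        if J(theta) < tol_J or np.linalg.norm(g) < tol_g:
            break
        theta = theta - eta * g          # Eq. (2) with d_k = -g_k
    return theta

J = lambda th: th[0]**2 + 2.0 * th[1]**2
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
print(gradient_descent(J, grad_J, [1.0, 1.0]))   # converges near [0, 0]
```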

Gradient-based methods VI
Effect of η:
Convergence is very slow for small η

Figure 3: Slow convergence for a small η value (From: F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, Tech. Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, NY, 1957.)

Gradient-based methods VII

A large η leads to divergence from the minimum point.

Figure 4: Divergence for a large η value (From: F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, Tech. Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, NY, 1957.)

Gradient-based methods VIII

Sometimes, an adaptive η value is preferred during convergence (i.e., during training of a model).
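
One common and simple way to adapt η is to decay it over iterations. A minimal sketch (Python; the schedule η_k = η0 / (1 + decay·k) and its constants are illustrative assumptions, not the only possible choice):

```python
def decayed_learning_rate(eta0, k, decay=0.01):
    """Illustrative schedule: eta_k = eta0 / (1 + decay * k).

    Large steps early on (fast initial progress), smaller steps later (stable convergence).
    """
    return eta0 / (1.0 + decay * k)

for k in (0, 100, 1000):
    print(k, decayed_learning_rate(0.5, k))   # 0.5, 0.25, ~0.045
```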

Newton’s method for finding root of a real-valued function
(revision) I

Simple steps involved in Newton's method:
Make an initial approximation of θ (which should ideally be close to the solution; however, such a guess is difficult to make).
Determine the new value of θ using the following equation (finding θ at iterative step t + 1 from its value at step t):

θ(t + 1) = θ(t) − J(θ(t)) / J'(θ(t))    (9)

If |θ(t + 1) − θ(t)| is less than the desired accuracy (which should be specified), let θ(t + 1) serve as the final approximation. Otherwise, return to step 2 and calculate a new approximation. [Each calculation of a successive approximation is called an iteration.]

Newton’s method for finding root of a real-valued function
(revision) II

Convergence of Newton's method:
If the iterations get closer and closer to the correct answer, the method is said to converge.
However, Newton's method will not converge if:
J'(θ(t)) = 0 for some t, or
lim t→∞ θ(t) does not exist.

Example. Starting at initial θ = 1, use Newton's method to approximate a zero of the function J(θ) = θ^2 − 2.
Solution. We start at the initial guess θ(0) = 1. We also need J', i.e. J'(θ) = 2θ, so J'(θ(0) = 1) = 2. Now we can apply Newton's method for a few iterations to obtain the solution. [We should get θ = √2 ≈ 1.414.]
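
A short sketch of this worked example in Python (the iteration limit and tolerance are illustrative assumptions):

```python
def newton_root(J, dJ, theta0, tol=1e-10, max_iters=20):
    """Newton's method for a root of J (Eq. (9)): theta <- theta - J(theta)/J'(theta)."""
    theta = theta0
    for _ in range(max_iters):
        step = J(theta) / dJ(theta)      # fails if J'(theta) = 0, as noted above
        theta = theta - step
        if abs(step) < tol:              # |theta(t+1) - theta(t)| below desired accuracy
            break
    return theta

J  = lambda th: th**2 - 2.0
dJ = lambda th: 2.0 * th
print(newton_root(J, dJ, 1.0))           # ~1.41421356..., i.e. sqrt(2)
```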
Newton’s method for minimization of function I

Let us focus the discussion on the classical Newton's method only.

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function J (i.e. solutions to the equation J(θ) = 0).
In optimization, Newton's method is applied to the derivative J' of a twice-differentiable function J to find the roots of the derivative (i.e. solutions to J'(θ) = 0), also known as the stationary points of J.

The descent direction d can be determined by using the second derivative (Hessian) of the objective function J. For a general continuous objective function, there could be a number of local minimum points in the vicinity of the desired minimum.
In such a case, if the starting position θnow is sufficiently close to a local minimum, the objective function J is expected to be well approximated by a
Newton’s method for minimization of function II
quadratic form assuming that the higher order terms of ||θ − θnow || are
very small:

J(θ) ≈ J(θnow) + g^T (θ − θnow) + (1/2!) (θ − θnow)^T H (θ − θnow)    (10)
[This is from the Taylor series expansion of a differentiable function f(·). Note: A Taylor series is a series expansion of a function about a point. The one-dimensional Taylor series expansion of a real function f(x) about a point x = a is given by

f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)^2 + (f'''(a)/3!)(x − a)^3 + . . . + (f^(n)(a)/n!)(x − a)^n + . . . ]
In this case, since our objective function is multi-dimensional, the second derivative of the function is an n × n matrix of second-order partial derivatives (called the Hessian H), which is
Newton’s method for minimization of function III

      [ ∂²J/∂θ1²       ∂²J/∂θ1∂θ2    ...   ∂²J/∂θ1∂θn ]
      [ ∂²J/∂θ2∂θ1     ∂²J/∂θ2²      ...   ∂²J/∂θ2∂θn ]
H  =  [     ...             ...      ...       ...    ]    (11)
      [ ∂²J/∂θn∂θ1     ∂²J/∂θn∂θ2    ...   ∂²J/∂θn²   ]

For the function expansion in Eq. (10), we can find its minimum point θ̂ by differentiating Eq. (10) w.r.t. θ and setting it to zero. This subsequently leads to a set of linear equations:

0 = g + H (θ̂ − θnow ) (12)

Newton’s method for minimization of function IV

If the inverse of H exists, we have a unique solution. Newton's method chooses the minimum point θ̂ of the approximated quadratic function (assuming that the optimal point θ* = θ̂) as the next point from θnow,

θ̂ = θnow − H⁻¹ g    (13)

This method is also known as the Newton-Raphson method. The step −H⁻¹g is called the Newton step, and its direction is called the Newton direction. The general gradient-based formula in Eq. (7) reduces to Newton's method when G = H⁻¹ and η = 1.
Note that if the Hessian H is positive definite, denoted as H > 0, and J(θ) is quadratic, then Newton's method gets to the minimum in a single Newton step.
If that is not the case, then Newton's method should be applied repeatedly over a few iterations to get to the minimum.
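
A minimal sketch (Python/NumPy; the quadratic test function is an illustrative assumption) of the iterated Newton update θ ← θ − H⁻¹g of Eq. (13):

```python
import numpy as np

def newton_minimize(grad_J, hess_J, theta0, tol=1e-8, max_iters=50):
    """Newton's method for minimization: theta <- theta - H^{-1} g (Eq. (13))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:              # stationary point reached
            break
        H = hess_J(theta)
        theta = theta - np.linalg.solve(H, g)    # solve H * step = g rather than inverting H
    return theta

# Quadratic example J(theta) = theta_1^2 + 2*theta_2^2: a single Newton step suffices
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
hess_J = lambda th: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_minimize(grad_J, hess_J, [5.0, -3.0]))   # ~ [0, 0]
```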

Newton’s method for minimization of function V

Note: A symmetric matrix A is called positive definite if all its eigenvalues are positive. Equivalently, x^T A x > 0 for all x ∈ R^n with x ≠ 0.
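
A quick way to check this numerically (Python/NumPy sketch; the example matrices, including the Hessian of the quadratic used earlier, are illustrative):

```python
import numpy as np

def is_positive_definite(A):
    """True if the symmetric matrix A has only positive eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 4.0]])))   # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False (a saddle)
```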

Neural Networks I

A Neural Network, also called an Artificial Neural Network (ANN) for some obvious reasons, is a network structure consisting of a number of nodes (called neurons) connected through directional links (called synaptic connections).
These synaptic connections are responsible for transferring information from one neuron to another through some activation function (biologically, a biochemical phenomenon).
These connections specify causal relationships between the connected nodes. They carry knowledge about the whole system (called synaptic weights), which is systematically updated with experience using some learning rules.
A learning rule specifies how these parameters should be updated so as to minimize a prescribed error measure, which is basically a mathematical expression that measures the discrepancy between the network's actual output and the desired output of the system.
Neural Networks II

An ANN is used for system identification: our task is to find an appropriate network architecture and a set of parameters which can best model an unknown target system described by a set of input–output data pairs.
There are many different ANNs developed in the field. Though the architectures of these models differ, many of them are based on the same fundamental learning rules used during the process of training.
A few basic ANNs include the Perceptron, the Adaptive Linear Element (Adaline), and the multilayer perceptron (MLP).
At present, we focus on the MLP and the derivation of its learning algorithm using the derivative-based optimization techniques studied earlier.

