
Optimization

Refresher

Machine Learning / Neural Networks and Fuzzy Logic

BITS F464/F312

Optimization

An optimization problem deals with maximizing or minimizing a real function by systematically choosing input values from an allowed domain and computing the value of the function.
The final goal of solving an optimization problem could be to (i) find the optimal value of the function, or (ii) find the input(s) at which the function attains its optimum.
Depending on the application, an optimization problem can be solved using either (a) gradient-based optimization techniques or (b) gradient-free (i.e., derivative-free) optimization techniques.
Examples: (a) Gradient-based optimization: gradient descent,
Newton’s method; (b) Derivative-free optimization: Genetic
algorithm, Simulated annealing
Our focus shall be on the former techniques in this course.

Descent methods – General idea I

Let an objective function (a.k.a. cost function, error function, risk function) J(·) be defined on an n-dimensional input space θ = [θ1 , θ2 , . . . , θn ]^T. Our primary concern is to find a (possibly local) minimum point θ = θ* that minimizes J(θ).
In general, the function J(·) may be nonlinear in the adjustable parameter vector θ. Due to the complexity of J, an iterative algorithm is often adopted to explore the input space (the space of all θs) efficiently.
In iterative descent methods, the next point θnext is determined by a
step down from the current point θnow in a direction vector d :

θnext = θnow + η d , (1)

where η is some positive step size regulating how far to proceed in that direction. In ML, η is usually known as the learning rate.

Descent methods – General idea II

For algorithmic convenience, we write Eq. (1) as

θk+1 = θk + ηk dk (k = 1, 2, 3, . . .), (2)

where k denotes the current iteration number. The sequence θk is intended to converge to a (local) minimum θ*.
Iterative descent methods compute the kth step ηk dk through two procedures: first determining the direction dk, and then calculating the step size ηk. The next point should satisfy the following inequality:

J(θnext ) = J(θnow + η d ) < J(θnow ) (3)

We will not focus on determination of the parameter η.
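
As an illustration, here is a minimal sketch (in Python with NumPy; the quadratic test function and the step size are assumptions chosen only for demonstration) of one descent step of Eq. (2) with the direction taken as the negative gradient, including the check of Eq. (3):

```python
import numpy as np

def descent_step(J, grad_J, theta_now, eta):
    """One step of a generic descent method (Eq. (2)) with d = -grad J(theta_now)."""
    d = -grad_J(theta_now)              # steepest-descent direction
    theta_next = theta_now + eta * d
    # Eq. (3): a valid descent step must reduce the objective
    assert J(theta_next) < J(theta_now), "step did not decrease J; try a smaller eta"
    return theta_next

# Example: one step on J(theta) = theta_1^2 + 2*theta_2^2
J = lambda th: th[0]**2 + 2.0 * th[1]**2
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
theta = descent_step(J, grad_J, np.array([1.0, 1.0]), eta=0.1)
print(theta)   # [0.8, 0.6], with J reduced from 3.0 to 1.36
```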

Slope and Gradient
To find a local minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient (or of the approximate
gradient) of the function at the current point.
So, in one dimension the update is x ← x − η · (gradient at x): if the gradient is positive, x decreases; if the gradient is negative, x increases. In both cases x moves downhill toward the minimum.

Figure 1: From: Google image plus edits (only for demonstration purpose)
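
A minimal one-dimensional sketch of this behaviour (Python; the quadratic J(x) = (x − 3)^2 and the starting point are assumptions chosen only for illustration):

```python
def J(x):          # example objective: minimum at x = 3
    return (x - 3.0) ** 2

def dJ(x):         # its derivative (slope)
    return 2.0 * (x - 3.0)

x, eta = 0.0, 0.1  # start left of the minimum, where the slope is negative
for _ in range(50):
    x = x - eta * dJ(x)   # negative slope => x increases; positive slope => x decreases
print(x)           # approaches 3.0
```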
Gradient-based methods I

When the straight downhill direction d is determined on the basis of the gradient g of an objective function J(·), such descent methods are called gradient-based descent methods. Specifically, when d = −g, d is the steepest descent direction at the present point (say θnow).
The gradient of a differentiable function J : R^n → R at θ is the vector of first partial derivatives of J, denoted as g. That is,
g(θ) = ∇J(θ) = [ ∂J(θ)/∂θ1 , ∂J(θ)/∂θ2 , . . . , ∂J(θ)/∂θn ]^T    (4)
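
When an analytic expression for the gradient is inconvenient, Eq. (4) can be approximated numerically. A small sketch (Python/NumPy; the finite-difference step h and the test function are assumptions):

```python
import numpy as np

def numerical_gradient(J, theta, h=1e-6):
    """Central-difference approximation of g(theta) = grad J(theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (J(theta + e) - J(theta - e)) / (2.0 * h)  # dJ / d theta_i
    return g

# Check against the analytic gradient of J(theta) = theta_1^2 + 2*theta_2^2
J = lambda th: th[0]**2 + 2.0 * th[1]**2
print(numerical_gradient(J, np.array([1.0, 1.0])))   # ~ [2.0, 4.0]
```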

Gradient-based methods II

Figure 2: feasible descent directions. Directions from the starting point θnow in
the shaded area are possible descent vector candidates. When d = −g , d is the
steepest descent direction at a local point θnow (Figure from the book:
Neuro-fuzzy and Soft Computing, J. -S. R. Jang et al.)

Gradient-based methods III
Since J is defined over a multi-dimensional input space, its gradient ∇J is a vector. For a given gradient g = ∇J, the downhill direction d must adhere to the following condition to be a feasible descent direction:
φ'(0) = dJ(θnow + η d)/dη |η=0 = g^T d = ||g|| ||d|| cos(ψ(θnow)) < 0,    (5)

where ψ signifies the angle between g and d , and so ψ(θnow ) denotes the
angle between gnow and d . This can be verified by the Taylor series
expansion of J:

J(θnow + η d) = J(θnow) + η g^T d + O(η^2)    (6)

Here O(η^2) collects the second- and higher-order terms (H.O.T.) in η. These terms are dominated by the linear term η g^T d as η → 0, so for a sufficiently small step size the sign of J(θnow + η d) − J(θnow) is determined by g^T d, which must therefore be negative for descent.
It should be noted that a descent direction alone does not guarantee convergence of the algorithm.
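
A small illustrative check of the feasibility condition g^T d < 0 in Eq. (5) (Python/NumPy; the vectors are made up for demonstration):

```python
import numpy as np

def is_descent_direction(g, d):
    """A direction d is a feasible descent direction iff g^T d < 0 (Eq. (5))."""
    return float(np.dot(g, d)) < 0.0

g = np.array([2.0, 4.0])                                # gradient at theta_now
print(is_descent_direction(g, -g))                      # True: steepest descent, d = -g
print(is_descent_direction(g, np.array([-1.0, 0.3])))   # True:  g.d = -0.8 < 0
print(is_descent_direction(g, np.array([1.0, -0.4])))   # False: g.d =  0.4 > 0 (uphill)
```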
Gradient-based methods IV
A class of gradient-based descent methods has the following fundamental
form, in which feasible descent directions can be determined by deflecting
the gradients through the multiplication by G (i.e. deflected gradients):

θnext = θnow − η Gg , (7)

Clearly, when d = −Gg with G positive definite, the descent condition (Eq. (5)) holds, since g^T d = −g^T Gg < 0 for g ≠ 0. There are many other variants, one of which is the Levenberg-Marquardt method (a discussion of it is beyond the scope of this course).
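
A minimal sketch (Python/NumPy; the diagonal matrix G used here is an arbitrary illustrative choice of positive definite deflection matrix) of the deflected-gradient update in Eq. (7):

```python
import numpy as np

def deflected_gradient_step(theta_now, g, G, eta):
    """theta_next = theta_now - eta * G @ g (Eq. (7)); G should be positive definite."""
    return theta_now - eta * (G @ g)

theta = np.array([1.0, 1.0])
g = np.array([2.0, 4.0])        # gradient at theta
G = np.diag([1.0, 0.5])         # illustrative positive definite deflection matrix
print(deflected_gradient_step(theta, g, G, eta=0.1))
# With G = I this reduces to plain steepest descent;
# with G = H^{-1} and eta = 1 it becomes Newton's method (discussed later).
```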

As we are interested in finding a local minimum of the function J, the iterates gradually approach some turning point, which is possibly a local minimum. At the turning point (which could be a minimum point, a maximum point, or a saddle point), the slope is 0 and hence the gradient
Gradient-based methods V

is 0. This means, we wish to find a value of θnow that satisfies the following condition:

g(θnow) = ∂J(θ)/∂θ |θ=θnow = 0    (8)

In practice, however, it is often difficult to solve the above equation analytically. For minimizing the objective function, the descent steps are typically repeated until some stopping condition is met.
The stopping condition could be one of the following:
The objective function value is sufficiently small.
The length of the gradient vector g is smaller than a specified value.
The specified computing time is exceeded.
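
Putting the pieces together, here is a minimal gradient-descent sketch (Python/NumPy; the tolerance values and the iteration budget standing in for computing time are illustrative assumptions) that uses the stopping conditions listed above:

```python
import numpy as np

def gradient_descent(J, grad_J, theta0, eta=0.1, tol_J=1e-8, tol_g=1e-6, max_iters=10000):
    """Steepest descent, stopped when J is small enough, ||g|| is small enough,
    or the iteration budget is exhausted."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iters):
        g = grad_J(theta)
        if J(theta) < tol_J or np.linalg.norm(g) < tol_g:
            break
        theta = theta - eta * g          # Eq. (2) with d_k = -g_k
    return theta

J = lambda th: th[0]**2 + 2.0 * th[1]**2
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
print(gradient_descent(J, grad_J, [1.0, 1.0]))   # converges near [0, 0]
```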

Gradient-based methods VI
Effect of η:
Convergence is very slow for small η

Figure 3: Slow convergence for a small η value (From: F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, Tech. Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, NY, 1957.)

Gradient-based methods VII

A large η leads to divergence from the minimum point.

Figure 4: Divergence for a large η value (From: F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, Tech. Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, NY, 1957.)

Gradient-based methods VIII

Sometimes, an adaptive η value is preferred during convergence (i.e., during training of a model).
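
One common and simple way to adapt η is to decay it over iterations. A minimal sketch (Python; the schedule η_k = η0 / (1 + decay·k) and its constants are illustrative assumptions, not the only possible choice):

```python
def decayed_learning_rate(eta0, k, decay=0.01):
    """Illustrative schedule: eta_k = eta0 / (1 + decay * k).

    Large steps early on (fast initial progress), smaller steps later (stable convergence).
    """
    return eta0 / (1.0 + decay * k)

for k in (0, 100, 1000):
    print(k, decayed_learning_rate(0.5, k))   # 0.5, 0.25, ~0.045
```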

Newton’s method for finding root of a real-valued function
(revision) I

Simple steps involved in Newton's method:
Make an initial approximation of θ (which should ideally be close to the solution; however, such a guess is difficult to make).
Determine the new value of θ using the following equation (finding θ at iterative step t + 1 from its value at step t):

θ(t + 1) = θ(t) − J(θ(t)) / J'(θ(t))    (9)

If |θ(t + 1) − θ(t)| is less than the desired accuracy (which should be specified), let θ(t + 1) serve as the final approximation. Otherwise, return to step 2 and calculate a new approximation. [Each calculation of a successive approximation is called an iteration.]

Newton’s method for finding root of a real-valued function
(revision) II

Convergence of Newton's method:
If the iterations get closer and closer to the correct answer, the method is said to converge.
However, Newton's method will not converge if:
J'(θ(t)) = 0 for some t, or
lim t→∞ θ(t) does not exist.

Example. Starting at initial θ = 1, use Newton's method to approximate a zero of the function J(θ) = θ^2 − 2.
Solution. We start at the initial guess θ(0) = 1. We also need J', i.e. J'(θ) = 2θ, so J'(θ(0) = 1) = 2. Now we can apply Newton's method for a few iterations to obtain the solution. [We should get θ = √2 ≈ 1.414.]
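
A short sketch of this worked example in Python (the iteration limit and tolerance are illustrative assumptions):

```python
def newton_root(J, dJ, theta0, tol=1e-10, max_iters=20):
    """Newton's method for a root of J (Eq. (9)): theta <- theta - J(theta)/J'(theta)."""
    theta = theta0
    for _ in range(max_iters):
        step = J(theta) / dJ(theta)      # fails if J'(theta) = 0, as noted above
        theta = theta - step
        if abs(step) < tol:              # |theta(t+1) - theta(t)| below desired accuracy
            break
    return theta

J  = lambda th: th**2 - 2.0
dJ = lambda th: 2.0 * th
print(newton_root(J, dJ, 1.0))           # ~1.41421356..., i.e. sqrt(2)
```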
Newton’s method for minimization of function I

Let us focus the discussion on the classical Newton's method only.

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function J (i.e. solutions to the equation J(θ) = 0).
In optimization, Newton's method is applied to the derivative J' of a twice-differentiable function J to find the roots of the derivative (i.e. solutions to J'(θ) = 0), also known as the stationary points of J.

The descent direction d can be determined by using the second derivative (Hessian) of the objective function J. For a general continuous objective function, there could be a number of local minimum points in the vicinity of the desired minimum.
In such a case, if the starting position θnow is sufficiently close to a local minimum, the objective function J is expected to be well approximated by a
Newton’s method for minimization of function II
quadratic form assuming that the higher order terms of ||θ − θnow || are
very small:

J(θ) ≈ J(θnow) + g^T (θ − θnow) + (1/2!) (θ − θnow)^T H (θ − θnow)    (10)
[This is from the Taylor series expansion of a differentiable function f(·). Note: A Taylor series is a series expansion of a function about a point. The one-dimensional Taylor series expansion of a real function f(x) about a point x = a is given by

f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)^2 + (f'''(a)/3!)(x − a)^3 + . . . + (f^(n)(a)/n!)(x − a)^n + . . . ]
In this case, since our objective function is multi-dimensional, the second derivative of the function is an n × n matrix of second-order partial derivatives (called the Hessian H), which is
Newton’s method for minimization of function III

      [ ∂²J/∂θ1²       ∂²J/∂θ1∂θ2    ...   ∂²J/∂θ1∂θn ]
      [ ∂²J/∂θ2∂θ1     ∂²J/∂θ2²      ...   ∂²J/∂θ2∂θn ]
H  =  [     ...             ...      ...       ...    ]    (11)
      [ ∂²J/∂θn∂θ1     ∂²J/∂θn∂θ2    ...   ∂²J/∂θn²   ]

For the function expansion in Eq. (10), we can find its minimum point θ̂ by differentiating Eq. (10) w.r.t. θ and setting it to zero. This subsequently leads to a set of linear equations:

0 = g + H (θ̂ − θnow ) (12)

Newton’s method for minimization of function IV

If the inverse of H exists, we have a unique solution. Newton's method chooses the minimum point θ̂ of the approximated quadratic function (assuming that the optimal point θ* = θ̂) as the next point from θnow,

θ̂ = θnow − H⁻¹ g    (13)

This method is also known as the Newton-Raphson method. The step −H⁻¹g is called the Newton step, and its direction is called the Newton direction. The general gradient-based formula in Eq. (7) reduces to Newton's method when G = H⁻¹ and η = 1.
Note that if the Hessian H is positive definite, denoted as H > 0, and J(θ) is quadratic, then Newton's method gets to the minimum in a single Newton step.
If that is not the case, then Newton's method should be applied repeatedly over a few iterations to get to the minimum.
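
A minimal sketch (Python/NumPy; the quadratic test function is an illustrative assumption) of the iterated Newton update θ ← θ − H⁻¹g of Eq. (13):

```python
import numpy as np

def newton_minimize(grad_J, hess_J, theta0, tol=1e-8, max_iters=50):
    """Newton's method for minimization: theta <- theta - H^{-1} g (Eq. (13))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:              # stationary point reached
            break
        H = hess_J(theta)
        theta = theta - np.linalg.solve(H, g)    # solve H * step = g rather than inverting H
    return theta

# Quadratic example J(theta) = theta_1^2 + 2*theta_2^2: a single Newton step suffices
grad_J = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
hess_J = lambda th: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_minimize(grad_J, hess_J, [5.0, -3.0]))   # ~ [0, 0]
```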

Newton’s method for minimization of function V

Note: A symmetric matrix A is called positive definite if all its eigenvalues are positive. Equivalently, x^T A x > 0 for all x ∈ R^n with x ≠ 0.
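
A quick way to check this numerically (Python/NumPy sketch; the example matrices, including the Hessian of the quadratic used earlier, are illustrative):

```python
import numpy as np

def is_positive_definite(A):
    """True if the symmetric matrix A has only positive eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 4.0]])))   # True
print(is_positive_definite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False (a saddle)
```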

Neural Networks I

A Neural Network, also called an Artificial Neural Network (ANN) for some obvious reasons, is a network structure consisting of a number of nodes (called neurons) connected through directional links (called synaptic connections).
These synaptic connections are responsible for transferring information from one neuron to another through some activation function (biologically, a biochemical phenomenon).
These connections specify causal relationships between the connected nodes. They carry knowledge about the whole system (called synaptic weights), which is systematically updated with experience using some learning rules.
A learning rule specifies how these parameters should be updated so as to minimize a prescribed error measure, which is basically a mathematical expression that measures the discrepancy between the network's actual output and the desired output of the system.
Neural Networks II

An ANN is used for system identification: our task is to find an appropriate network architecture and a set of parameters which can best model an unknown target system described by a set of input–output data pairs.
There are many different ANNs developed in the field. Though the architectures of these models differ, many of them are based on the same fundamental learning rules used during the process of training.
A few basic ANNs include the Perceptron, the Adaptive Linear Element (Adaline), and the multilayer perceptron (MLP).
At present, we focus on the MLP and the derivation of its learning algorithm using the derivative-based optimization techniques studied earlier.

