
Optimization

Rowel Atienza
rowel@eee.upd.edu.ph

University of the Philippines


Optimization
Finding the parameters, θ, of a neural network that significantly reduce the cost
function J(θ)

Measured in terms of a performance measure, P, on the entire training set, plus some regularization terms

Caring about P, rather than J itself, is what makes optimization in machine learning
different from pure optimization, where minimizing J is the end goal itself.
Optimization
Loss function over the empirical distribution p̂_data (defined by the training set)

J(θ) = E_(x,y)~p̂_data [L(f(x; θ), y)]

f(x; θ) is the per-sample prediction

y is the label

Usually, we would rather minimize J* over the true data-generating distribution p_data

J*(θ) = E_(x,y)~p_data [L(f(x; θ), y)]


Empirical Risk Minimization
Empirical Risk Minimization:

J(θ) = E_(x,y)~p̂_data [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

m is the number of samples

Prone to overfitting
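A minimal Python sketch of the empirical risk as an average loss over the training set; the names loss_fn, predict, theta, X, and Y are hypothetical placeholders, not from the slides.

```python
def empirical_risk(loss_fn, predict, theta, X, Y):
    """Minimal sketch of J(theta) = (1/m) * sum_i L(f(x_i; theta), y_i).
    loss_fn, predict, theta, X, Y are hypothetical placeholders."""
    m = len(X)
    return sum(loss_fn(predict(x, theta), y) for x, y in zip(X, Y)) / m
```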
Minibatch Stochastic
Using the entire training set (the deterministic or batch approach) is expensive and gives
less than linear returns: the gradient estimate improves with batch size, but only at a sub-linear rate

Using a stochastic minibatch (a small subset of the entire training set) offers many
advantages:

Suitable for parallelization

GPUs perform better with power-of-2 batch sizes, typically 32 to 256

Small batches offer a regularizing effect and can improve generalization error

Shuffling the data so that minibatches are approximately independent improves training (see the sampling sketch below)
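A minimal sketch of minibatch sampling with per-epoch shuffling, assuming X and Y are NumPy arrays; the function name, batch size, and seed are illustrative.

```python
import numpy as np

def minibatches(X, Y, batch_size=64, seed=0):
    """Yield shuffled, approximately independent minibatches.
    batch_size and seed are illustrative; power-of-2 sizes (32-256) suit GPUs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]
```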


Challenges
Design of the loss function: ideally convex

Ill-conditioning of the Hessian matrix, H

Second-order Taylor expansion of the loss: a gradient step of size ε changes the cost by approximately (1/2)ε^2 g^T H g − ε g^T g

If the first term exceeds ε g^T g, learning is slow (the step may even increase the cost)

Problem with multiple local minima

If the loss function can be reduced to an acceptable level, parameters at a local minimum are acceptable
Challenges
Saddle Points

Saddle points: common in high-dimensional models

They have high cost but are easily escaped by SGD; SGD is designed to move
downhill, not necessarily to seek critical points

Newton's method encounters difficulty with saddle points and maxima, since it seeks critical points of any kind

A saddle-free Newton method can overcome saddle points (research in progress)
Challenges
Cliff

Near a cliff, gradient descent proposes a very large change, thus missing the minimum (exploding gradient)

The solution is gradient clipping: capping the gradient norm (see the sketch below)

This is a common problem in recurrent neural networks
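A minimal sketch of norm-based gradient clipping; the threshold value and function name are illustrative, not prescribed settings.

```python
import numpy as np

def clip_gradient(g, threshold=1.0):
    """Rescale g so its norm never exceeds threshold (threshold is illustrative)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, cap the magnitude
    return g
```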


Challenges
Long-Term Dependencies (e.g., RNN, LSTM)

Performing the same computation many times

Applying the same weight matrix W t times

W^t = (V diag(λ) V^-1)^t = V diag(λ)^t V^-1

if |λ| < 1, the corresponding term vanishes as t increases

if |λ| > 1, the corresponding term explodes as t increases

Gradients are likewise scaled by diag(λ)^t
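A small numerical illustration of this point, with an assumed 2x2 matrix and eigenvalues 0.9 and 1.1 chosen only for demonstration.

```python
import numpy as np

# Illustrative only: repeatedly applying the same W scales each eigen-direction
# by lambda**t, so |lambda| < 1 vanishes and |lambda| > 1 explodes.
V = np.array([[1.0, 1.0], [0.0, 1.0]])          # assumed eigenvector matrix
lam = np.array([0.9, 1.1])                      # assumed eigenvalues
W = V @ np.diag(lam) @ np.linalg.inv(V)
for t in (1, 10, 50):
    Wt = np.linalg.matrix_power(W, t)
    print(t, sorted(np.linalg.eigvals(Wt).real))  # ~lambda**t: one shrinks, one grows
```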


Challenges
Inexact gradients due to noisy or biased estimates

Local and Global Structure

Optimization does not necessarily lead to a critical point (global minimum, local minimum, or saddle point).

Most of the time, training only reaches points with near-zero gradient that give acceptable
performance
Challenges
Wrong side of the mountain: gradient descent will not find the minimum

Solution: an algorithm for choosing good initial points

Bad initial points send the objective function to the wrong side of the mountain
Parameter Initialization
The initial point determines whether or not the objective function will converge

Modern initialization strategies are simple and heuristic

Optimization for neural networks is not yet well understood

Initialize weights and biases with different random values so that units break symmetry (symmetry-breaking effect)

Large weights are good for optimization

Small weights are good for regularization


Parameter Initialization
Biases:

Small positive values (e.g., 0.1) for ReLU activations

1 for the LSTM forget gate

For an output layer with a highly skewed output distribution c, solve softmax(b) = c for the output bias b
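A minimal sketch of these heuristics for a single layer; the weight scale 0.01 and bias 0.1 are illustrative choices, not prescribed values.

```python
import numpy as np

def init_layer(n_in, n_out, weight_scale=0.01, relu_bias=0.1, seed=0):
    """Sketch of heuristic initialization: small random weights break symmetry,
    a small positive constant bias suits ReLU units (all values illustrative)."""
    rng = np.random.default_rng(seed)
    W = weight_scale * rng.standard_normal((n_out, n_in))  # random weights: symmetry breaking
    b = np.full(n_out, relu_bias)                          # small positive bias for ReLU
    return W, b
```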


Stochastic Gradient Descent
Instead of using the whole training set, we use a minibatch of m iid samples

Learning Rate:

Gradually decrease the learning rate during training, since after some time the
gradient noise from minibatch sampling becomes more significant

Apply learning-rate decay until iteration k = τ, after which the rate is held constant at ε_τ

ε_k = (1 − k/τ) ε_0 + (k/τ) ε_τ

ε_0, ε_τ, and τ are usually chosen by trial and error while monitoring the error curves (see the schedule sketch below).
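A minimal sketch of this linear decay schedule; eps0, eps_tau, and tau are illustrative values to be tuned.

```python
def lr_schedule(k, eps0=0.01, eps_tau=1e-4, tau=1000):
    """Linear learning-rate decay: eps_k = (1 - k/tau)*eps0 + (k/tau)*eps_tau,
    held constant at eps_tau once k >= tau (all constants illustrative)."""
    if k >= tau:
        return eps_tau
    a = k / tau
    return (1 - a) * eps0 + a * eps_tau
```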


Stochastic Gradient Descent
Theoretically, the excess error J(θ) − min_θ J(θ) has a lower bound of O(1/k) for
convex functions; pursuing convergence faster than O(1/k) does not improve the generalization
error and instead results in overfitting.

Generally, batch gradient descent converges better than SGD. One useful technique is
to increase the batch size gradually during training.
Momentum on SGD for Speed Improvement
v ← αv − εg

θ ← θ + v

where v is the velocity that accumulates the gradients g; v therefore includes the influence of past gradients

α is the momentum coefficient in [0, 1); typical values are 0.5, 0.9, and 0.99; the larger α is compared to ε, the
bigger the influence of past gradients, similar to a snowballing effect

Nesterov momentum: the gradient is evaluated after the momentum step is applied, g ← ∇_θ L(θ + αv) (see the sketch below)
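A minimal sketch of one momentum update with an optional Nesterov look-ahead; grad_fn is an assumed callable that returns the minibatch gradient, and eps, alpha are illustrative.

```python
def momentum_step(theta, v, grad_fn, eps=0.01, alpha=0.9, nesterov=False):
    """One (Nesterov) momentum SGD update. grad_fn is an assumed callable that
    returns the minibatch gradient at a given theta; eps, alpha are illustrative."""
    g = grad_fn(theta + alpha * v) if nesterov else grad_fn(theta)  # Nesterov: look ahead
    v = alpha * v - eps * g        # velocity accumulates past gradients
    theta = theta + v
    return theta, v
```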
Adaptive Learning Rates
AdaGrad (Adaptive Gradient): each parameter's learning rate is scaled inversely proportional to the square root of its
accumulated squared partial derivatives of the loss

r ← r + g ⊙ g

Δθ = −(ε / (δ + √r)) ⊙ g,   θ ← θ + Δθ

where δ is a small constant (e.g., 10^-7) and ε is the global learning rate

Effective for some deep learning models but not all; accumulating squared gradients from the very
start of training can cause an excessive decrease in the effective learning rate
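A minimal sketch of one AdaGrad update; eps is an illustrative global learning rate.

```python
import numpy as np

def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
    """One AdaGrad update (eps illustrative): per-parameter step sizes shrink
    as squared gradients accumulate in r."""
    r = r + g * g
    theta = theta - eps * g / (delta + np.sqrt(r))
    return theta, r
```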
Adaptive Learning Rates
RMSProp

r ← ρr + (1 − ρ) g ⊙ g

Δθ = −(ε / √(δ + r)) ⊙ g,   θ ← θ + Δθ

where δ is a small constant (e.g., 10^-7), ρ is the decay rate, and ε is the learning rate

Discards history from the extreme past via an exponentially weighted moving average

Effective and practical for deep neural nets
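A minimal sketch of one RMSProp update; eps and rho are illustrative.

```python
import numpy as np

def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-7):
    """One RMSProp update (eps, rho illustrative): an exponentially weighted
    average of squared gradients discards the extreme past."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - eps * g / np.sqrt(delta + r)
    return theta, r
```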


Adaptive Learning Rates
RMSProp with Nesterov Momentum

r ← ρr + (1 − ρ) g ⊙ g

v ← αv − (ε / √(δ + r)) ⊙ g

θ ← θ + v

where δ is a small constant (e.g., 10^-7)

ρ is the decay rate

α is the momentum coefficient

and the gradient g is evaluated at the interim point θ + αv
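A minimal sketch combining the two previous updates; grad_fn is an assumed callable returning the minibatch gradient, and all constants are illustrative.

```python
import numpy as np

def rmsprop_nesterov_step(theta, r, v, grad_fn, eps=0.001, rho=0.9, alpha=0.9, delta=1e-7):
    """RMSProp with Nesterov momentum. grad_fn is an assumed callable returning
    the minibatch gradient; eps, rho, alpha are illustrative."""
    g = grad_fn(theta + alpha * v)              # gradient at the interim (look-ahead) point
    r = rho * r + (1 - rho) * g * g             # accumulate squared gradients
    v = alpha * v - eps * g / np.sqrt(delta + r)
    theta = theta + v
    return theta, r, v
```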
Adaptive Learning Rates
Adam (Adaptive Moments)

t ← t + 1

first moment: s ← ρ_1 s + (1 − ρ_1) g,   s ← s / (1 − ρ_1^t)

second moment: r ← ρ_2 r + (1 − ρ_2) g ⊙ g,   r ← r / (1 − ρ_2^t)

Δθ = −ε s / (δ + √r)   (element-wise),   θ ← θ + Δθ

where δ is a small constant for numerical stabilization (e.g., 10^-8), ρ_1 and ρ_2 ∈ [0, 1)
(suggested: ρ_1 = 0.9, ρ_2 = 0.999), t is the time step, and the learning rate ε is suggested to be 0.001 (see the sketch below)
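A minimal sketch of one Adam update with the suggested default constants.

```python
import numpy as np

def adam_step(theta, s, r, t, g, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update using the suggested defaults from the slide."""
    t = t + 1
    s = rho1 * s + (1 - rho1) * g          # biased first moment
    r = rho2 * r + (1 - rho2) * g * g      # biased second moment
    s_hat = s / (1 - rho1 ** t)            # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r, t
```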
Reference
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
End
