
Optimization

Rowel Atienza
rowel@eee.upd.edu.ph

University of the Philippines


Optimization
Finding the parameters, θ, of a neural network that significantly reduce the cost
function J(θ)

Measured in terms of a performance measure, P, on the entire training set, plus some regularization terms

Caring about P, rather than J itself, is what makes optimization in machine learning
different from pure optimization, where minimizing J is the end goal itself.
Optimization
Loss function over the empirical distribution p̂_data (defined by the training set)

J(θ) = E_(x,y)~p̂_data [L(f(x; θ), y)]

f(x; θ) is the per-sample prediction

y is the label

Usually, we would rather minimize J* over the true data-generating distribution p_data

J*(θ) = E_(x,y)~p_data [L(f(x; θ), y)]


Empirical Risk Minimization
Empirical Risk Minimization:

J(θ) = E_(x,y)~p̂_data [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

m is the number of samples

Prone to overfitting
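A minimal Python sketch of the empirical risk as an average loss over the training set; the names loss_fn, predict, theta, X, and Y are hypothetical placeholders, not from the slides.

```python
def empirical_risk(loss_fn, predict, theta, X, Y):
    """Minimal sketch of J(theta) = (1/m) * sum_i L(f(x_i; theta), y_i).
    loss_fn, predict, theta, X, Y are hypothetical placeholders."""
    m = len(X)
    return sum(loss_fn(predict(x, theta), y) for x, y in zip(X, Y)) / m
```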
Minibatch Stochastic
Using the entire training set (the deterministic or batch approach) is expensive and gives
less than linear returns: the gradient estimate improves with batch size, but only at a sub-linear rate

Using a stochastic minibatch (a small subset of the entire training set) offers many
advantages:

Suitable for parallelization

GPUs perform better with power-of-2 batch sizes, typically 32 to 256

Small batches offer a regularizing effect and can improve generalization error

Shuffling the data so that minibatches are approximately independent improves training (see the sampling sketch below)
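A minimal sketch of minibatch sampling with per-epoch shuffling, assuming X and Y are NumPy arrays; the function name, batch size, and seed are illustrative.

```python
import numpy as np

def minibatches(X, Y, batch_size=64, seed=0):
    """Yield shuffled, approximately independent minibatches.
    batch_size and seed are illustrative; power-of-2 sizes (32-256) suit GPUs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]
```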


Challenges
Design of the loss function: ideally convex

Ill-conditioning of the Hessian matrix, H

Second-order Taylor expansion of the loss: a gradient step of size ε changes the cost by approximately (1/2)ε^2 g^T H g − ε g^T g

If the first term exceeds ε g^T g, learning is slow (the step may even increase the cost)

Problem with multiple local minima

If the loss function can be reduced to an acceptable level, parameters at a local minimum are acceptable
Challenges
Saddle Points

Saddle points: common in high-dimensional models

They have high cost but are easily escaped by SGD; SGD is designed to move
downhill, not necessarily to seek critical points

Newton's method encounters difficulty with saddle points and maxima, since it seeks critical points of any kind

A saddle-free Newton method can overcome saddle points (research in progress)
Challenges
Cliff

Near a cliff, gradient descent proposes a very large change, thus missing the minimum (exploding gradient)

The solution is gradient clipping: capping the gradient norm (see the sketch below)

This is a common problem in recurrent neural networks
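A minimal sketch of norm-based gradient clipping; the threshold value and function name are illustrative, not prescribed settings.

```python
import numpy as np

def clip_gradient(g, threshold=1.0):
    """Rescale g so its norm never exceeds threshold (threshold is illustrative)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, cap the magnitude
    return g
```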


Challenges
Long-Term Dependencies (e.g., RNN, LSTM)

Performing the same computation many times

Applying the same weight matrix W t times

W^t = (V diag(λ) V^-1)^t = V diag(λ)^t V^-1

if |λ| < 1, the corresponding term vanishes as t increases

if |λ| > 1, the corresponding term explodes as t increases

Gradients are likewise scaled by diag(λ)^t
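A small numerical illustration of this point, with an assumed 2x2 matrix and eigenvalues 0.9 and 1.1 chosen only for demonstration.

```python
import numpy as np

# Illustrative only: repeatedly applying the same W scales each eigen-direction
# by lambda**t, so |lambda| < 1 vanishes and |lambda| > 1 explodes.
V = np.array([[1.0, 1.0], [0.0, 1.0]])          # assumed eigenvector matrix
lam = np.array([0.9, 1.1])                      # assumed eigenvalues
W = V @ np.diag(lam) @ np.linalg.inv(V)
for t in (1, 10, 50):
    Wt = np.linalg.matrix_power(W, t)
    print(t, sorted(np.linalg.eigvals(Wt).real))  # ~lambda**t: one shrinks, one grows
```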


Challenges
Inexact gradients due to noisy or biased estimates

Local and Global Structure

Optimization does not necessarily lead to a critical point (global minimum, local minimum, or saddle point).

Most of the time, training only reaches points with near-zero gradient that give acceptable
performance
Challenges
Wrong side of the mountain: gradient descent will not find the minimum

Solution: an algorithm for choosing good initial points

Bad initial points send the objective function to the wrong side of the mountain
Parameter Initialization
The initial point determines whether or not the objective function will converge

Modern initialization strategies are simple and heuristic

Optimization for neural networks is not yet well understood

Initialize weights and biases with different random values so that units break symmetry (symmetry-breaking effect)

Large weights are good for optimization

Small weights are good for regularization


Parameter Initialization
Biases:

Small positive values (e.g., 0.1) for ReLU activations

1 for the LSTM forget gate

For an output layer with a highly skewed output distribution c, solve softmax(b) = c for the output bias b
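A minimal sketch of these heuristics for a single layer; the weight scale 0.01 and bias 0.1 are illustrative choices, not prescribed values.

```python
import numpy as np

def init_layer(n_in, n_out, weight_scale=0.01, relu_bias=0.1, seed=0):
    """Sketch of heuristic initialization: small random weights break symmetry,
    a small positive constant bias suits ReLU units (all values illustrative)."""
    rng = np.random.default_rng(seed)
    W = weight_scale * rng.standard_normal((n_out, n_in))  # random weights: symmetry breaking
    b = np.full(n_out, relu_bias)                          # small positive bias for ReLU
    return W, b
```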


Stochastic Gradient Descent
Instead of using the whole training set, we use a minibatch of m iid samples

Learning Rate:

Gradually decrease the learning rate during training, since after some time the
gradient noise from minibatch sampling becomes more significant

Apply learning-rate decay until iteration k = τ, after which the rate is held constant at ε_τ

ε_k = (1 − k/τ) ε_0 + (k/τ) ε_τ

ε_0, ε_τ, and τ are usually chosen by trial and error while monitoring the error curves (see the schedule sketch below).
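A minimal sketch of this linear decay schedule; eps0, eps_tau, and tau are illustrative values to be tuned.

```python
def lr_schedule(k, eps0=0.01, eps_tau=1e-4, tau=1000):
    """Linear learning-rate decay: eps_k = (1 - k/tau)*eps0 + (k/tau)*eps_tau,
    held constant at eps_tau once k >= tau (all constants illustrative)."""
    if k >= tau:
        return eps_tau
    a = k / tau
    return (1 - a) * eps0 + a * eps_tau
```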


Stochastic Gradient Descent
Theoretically, the excess error J(θ) − min_θ J(θ) has a lower bound of O(1/k) for
convex functions; pursuing convergence faster than O(1/k) does not improve the generalization
error and instead results in overfitting.

Generally, batch gradient descent converges better than SGD. One useful technique is
to increase the batch size gradually during training.
Momentum on SGD for Speed Improvement
v ← αv − εg

θ ← θ + v

where v is the velocity that accumulates the gradients g; v therefore includes the influence of past gradients

α is the momentum coefficient in [0, 1); typical values are 0.5, 0.9, and 0.99; the larger α is compared to ε, the
bigger the influence of past gradients, similar to a snowballing effect

Nesterov momentum: the gradient is evaluated after the momentum step is applied, g ← ∇_θ L(θ + αv) (see the sketch below)
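A minimal sketch of one momentum update with an optional Nesterov look-ahead; grad_fn is an assumed callable that returns the minibatch gradient, and eps, alpha are illustrative.

```python
def momentum_step(theta, v, grad_fn, eps=0.01, alpha=0.9, nesterov=False):
    """One (Nesterov) momentum SGD update. grad_fn is an assumed callable that
    returns the minibatch gradient at a given theta; eps, alpha are illustrative."""
    g = grad_fn(theta + alpha * v) if nesterov else grad_fn(theta)  # Nesterov: look ahead
    v = alpha * v - eps * g        # velocity accumulates past gradients
    theta = theta + v
    return theta, v
```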
Adaptive Learning Rates
AdaGrad (Adaptive Gradient): each parameter's learning rate is scaled inversely proportional to the square root of its
accumulated squared partial derivatives of the loss

r ← r + g ⊙ g

Δθ = −(ε / (δ + √r)) ⊙ g,   θ ← θ + Δθ

where δ is a small constant (e.g., 10^-7) and ε is the global learning rate

Effective for some deep learning models but not all; accumulating squared gradients from the very
start of training can cause an excessive decrease in the effective learning rate
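A minimal sketch of one AdaGrad update; eps is an illustrative global learning rate.

```python
import numpy as np

def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
    """One AdaGrad update (eps illustrative): per-parameter step sizes shrink
    as squared gradients accumulate in r."""
    r = r + g * g
    theta = theta - eps * g / (delta + np.sqrt(r))
    return theta, r
```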
Adaptive Learning Rates
RMSProp

r ← ρr + (1 − ρ) g ⊙ g

Δθ = −(ε / √(δ + r)) ⊙ g,   θ ← θ + Δθ

where δ is a small constant (e.g., 10^-7), ρ is the decay rate, and ε is the learning rate

Discards history from the extreme past via an exponentially weighted moving average

Effective and practical for deep neural nets
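A minimal sketch of one RMSProp update; eps and rho are illustrative.

```python
import numpy as np

def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-7):
    """One RMSProp update (eps, rho illustrative): an exponentially weighted
    average of squared gradients discards the extreme past."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - eps * g / np.sqrt(delta + r)
    return theta, r
```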


Adaptive Learning Rates
RMSProp with Nesterov Momentum

r ← ρr + (1 − ρ) g ⊙ g

v ← αv − (ε / √(δ + r)) ⊙ g

θ ← θ + v

where δ is a small constant (e.g., 10^-7)

ρ is the decay rate

α is the momentum coefficient

and the gradient g is evaluated at the interim point θ + αv
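A minimal sketch combining the two previous updates; grad_fn is an assumed callable returning the minibatch gradient, and all constants are illustrative.

```python
import numpy as np

def rmsprop_nesterov_step(theta, r, v, grad_fn, eps=0.001, rho=0.9, alpha=0.9, delta=1e-7):
    """RMSProp with Nesterov momentum. grad_fn is an assumed callable returning
    the minibatch gradient; eps, rho, alpha are illustrative."""
    g = grad_fn(theta + alpha * v)              # gradient at the interim (look-ahead) point
    r = rho * r + (1 - rho) * g * g             # accumulate squared gradients
    v = alpha * v - eps * g / np.sqrt(delta + r)
    theta = theta + v
    return theta, r, v
```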
Adaptive Learning Rates
Adam (Adaptive Moments)

t ← t + 1

first moment: s ← ρ_1 s + (1 − ρ_1) g,   s ← s / (1 − ρ_1^t)

second moment: r ← ρ_2 r + (1 − ρ_2) g ⊙ g,   r ← r / (1 − ρ_2^t)

Δθ = −ε s / (δ + √r)   (element-wise),   θ ← θ + Δθ

where δ is a small constant for numerical stabilization (e.g., 10^-8), ρ_1 and ρ_2 ∈ [0, 1)
(suggested: ρ_1 = 0.9, ρ_2 = 0.999), t is the time step, and the learning rate ε is suggested to be 0.001 (see the sketch below)
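A minimal sketch of one Adam update with the suggested default constants.

```python
import numpy as np

def adam_step(theta, s, r, t, g, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update using the suggested defaults from the slide."""
    t = t + 1
    s = rho1 * s + (1 - rho1) * g          # biased first moment
    r = rho2 * r + (1 - rho2) * g * g      # biased second moment
    s_hat = s / (1 - rho1 ** t)            # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r, t
```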
Reference
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
End
