Rowel Atienza
rowel@eee.upd.edu.ph
The performance measure P is what makes optimization in machine learning different from pure optimization: we minimize a cost function J in the hope of improving P, rather than treating the minimization itself as the end goal.
Optimization
Loss function under the empirical distribution $\hat{p}_{\text{data}}$ (over the training set):

$J(\theta) = \mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}}\, L(f(x;\theta), y)$

where $y$ is the label.
Ideally, we want to minimize $J^*(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\, L(f(x;\theta), y)$ over the true data-generating distribution $p_{\text{data}}$; minimizing only the empirical version is prone to overfitting.
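As a concrete illustration (not from the slides), a minimal NumPy sketch of the empirical risk, assuming a squared-error loss and a linear model $f(x;\theta) = x^\top\theta$; the names are illustrative:

```python
import numpy as np

def empirical_risk(theta, X, y):
    """J(theta): the average of L(f(x; theta), y) over the training set,
    i.e. the expectation of the loss under the empirical distribution."""
    predictions = X @ theta                 # f(x; theta) = x . theta per example
    return np.mean((predictions - y) ** 2)  # squared-error loss L
```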
Minibatch Stochastic Methods
Using the entire training set (known as the deterministic or batch approach) is expensive and has less-than-linear returns: the gradient estimate is good, but its standard error falls only with the square root of the batch size.
Using a minibatch stochastic method (a small random subset of the entire training set per step) offers many advantages (a code sketch follows this list):
Critical points with high cost are less of a problem for SGD, which is designed to move downhill and not necessarily to seek points of zero gradient.
Newton's method, by contrast, encounters difficulty with saddle points and maxima, since it solves for any nearby critical point.
Most of the time, we settle for a point with a near-zero gradient and acceptably low cost rather than a true minimum.
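A minimal sketch of minibatch SGD as described above (NumPy; grad_fn, lr, and the other names are illustrative placeholders, not from the slides):

```python
import numpy as np

def minibatch_sgd(theta, X, y, grad_fn, lr=0.01, batch_size=32, steps=1000):
    """Minibatch SGD: each step estimates the gradient of J(theta)
    from a small random subset of the training set and moves downhill."""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        g = grad_fn(theta, X[idx], y[idx])  # noisy gradient estimate
        theta = theta - lr * g              # downhill step
    return theta
```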
Challenges
Wrong side of the mountain: started on the wrong side of a hill in the cost surface, gradient descent will not find the minimum, so initialization matters.
Initialize weights and biases with different random values (the symmetry-breaking effect): units initialized identically would receive identical updates and remain redundant.
Learning Rate:
Gradually decrease the learning rate during training, since after some time the gradient due to noise becomes more significant than the true gradient. A common schedule is linear decay until iteration $\tau$:

$\epsilon_k = \left(1 - \frac{k}{\tau}\right)\epsilon_0 + \frac{k}{\tau}\,\epsilon_\tau, \qquad k \le \tau$

after which $\epsilon$ is held constant at $\epsilon_\tau$.
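A minimal sketch of this linear decay schedule (eps0, eps_tau, and tau mirror $\epsilon_0$, $\epsilon_\tau$, and $\tau$ above; the default values are illustrative):

```python
def linear_lr(k, eps0=0.01, eps_tau=0.0001, tau=10000):
    """Linearly decay the learning rate from eps0 to eps_tau over tau steps,
    then hold it constant at eps_tau."""
    if k >= tau:
        return eps_tau
    frac = k / tau
    return (1 - frac) * eps0 + frac * eps_tau
```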
Generally, batch gradient descent converges more reliably than SGD; a useful technique is to increase the batch size gradually over the course of training.
Momentum on SGD for Speed Improvement
$v \leftarrow \alpha v - \epsilon g, \qquad \theta \leftarrow \theta + v$

$\alpha \in [0,1)$ is the momentum coefficient; typical values are 0.5, 0.9, and 0.99. The larger $\alpha$ is relative to $\epsilon$, the bigger the influence of past gradients $g$, similar to a snowballing effect.
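A minimal sketch of one momentum update, directly transcribing the rule above (names are illustrative):

```python
def momentum_step(theta, v, g, lr=0.01, alpha=0.9):
    """One SGD-with-momentum update:
    v <- alpha*v - lr*g;  theta <- theta + v."""
    v = alpha * v - lr * g  # exponentially decaying accumulation of gradients
    theta = theta + v       # move along the accumulated velocity
    return theta, v
```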
Adaptive Learning Rates
AdaGrad

$r \leftarrow r + g \odot g, \qquad \theta \leftarrow \theta - \frac{\epsilon}{\delta + \sqrt{r}} \odot g$

Effective for some deep learning models but not all: accumulating squared gradients from the very start of training can cause a premature and excessive decrease in the effective learning rate.
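A minimal sketch of one AdaGrad update as written above (NumPy; the default delta is a small stabilizer, an illustrative choice consistent with the formula):

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.01, delta=1e-7):
    """One AdaGrad update: r accumulates all squared gradients, shrinking
    the effective learning rate in frequently updated directions."""
    r = r + g * g
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r
```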
Adaptive Learning Rates
RMSProp
$r \leftarrow \rho r + (1-\rho)\, g \odot g, \qquad \theta \leftarrow \theta - \frac{\epsilon}{\sqrt{\delta + r}} \odot g$
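A minimal sketch of one RMSProp update as written above (NumPy; default hyperparameters are illustrative):

```python
import numpy as np

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update: r is an exponentially decaying average of squared
    gradients, so gradients from the distant past are gradually forgotten."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - lr * g / np.sqrt(delta + r)
    return theta, r
```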
RMSProp combined with momentum:

$r \leftarrow \rho r + (1-\rho)\, g \odot g, \qquad v \leftarrow \alpha v - \frac{\epsilon}{\sqrt{r}} \odot g, \qquad \theta \leftarrow \theta + v$

where $\alpha$ is the momentum coefficient.
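A minimal sketch of one RMSProp-with-momentum update (NumPy; the small delta inside the square root is an added assumption for numerical safety, not in the slide's formula):

```python
import numpy as np

def rmsprop_momentum_step(theta, r, v, g, lr=0.001, rho=0.9, alpha=0.9,
                          delta=1e-6):
    """One RMSProp-with-momentum update: the RMSProp-scaled gradient feeds a
    momentum velocity, which in turn updates the parameters."""
    r = rho * r + (1 - rho) * g * g
    v = alpha * v - lr * g / np.sqrt(delta + r)  # delta: numerical safety (assumption)
    theta = theta + v
    return theta, r, v
```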
Adaptive Learning Rates
Adam (Adaptive Moments)
$t \leftarrow t + 1$
first moment: $s \leftarrow \rho_1 s + (1-\rho_1)\, g$, with bias correction $\hat{s} \leftarrow s / (1-\rho_1^t)$
second moment: $r \leftarrow \rho_2 r + (1-\rho_2)\, g \odot g$, with bias correction $\hat{r} \leftarrow r / (1-\rho_2^t)$

$\Delta\theta = -\epsilon\, \frac{\hat{s}}{\delta + \sqrt{\hat{r}}}, \qquad \theta \leftarrow \theta + \Delta\theta$

where $\delta$ is a small constant for numerical stabilization (e.g. $10^{-8}$), $\rho_1, \rho_2 \in [0, 1)$ (suggested: $\rho_1 = 0.9$, $\rho_2 = 0.999$), $t$ is the time step, and the learning rate $\epsilon$ is suggested to be 0.001.
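A minimal sketch of one Adam update following the rule above (NumPy; names are illustrative):

```python
import numpy as np

def adam_step(theta, s, r, t, g, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: bias-corrected first and second moment estimates
    of the gradient, following the update rule above."""
    t = t + 1
    s = rho1 * s + (1 - rho1) * g        # first moment (mean of gradients)
    r = rho2 * r + (1 - rho2) * g * g    # second moment (mean of squared gradients)
    s_hat = s / (1 - rho1 ** t)          # bias corrections for zero initialization
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r, t
```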
Reference
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
End