Foundations of Deep Learning

Julio Cesar Mendoza Bobadilla

Institute of Computing
University of Campinas

January 11, 2019


Definitions

Representation Learning. "A set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for some task" [1]

Deep Learning. "Representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a higher, slightly more abstract one" [1]

[1] LeCun et al., Deep learning, 2015

Traditional Machine Learning Pipeline

- Traditional machine learning algorithms are based on high-level attributes or features.
- This requires feature engineering.

Representation Learning Pipeline

- Learn features and classifier directly from raw data.
- Also known as end-to-end learning.

Deep Learning

- Learn a hierarchy of representations that build on each other, from simple to complex.

Multilayer Perceptron

- Pre-activation for layer k (with h^{(0)}(x) = x):

  a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k−1)}(x)

- Hidden layer activation:

  h^{(k)}(x) = g(a^{(k)}(x))

- Output layer activation:

  h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)

[Figure: a network mapping x = (x1, x2, x3) through hidden layers h^{(1)}(x) and h^{(2)}(x), with weights W^{(1)}, W^{(2)}, W^{(3)} and biases b^{(1)}, b^{(2)}, b^{(3)}]
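The three equations above can be sketched in NumPy; the layer sizes, random initialization, sigmoid hidden units, and softmax output below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """a^(k) = b^(k) + W^(k) h^(k-1),  h^(k) = g(a^(k)),  h^(0) = x."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(b + W @ h)                  # hidden layers
    a_out = biases[-1] + weights[-1] @ h        # output pre-activation a^(L+1)
    e = np.exp(a_out - a_out.max())             # softmax output o(a^(L+1))
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # input x in R^3, two hidden layers, 2 output classes
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

f_x = forward(np.array([1.0, -0.5, 2.0]), weights, biases)
print(f_x.sum())  # softmax output sums to 1
```
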
Activation function

Sigmoid activation function

- Puts pre-activations between 0 and 1
- Always positive
- Bounded
- Strictly increasing

  g(x) = 1 / (1 + exp(−x))

[Plot: sigmoid curve over x ∈ [−4, 4], values in (0, 1)]
Activation function

Rectified linear activation function

- Puts pre-activations in [0, +∞)
- Always non-negative
- Not upper bounded
- Strictly increasing for x > 0
- Tends to give neurons with sparse activities

  g(x) = max(0, x)

[Plot: ReLU over x ∈ [−4, 4]]
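A minimal sketch of the two activation functions above (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)): bounded in (0, 1), strictly increasing
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # g(x) = max(0, x): non-negative, not upper bounded, sparse activations
    return np.maximum(0.0, x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(x))  # all values strictly inside (0, 1)
print(relu(x))     # negative inputs clamped to 0
```
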
Activation function

Softmax activation function

  softmax(a)_c = exp(a_c) / Σ_{i=1}^{C} exp(a_i)

- Used for multiclass classification.
- Each activation is the conditional probability of the c-th class: p(y = c | x)

  o(a) = softmax(a) = [ exp(a_1)/Σ_{i=1}^{C} exp(a_i), ..., exp(a_C)/Σ_{i=1}^{C} exp(a_i) ]^T

- Strictly positive
- Sums to one
- Predicted class is the activation with highest estimated probability.
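In practice the softmax is usually computed after subtracting max(a) from every entry; the result is unchanged (softmax is invariant to adding a constant) but overflow in exp is avoided. A sketch with illustrative values:

```python
import numpy as np

def softmax(a):
    # subtract max(a) for numerical stability; the output is identical
    # because softmax is invariant to shifting all inputs by a constant
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])  # naive exp(a) would overflow
p = softmax(a)
print(p.sum())          # sums to one
print(int(p.argmax()))  # predicted class: the largest pre-activation
```
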
Forward propagation

- Can be represented as an acyclic flow graph.
- The output of each box can be calculated given its parents.

[2] Hugo Larochelle, Neural Networks, 2017
Optimization

We want to find the x ∈ θ such that f(x) is minimum.

- Suppose we start at point u. Which is the best direction to search for the point v, with f(v) < f(u)?
- If the slope at u is negative, f decreases to the right: go right!
- If the slope at u is positive, f decreases to the left: go left!
- General principle of gradient descent optimization:

  v = u − α ∂f/∂u

  where α is the learning rate.
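The update rule can be sketched in one dimension; the quadratic objective, learning rate, and step count below are illustrative choices:

```python
def gradient_descent(df, u, alpha=0.1, steps=100):
    # v = u - alpha * df/du, repeated: always move against the slope
    for _ in range(steps):
        u = u - alpha * df(u)
    return u

# f(x) = (x - 3)^2 has its minimum at x = 3; df/dx = 2(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), u=-5.0)
print(round(x_min, 4))  # 3.0
```

Starting at u = −5 the slope is negative, so each update moves right toward the minimum at 3.
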
Empirical risk minimization

- Framework to design learning algorithms:

  argmin_θ (1/T) Σ_t L(f(x^{(t)}; θ), y^{(t)}) + λ Ω(θ)

  where the first term is the loss function and Ω(θ) is the regularizer.

- We should optimize the classification error directly, but it is not smooth.
- The loss function is a surrogate for what we truly should optimize.
Stochastic Gradient Descent

- Performs updates after each training example.

  procedure StochasticGradientDescent
      require: network parameters θ = (W^{(1)}, b^{(1)}, ..., W^{(L+1)}, b^{(L+1)})
      initialize θ
      for N epochs do
          for each training sample (x^{(t)}, y^{(t)}) do
              Δ = −∇_θ L(f(x^{(t)}; θ), y^{(t)}) − λ ∇_θ Ω(θ)
              θ = θ + α Δ

- To apply the SGD algorithm we require:
  - The loss function L(f(x^{(t)}; θ), y^{(t)})
  - The regularizer Ω(θ)
  - A procedure to compute the parameter gradients ∇_θ L(f(x^{(t)}; θ), y^{(t)}) and λ ∇_θ Ω(θ)
  - A procedure to initialize the parameters
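A sketch of this procedure on an illustrative model: logistic regression with L2 regularization on synthetic data. The data, learning rate, regularization strength, and epoch count are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy separable problem: y = 1 when x1 + x2 > 0 (illustrative data)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0          # initialize theta
alpha, lam = 0.1, 1e-4           # learning rate, L2 strength

for epoch in range(20):                              # for N epochs
    for x_t, y_t in zip(X, y):                       # for each training sample
        p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))     # f(x^(t); theta)
        g = p - y_t                                  # cross-entropy gradient
        w += alpha * (-(g * x_t) - lam * 2 * w)      # theta = theta + alpha * Delta
        b += alpha * (-g)

acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
print(acc)  # training accuracy, close to 1.0 on this separable data
```
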
Loss function

- On classification problems, neural networks estimate:

  f(x)_c = p(y = c | x)

- Cross-entropy minimizes the negative log-likelihood:

  L(f(x), y) = − Σ_c 1_{(y=c)} log f(x)_c        (1)

- The log is used for numerical stability and mathematical simplicity.


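Because the indicator 1_{(y=c)} selects a single term, equation (1) reduces to −log of the probability assigned to the true class. A sketch with illustrative values:

```python
import numpy as np

def cross_entropy(f_x, y):
    # L(f(x), y) = -sum_c 1(y=c) log f(x)_c = -log f(x)_y
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])   # softmax output over 3 classes
print(cross_entropy(f_x, 1))      # small: high probability on the true class
print(cross_entropy(f_x, 0))      # large: low probability on the true class
```
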
Computing gradients

Performed using the backpropagation algorithm.

- Use the chain rule to efficiently compute gradients from top to bottom.

  procedure Backpropagation
      Gradient of the loss on the output layer:
          g ← ∇_{f(x)} J = ∇_{f(x)} L(f(x), y)          ▷ ∂J/∂f_i = −y_i (1/f_i)
      for k = L, L−1, ..., 1 do
          Gradient on the pre-activation:
              g ← ∇_{a^{(k)}(x)} J = g ⊙ f′(a^{(k)}(x))  ▷ ∂J/∂a_i = Σ_j (∂J/∂f_j)(∂f_j/∂a_i) = f_i − y_i
          Gradient on the bias of the k-th layer:
              ∇_{b^{(k)}} J = g + λ ∇_{b^{(k)}} Ω(θ)     ▷ ∂J/∂b_i = Σ_j (∂J/∂a_j)(∂a_j/∂b_i) = f_i − y_i
          Gradient on the weights of the k-th layer:
              ∇_{W^{(k)}} J = g h^{(k−1)}(x)^T + λ ∇_{W^{(k)}} Ω(θ)
                                                         ▷ ∂J/∂W_ij = Σ_k (∂J/∂a_k)(∂a_k/∂W_ij) = (f_j − y_j) h_i
          Propagate the gradients to the next lower level activations:
              g ← ∇_{h^{(k−1)}(x)} J = W^{(k)T} g        ▷ ∂J/∂h_i = Σ_k (∂J/∂a_k)(∂a_k/∂h_i) = Σ_j W_ij (f_j − y_j)

  (The ▷ annotations show the component-wise derivations at the output layer, for a softmax output with cross-entropy loss.)
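A sketch of this procedure for a one-hidden-layer network with sigmoid hidden units and a softmax output, checked against a finite-difference gradient. The sizes, inputs, and λ = 0 (no regularizer) are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a1 = b1 + W1 @ x
    h1 = sigmoid(a1)
    a2 = b2 + W2 @ h1
    e = np.exp(a2 - a2.max())
    return h1, e / e.sum()

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of J = -log f(x)_y, following the slide's procedure."""
    h1, f = forward(x, W1, b1, W2, b2)
    g = f.copy(); g[y] -= 1.0            # output pre-activation grad: f - y
    dW2, db2 = np.outer(g, h1), g        # grads on weights/bias of layer 2
    g = W2.T @ g                         # propagate to h^(1)
    g = g * h1 * (1 - h1)                # g ⊙ g'(a^(1)) for sigmoid units
    dW1, db1 = np.outer(g, x), g
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (4, 3)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (2, 4)), np.zeros(2)
x, y = np.array([0.5, -1.0, 2.0]), 1

dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2)

# finite-difference check on one weight of W1
eps, (i, j) = 1e-6, (2, 1)
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps; Wm[i, j] -= eps
Jp = -np.log(forward(x, Wp, b1, W2, b2)[1][y])
Jm = -np.log(forward(x, Wm, b1, W2, b2)[1][y])
diff = abs(dW1[i, j] - (Jp - Jm) / (2 * eps))
print(diff)  # close to 0: analytic and numeric gradients agree
```
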
Backward propagation

- Each node computes the gradient of the loss with respect to the node's term.
- We first calculate the gradients of the upper nodes.
- Allows automatic differentiation.
Activation function

Sigmoid activation function

- Partial derivative:

  g′(a) = g(a)(1 − g(a))

  g(x) = 1 / (1 + exp(−x))

[Plot: sigmoid curve over x ∈ [−4, 4]]
Activation function

Rectified linear activation function

- Partial derivative:

  g′(a) = 1_{a>0}

  g(x) = max(0, x)

[Plot: ReLU over x ∈ [−4, 4]]
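The two derivatives above can be sketched and the sigmoid case verified numerically (the test point is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    # g'(a) = g(a)(1 - g(a))
    s = sigmoid(a)
    return s * (1 - s)

def relu_prime(a):
    # g'(a) = 1 if a > 0 else 0
    return (a > 0).astype(float)

# verify the sigmoid derivative against a central finite difference
a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(abs(sigmoid_prime(a) - numeric))      # close to 0
print(relu_prime(np.array([-2.0, 0.5])))    # [0., 1.]
```
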
Regularization

- L2 regularization:

  Ω(θ) = Σ_k Σ_i Σ_j (W^{(k)}_{i,j})²

- Gradient:

  ∇_{W^{(k)}} Ω(θ) = 2 W^{(k)}
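A minimal sketch of the penalty and its per-layer gradient (the weight values are illustrative):

```python
import numpy as np

def l2_penalty(weights):
    # Omega(theta) = sum over layers k of sum_ij (W^(k)_ij)^2
    return sum((W ** 2).sum() for W in weights)

def l2_grad(W):
    # per-layer gradient: 2 * W^(k)
    return 2.0 * W

W = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l2_penalty([W]))  # 1 + 4 + 0.25 + 0 = 5.25
print(l2_grad(W))
```
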
Applications

- Develop statistical models that can discover underlying structure, semantic relations, and constraints from data.
- Applications:
  - Speech Recognition
  - Computer Vision
  - Recommendation Systems
  - Language Understanding
  - Robotics