Foundations of Deep Learning

Julio Cesar Mendoza Bobadilla

Institute of Computing
University of Campinas

January 11, 2019


Definitions

Representation Learning. "A set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for some task" [1]

Deep Learning. "Representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a higher, slightly more abstract one" [1]

[1] LeCun et al., Deep learning, 2015

Traditional Machine Learning Pipeline

- Traditional machine learning algorithms are based on high-level attributes or features.
- This requires feature engineering.

Representation Learning Pipeline

- Learn features and classifier directly from raw data.
- Also known as end-to-end learning.

Deep Learning

- Learn a hierarchy of representations that build on each other, from simple to complex.

Multilayer Perceptron

- Pre-activation for layer k (with h^{(0)}(x) = x):

  a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k−1)}(x)

- Hidden layer activation:

  h^{(k)}(x) = g(a^{(k)}(x))

- Output layer activation:

  h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)

[Figure: a network mapping x = (x1, x2, x3) through hidden layers h^{(1)}(x) and h^{(2)}(x), with weights W^{(1)}, W^{(2)}, W^{(3)} and biases b^{(1)}, b^{(2)}, b^{(3)}]
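The three equations above can be sketched in NumPy; the layer sizes, random initialization, sigmoid hidden units, and softmax output below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """a^(k) = b^(k) + W^(k) h^(k-1),  h^(k) = g(a^(k)),  h^(0) = x."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(b + W @ h)                  # hidden layers
    a_out = biases[-1] + weights[-1] @ h        # output pre-activation a^(L+1)
    e = np.exp(a_out - a_out.max())             # softmax output o(a^(L+1))
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # input x in R^3, two hidden layers, 2 output classes
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

f_x = forward(np.array([1.0, -0.5, 2.0]), weights, biases)
print(f_x.sum())  # softmax output sums to 1
```
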
Activation function

Sigmoid activation function

- Puts pre-activations between 0 and 1
- Always positive
- Bounded
- Strictly increasing

  g(x) = 1 / (1 + exp(−x))

[Plot: sigmoid curve over x ∈ [−4, 4], values in (0, 1)]
Activation function

Rectified linear activation function

- Puts pre-activations in [0, +∞)
- Always non-negative
- Not upper bounded
- Strictly increasing for x > 0
- Tends to give neurons with sparse activities

  g(x) = max(0, x)

[Plot: ReLU over x ∈ [−4, 4]]
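A minimal sketch of the two activation functions above (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)): bounded in (0, 1), strictly increasing
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # g(x) = max(0, x): non-negative, not upper bounded, sparse activations
    return np.maximum(0.0, x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(x))  # all values strictly inside (0, 1)
print(relu(x))     # negative inputs clamped to 0
```
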
Activation function

Softmax activation function

  softmax(a)_c = exp(a_c) / Σ_{i=1}^{C} exp(a_i)

- Used for multiclass classification.
- Each activation is the conditional probability of the c-th class: p(y = c | x)

  o(a) = softmax(a) = [ exp(a_1)/Σ_{i=1}^{C} exp(a_i), ..., exp(a_C)/Σ_{i=1}^{C} exp(a_i) ]^T

- Strictly positive
- Sums to one
- Predicted class is the activation with highest estimated probability.
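In practice the softmax is usually computed after subtracting max(a) from every entry; the result is unchanged (softmax is invariant to adding a constant) but overflow in exp is avoided. A sketch with illustrative values:

```python
import numpy as np

def softmax(a):
    # subtract max(a) for numerical stability; the output is identical
    # because softmax is invariant to shifting all inputs by a constant
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])  # naive exp(a) would overflow
p = softmax(a)
print(p.sum())          # sums to one
print(int(p.argmax()))  # predicted class: the largest pre-activation
```
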
Forward propagation

- Can be represented as an acyclic flow graph.
- The output of each box can be calculated given its parents.

[2] Hugo Larochelle, Neural Networks, 2017
Optimization

We want to find the x ∈ θ such that f(x) is minimum.

- Suppose we start at point u. Which is the best direction to search for the point v, with f(v) < f(u)?
- If the slope at u is negative, f decreases to the right: go right!
- If the slope at u is positive, f decreases to the left: go left!
- General principle of gradient descent optimization:

  v = u − α ∂f/∂u

  where α is the learning rate.
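The update rule can be sketched in one dimension; the quadratic objective, learning rate, and step count below are illustrative choices:

```python
def gradient_descent(df, u, alpha=0.1, steps=100):
    # v = u - alpha * df/du, repeated: always move against the slope
    for _ in range(steps):
        u = u - alpha * df(u)
    return u

# f(x) = (x - 3)^2 has its minimum at x = 3; df/dx = 2(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), u=-5.0)
print(round(x_min, 4))  # 3.0
```

Starting at u = −5 the slope is negative, so each update moves right toward the minimum at 3.
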
Empirical risk minimization

- Framework to design learning algorithms:

  argmin_θ (1/T) Σ_t L(f(x^{(t)}; θ), y^{(t)}) + λ Ω(θ)

  where the first term is the loss function and Ω(θ) is the regularizer.

- We should optimize the classification error directly, but it is not smooth.
- The loss function is a surrogate for what we truly should optimize.
Stochastic Gradient Descent

- Performs updates after each training example.

  procedure StochasticGradientDescent
      require: network parameters θ = (W^{(1)}, b^{(1)}, ..., W^{(L+1)}, b^{(L+1)})
      initialize θ
      for N epochs do
          for each training sample (x^{(t)}, y^{(t)}) do
              Δ = −∇_θ L(f(x^{(t)}; θ), y^{(t)}) − λ ∇_θ Ω(θ)
              θ = θ + α Δ

- To apply the SGD algorithm we require:
  - The loss function L(f(x^{(t)}; θ), y^{(t)})
  - The regularizer Ω(θ)
  - A procedure to compute the parameter gradients ∇_θ L(f(x^{(t)}; θ), y^{(t)}) and λ ∇_θ Ω(θ)
  - A procedure to initialize the parameters
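A sketch of this procedure on an illustrative model: logistic regression with L2 regularization on synthetic data. The data, learning rate, regularization strength, and epoch count are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy separable problem: y = 1 when x1 + x2 > 0 (illustrative data)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0          # initialize theta
alpha, lam = 0.1, 1e-4           # learning rate, L2 strength

for epoch in range(20):                              # for N epochs
    for x_t, y_t in zip(X, y):                       # for each training sample
        p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))     # f(x^(t); theta)
        g = p - y_t                                  # cross-entropy gradient
        w += alpha * (-(g * x_t) - lam * 2 * w)      # theta = theta + alpha * Delta
        b += alpha * (-g)

acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
print(acc)  # training accuracy, close to 1.0 on this separable data
```
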
Loss function

- On classification problems, neural networks estimate:

  f(x)_c = p(y = c | x)

- Cross-entropy minimizes the negative log-likelihood:

  L(f(x), y) = − Σ_c 1_{(y=c)} log f(x)_c        (1)

- The log is used for numerical stability and mathematical simplicity.


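Because the indicator 1_{(y=c)} selects a single term, equation (1) reduces to −log of the probability assigned to the true class. A sketch with illustrative values:

```python
import numpy as np

def cross_entropy(f_x, y):
    # L(f(x), y) = -sum_c 1(y=c) log f(x)_c = -log f(x)_y
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])   # softmax output over 3 classes
print(cross_entropy(f_x, 1))      # small: high probability on the true class
print(cross_entropy(f_x, 0))      # large: low probability on the true class
```
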
Computing gradients

Performed using the backpropagation algorithm.

- Use the chain rule to efficiently compute gradients from top to bottom.

  procedure Backpropagation
      Gradient of the loss on the output layer:
          g ← ∇_{f(x)} J = ∇_{f(x)} L(f(x), y)          ▷ ∂J/∂f_i = −y_i (1/f_i)
      for k = L, L−1, ..., 1 do
          Gradient on the pre-activation:
              g ← ∇_{a^{(k)}(x)} J = g ⊙ f′(a^{(k)}(x))  ▷ ∂J/∂a_i = Σ_j (∂J/∂f_j)(∂f_j/∂a_i) = f_i − y_i
          Gradient on the bias of the k-th layer:
              ∇_{b^{(k)}} J = g + λ ∇_{b^{(k)}} Ω(θ)     ▷ ∂J/∂b_i = Σ_j (∂J/∂a_j)(∂a_j/∂b_i) = f_i − y_i
          Gradient on the weights of the k-th layer:
              ∇_{W^{(k)}} J = g h^{(k−1)}(x)^T + λ ∇_{W^{(k)}} Ω(θ)
                                                         ▷ ∂J/∂W_ij = Σ_k (∂J/∂a_k)(∂a_k/∂W_ij) = (f_j − y_j) h_i
          Propagate the gradients to the next lower level activations:
              g ← ∇_{h^{(k−1)}(x)} J = W^{(k)T} g        ▷ ∂J/∂h_i = Σ_k (∂J/∂a_k)(∂a_k/∂h_i) = Σ_j W_ij (f_j − y_j)

  (The ▷ annotations show the component-wise derivations at the output layer, for a softmax output with cross-entropy loss.)
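A sketch of this procedure for a one-hidden-layer network with sigmoid hidden units and a softmax output, checked against a finite-difference gradient. The sizes, inputs, and λ = 0 (no regularizer) are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a1 = b1 + W1 @ x
    h1 = sigmoid(a1)
    a2 = b2 + W2 @ h1
    e = np.exp(a2 - a2.max())
    return h1, e / e.sum()

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of J = -log f(x)_y, following the slide's procedure."""
    h1, f = forward(x, W1, b1, W2, b2)
    g = f.copy(); g[y] -= 1.0            # output pre-activation grad: f - y
    dW2, db2 = np.outer(g, h1), g        # grads on weights/bias of layer 2
    g = W2.T @ g                         # propagate to h^(1)
    g = g * h1 * (1 - h1)                # g ⊙ g'(a^(1)) for sigmoid units
    dW1, db1 = np.outer(g, x), g
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (4, 3)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (2, 4)), np.zeros(2)
x, y = np.array([0.5, -1.0, 2.0]), 1

dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2)

# finite-difference check on one weight of W1
eps, (i, j) = 1e-6, (2, 1)
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps; Wm[i, j] -= eps
Jp = -np.log(forward(x, Wp, b1, W2, b2)[1][y])
Jm = -np.log(forward(x, Wm, b1, W2, b2)[1][y])
diff = abs(dW1[i, j] - (Jp - Jm) / (2 * eps))
print(diff)  # close to 0: analytic and numeric gradients agree
```
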
Backward propagation

- Each node computes the gradient of the loss with respect to the node's term.
- We first calculate the gradients of the upper nodes.
- Allows automatic differentiation.
Activation function

Sigmoid activation function

- Partial derivative:

  g′(a) = g(a)(1 − g(a))

  g(x) = 1 / (1 + exp(−x))

[Plot: sigmoid curve over x ∈ [−4, 4]]
Activation function

Rectified linear activation function

- Partial derivative:

  g′(a) = 1_{a>0}

  g(x) = max(0, x)

[Plot: ReLU over x ∈ [−4, 4]]
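The two derivatives above can be sketched and the sigmoid case verified numerically (the test point is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    # g'(a) = g(a)(1 - g(a))
    s = sigmoid(a)
    return s * (1 - s)

def relu_prime(a):
    # g'(a) = 1 if a > 0 else 0
    return (a > 0).astype(float)

# verify the sigmoid derivative against a central finite difference
a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(abs(sigmoid_prime(a) - numeric))      # close to 0
print(relu_prime(np.array([-2.0, 0.5])))    # [0., 1.]
```
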
Regularization

- L2 regularization:

  Ω(θ) = Σ_k Σ_i Σ_j (W^{(k)}_{i,j})²

- Gradient:

  ∇_{W^{(k)}} Ω(θ) = 2 W^{(k)}
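A minimal sketch of the penalty and its per-layer gradient (the weight values are illustrative):

```python
import numpy as np

def l2_penalty(weights):
    # Omega(theta) = sum over layers k of sum_ij (W^(k)_ij)^2
    return sum((W ** 2).sum() for W in weights)

def l2_grad(W):
    # per-layer gradient: 2 * W^(k)
    return 2.0 * W

W = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l2_penalty([W]))  # 1 + 4 + 0.25 + 0 = 5.25
print(l2_grad(W))
```
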
Applications

- Develop statistical models that can discover underlying structure, semantic relations, and constraints from data.
- Applications:
  - Speech Recognition
  - Computer Vision
  - Recommendation Systems
  - Language Understanding
  - Robotics