M. Soleymani
Sharif University of Technology
Fall 2017
Boltzmann Machine
How a causal model generates data
• In a causal model we generate data in
two sequential steps:
– First pick the hidden states from p(h).
– Then pick the visible states from p(v|h).
This slide has been adapted from Hinton's lectures, "Neural Networks for Machine Learning", Coursera, 2015.
How a Boltzmann Machine generates data
• It is not a causal generative model.
$$E(v, h) = -v^T W h - c^T v - b^T h = -\sum_{i,j} w_{ij} v_i h_j - \sum_i c_i v_i - \sum_j b_j h_j$$
The learnable parameters are the bias vectors $c$ and $b$ for $v$ and $h$, and the matrix $W$, which models the interaction between them.
Restricted Boltzmann Machines
All hidden units are conditionally independent given the visible units
and vice versa.
Restricted Boltzmann Machines
RBM probabilities:
$$p(v|h) = \prod_i p(v_i|h), \qquad p(h|v) = \prod_j p(h_j|v)$$
$$p(v_i = 1 \mid h) = \sigma\left(W_{i,:}\, h + c_i\right), \qquad p(h_j = 1 \mid v) = \sigma\left(W_{:,j}^T\, v + b_j\right)$$
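As a concrete illustration of these factorized conditionals, here is a minimal NumPy sketch; the function names and the $D \times K$ shape convention for $W$ are assumptions, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes: v has D units, h has K units, W is D x K,
# c (visible biases) has length D, b (hidden biases) has length K.
def p_h_given_v(v, W, b):
    """Vector of p(h_j = 1 | v) for all hidden units at once."""
    return sigmoid(W.T @ v + b)

def p_v_given_h(h, W, c):
    """Vector of p(v_i = 1 | h) for all visible units at once."""
    return sigmoid(W @ h + c)

def sample_bernoulli(p, rng):
    """Draw a binary sample with the given per-unit probabilities."""
    return (rng.random(p.shape) < p).astype(p.dtype)
```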
Probabilistic Analog of Autoencoder
• An RBM can be seen as the probabilistic analog of an autoencoder: $p(h|v)$ plays the role of the encoder and $p(v|h)$ the role of the decoder.

[Figures: autoencoder diagram; RBM on image input $v$; learned features on MNIST]
Marginal distribution

$$p(v) = \sum_h p(v, h) = \frac{\sum_h \exp\left(-E(v, h)\right)}{Z}$$

[Figures: diagrams of the RBM with the hidden units being summed out]
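Although $Z$ itself is intractable, the sum over $h$ in the numerator factorizes over the hidden units, so the unnormalized marginal can be evaluated in closed form. A minimal NumPy sketch under the conventions used above ($W$ of size $D \times K$; the function name is an assumption):

```python
import numpy as np

def log_unnormalized_marginal(v, W, b, c):
    """log sum_h exp(-E(v, h)) for a binary RBM.

    The sum over h factorizes over hidden units:
    log p~(v) = c^T v + sum_j log(1 + exp(b_j + W[:, j]^T v)).
    """
    return c @ v + np.sum(np.logaddexp(0.0, W.T @ v + b))
```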
Model Learning
RBM learning: Stochastic gradient descent
$$\frac{\partial}{\partial \theta} \log P(v^{(n)}) = \frac{\partial}{\partial \theta} \log \sum_h \exp\left(v^{(n)T} W h + c^T v^{(n)} + b^T h\right) - \frac{\partial}{\partial \theta} \log Z$$

The first term is the positive phase; the second is the negative phase, with
$$Z = \sum_v \sum_h \exp\left(v^T W h + c^T v + b^T h\right)$$
Positive phase
𝜕 (𝑛) 𝑇
log exp 𝑣 𝑊ℎ + 𝑐 𝑇 𝑣 (𝑛) + 𝑏 𝑇 ℎ
𝜕𝑊
ℎ
𝜕 (𝑛) 𝑇
ℎ exp 𝑣 𝑊ℎ + 𝑐 𝑇 𝑣 (𝑛) + 𝑏 𝑇 ℎ
= 𝜕𝑊 𝑇
exp 𝑣 (𝑛) 𝑊ℎ + 𝑐 𝑇 𝑣 (𝑛) + 𝑏 𝑇 ℎ
ℎ
ℎ𝑣 𝑛 𝑇 exp 𝑣 (𝑛) 𝑇
𝑊ℎ + 𝑐 𝑇 𝑣 (𝑛) + 𝑏 𝑇 ℎ
ℎ
= 𝑇
ℎ exp 𝑣 (𝑛) 𝑊ℎ + 𝑐 𝑇 𝑣 (𝑛) + 𝑏 𝑇 ℎ
= 𝐸ℎ~𝑝 ℎ𝑣 𝑛 𝑇
𝑣 𝑛 ,ℎ
RBM learning: Stochastic gradient descent
Maximize the log-likelihood with respect to the parameters:

$$\frac{\partial}{\partial W_{ij}} \log P(v^{(n)}) = \mathbb{E}\left[v_i h_j \mid v = v^{(n)}\right] - \mathbb{E}\left[v_i h_j\right]$$

$$\frac{\partial}{\partial b_j} \log P(v^{(n)}) = \mathbb{E}\left[h_j \mid v = v^{(n)}\right] - \mathbb{E}\left[h_j\right]$$

$$\frac{\partial}{\partial c_i} \log P(v^{(n)}) = \mathbb{E}\left[v_i \mid v = v^{(n)}\right] - \mathbb{E}\left[v_i\right]$$
RBM learning: Stochastic gradient descent
$$\frac{\partial}{\partial W_{ij}} \log P(v^{(n)}) = \mathbb{E}\left[v_i h_j \mid v = v^{(n)}\right] - \mathbb{E}\left[v_i h_j\right]$$

Positive statistic:
$$\mathbb{E}\left[v_i h_j \mid v = v^{(n)}\right] = \mathbb{E}\left[h_j \mid v = v^{(n)}\right] v_i^{(n)} = \frac{v_i^{(n)}}{1 + \exp\left(-\sum_k W_{kj} v_k^{(n)} - b_j\right)}$$

Estimating the negative statistic $\mathbb{E}[v_i h_j]$ can be done by starting at any random state of the visible units and performing Gibbs sampling for a very long time.
Block-Gibbs MCMC

Initialize $v^0 = v$
Sample $h^0 \sim P(h \mid v^0)$
For $t = 1, \dots, T$:
  Sample $v^t \sim P(v \mid h^{t-1})$
  Sample $h^t \sim P(h \mid v^t)$
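A direct transcription of this pseudocode, reusing the helper functions from the earlier sketch (names are assumptions):

```python
import numpy as np

def block_gibbs(v, W, b, c, T, rng):
    """Run T sweeps of block Gibbs sampling in a binary RBM.

    Each sweep resamples all hidden units given the visibles,
    then all visible units given the hiddens, which is valid
    because each conditional factorizes across units.
    """
    h = sample_bernoulli(p_h_given_v(v, W, b), rng)
    for _ in range(T):
        v = sample_bernoulli(p_v_given_h(h, W, c), rng)
        h = sample_bernoulli(p_h_given_v(v, W, b), rng)
    return v, h
```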
Negative statistic

$$\mathbb{E}[v_i h_j] \approx \frac{1}{M} \sum_{m=1}^{M} v_i^{(m)} h_j^{(m)}, \qquad \left(v^{(m)}, h^{(m)}\right) \sim P(v, h)$$

• Initializing $N$ independent Markov chains, each at a data point, and running them until convergence:

$$\mathbb{E}[v_i h_j] \approx \frac{1}{N} \sum_{n=1}^{N} v_i^{(n,T)} h_j^{(n,T)}$$

$$v^{(n,0)} = v^{(n)}$$
$$h^{(n,k)} \sim P(h \mid v = v^{(n,k)}) \quad \text{for } k \geq 0$$
$$v^{(n,k)} \sim P(v \mid h = h^{(n,k-1)}) \quad \text{for } k \geq 1$$
Contrastive Divergence

• Rather than running the Gibbs chain to convergence, initialize it at the training sample $v^{(n)}$ and run only $k$ steps:
$$v^{(n)} \rightarrow v^1 \rightarrow \cdots \rightarrow v^k = \tilde{v}$$
• Each step uses the factorized RBM conditionals:
$$p(v|h) = \prod_i p(v_i|h), \qquad p(h|v) = \prod_j p(h_j|v)$$
$$p(v_i = 1 \mid h) = \sigma\left(W_{i,:}\, h + c_i\right), \qquad p(h_j = 1 \mid v) = \sigma\left(W_{:,j}^T\, v + b_j\right)$$
CD-k Algorithm
• CD-k: contrastive divergence with k iterations of Gibbs sampling
• In general, the bigger k is, the less biased the estimate of the gradient will be
• In practice, k=1 works well for learning good features and for pre-training
[Figure: RBM inference via block-Gibbs MCMC]
CD-k Algorithm
• Repeat until the stopping criterion is met:
  – For each training sample $v^{(n)}$:
    • Generate a negative sample $\tilde{v}$ using $k$ steps of Gibbs sampling starting at $v^{(n)}$
    • Update the model parameters:
$$W \leftarrow W + \alpha \frac{\partial \log p(v^{(n)})}{\partial W} = W + \alpha\left(v^{(n)}\, \hat{h}(v^{(n)})^T - \tilde{v}\, \hat{h}(\tilde{v})^T\right)$$
$$b \leftarrow b + \alpha\left(\hat{h}(v^{(n)}) - \hat{h}(\tilde{v})\right)$$
$$c \leftarrow c + \alpha\left(v^{(n)} - \tilde{v}\right)$$
where $\hat{h}(v)$ denotes the vector of conditional probabilities $p(h_j = 1 \mid v)$.
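A minimal sketch of one CD-k update in NumPy, reusing the helpers from the earlier sketches; the names are assumptions, not the lecture's reference code:

```python
import numpy as np

def cd_k_update(v_data, W, b, c, k, alpha, rng):
    """One CD-k parameter update from a single training sample.

    Convention: W is D x K, c biases the visibles, b the hiddens.
    """
    # Positive phase: hidden probabilities at the data point.
    h_data = p_h_given_v(v_data, W, b)

    # Negative phase: k steps of Gibbs sampling started at v_data.
    v_model = v_data
    for _ in range(k):
        h_sample = sample_bernoulli(p_h_given_v(v_model, W, b), rng)
        v_model = sample_bernoulli(p_v_given_h(h_sample, W, c), rng)
    h_model = p_h_given_v(v_model, W, b)

    # Gradient ascent on log p(v_data).
    W += alpha * (np.outer(v_data, h_data) - np.outer(v_model, h_model))
    b += alpha * (h_data - h_model)
    c += alpha * (v_data - v_model)
    return W, b, c
```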
[Figure: positive phase vs. negative phase in contrastive divergence]
Gaussian Bernoulli RBMs
• Let $v$ represent a real-valued (unbounded) input; add a quadratic term to the energy function:
$$E(v, h) = -v^T W h - c^T v - b^T h + \frac{1}{2} v^T v$$
• In this case $p(v|h)$ becomes a Gaussian distribution with mean $\mu = c + Wh$ and identity covariance matrix
• It is recommended to normalize the training set by:
  – subtracting the mean of each input
  – dividing each input by the training set standard deviation
• A smaller learning rate should be used than in the regular RBM
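A short sketch of the two pieces that change relative to the binary RBM, under the conventions above (function names are hypothetical):

```python
import numpy as np

def normalize(X):
    """Standardize each input dimension over the training set."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8  # guard against zero variance
    return (X - mu) / sigma, mu, sigma

def sample_v_given_h_gaussian(h, W, c, rng):
    """For a Gaussian-Bernoulli RBM, p(v|h) = N(c + W h, I)."""
    mean = c + W @ h
    return mean + rng.standard_normal(mean.shape)
```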
Deep Belief Networks
• One of the first non-convolutional models to successfully admit training of deep architectures (2007)
Pre-training
• We will use a greedy, layer-wise procedure
• Train one layer at a time with an unsupervised criterion
• Fix the parameters of the previous hidden layers
• Previous layers are viewed as feature extractors
Pre-training
• Unsupervised Pre-training
– first layer: find hidden unit features that are more common in training inputs
than in random inputs
– second layer: find combinations of hidden unit features that are more
common than random hidden unit features
– third layer: find combinations of combinations of ...
• First construct a standard RBM with visible layer $v$ and hidden layer $h^1$
• Stack another hidden layer on top of the RBM to form a new RBM
• Fix $W^1$, sample $h^1$ from $p(h^1|v)$ as the input
• Train $W^2$ as an RBM
• Stack a third hidden layer on top of the RBM to form a new RBM
• Fix $W^1, W^2$, sample $h^2$ from $p(h^2|h^1)$ as the input
• Train $W^3$ as an RBM
• Continue… (a code sketch of this greedy stacking is given below)
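A minimal sketch of the greedy stacking loop, reusing the cd_k_update and sigmoid helpers from the earlier sketches (all names and hyperparameters are assumptions):

```python
import numpy as np

def greedy_pretrain(X, layer_sizes, k, alpha, epochs, rng):
    """Greedy layer-wise pre-training of a stack of RBMs.

    X: training data (N x D). layer_sizes: hidden sizes, e.g. [500, 500].
    Each layer is trained with CD-k on the activities of the layer
    below, whose parameters are then frozen.
    """
    rbms = []
    data = X
    for n_hidden in layer_sizes:
        D = data.shape[1]
        W = 0.01 * rng.standard_normal((D, n_hidden))
        b = np.zeros(n_hidden)   # hidden biases
        c = np.zeros(D)          # visible biases
        for _ in range(epochs):
            for v in data:
                W, b, c = cd_k_update(v, W, b, c, k, alpha, rng)
        rbms.append((W, b, c))
        # Freeze this layer; its hidden probabilities become the
        # "visible" data for the next RBM in the stack.
        data = sigmoid(data @ W + b)
    return rbms
```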
Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, 2008.
Deep Belief Network
Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009.
Sampling from DBN

[Figure: generating samples from a deep belief network]
Shallow and Deep architectures

• Modeling high-order and long-range interactions

[Figure: RBM vs. DBM architectures]
Deep Boltzmann Machines

• Conditional distributions remain factorized due to layering.
• For a DBM with two hidden layers $h^1, h^2$:
$$p(h_j^1 = 1 \mid v, h^2) = \sigma\left(\sum_i W_{ij}^1 v_i + \sum_k W_{jk}^2 h_k^2 + b_j^1\right)$$
$$p(h_k^2 = 1 \mid h^1) = \sigma\left(\sum_j W_{jk}^2 h_j^1 + b_k^2\right)$$
$$p(v_i = 1 \mid h^1) = \sigma\left(\sum_j W_{ij}^1 h_j^1 + c_i\right)$$
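Because even- and odd-numbered layers are conditionally independent given each other, Gibbs sampling in a DBM alternates between the two groups. A minimal sketch for a two-hidden-layer DBM, with assumed names and shape conventions:

```python
import numpy as np

def dbm_gibbs_sweep(v, h2, W1, W2, b1, b2, c, rng):
    """One Gibbs sweep in a 2-hidden-layer DBM.

    h1 depends on both v and h2; v and h2 depend only on h1,
    so they can be resampled together. Assumed shapes:
    W1 is D x K1, W2 is K1 x K2.
    """
    # Sample the middle layer given both of its neighbors.
    p_h1 = sigmoid(W1.T @ v + W2 @ h2 + b1)
    h1 = sample_bernoulli(p_h1, rng)
    # v and h2 are conditionally independent given h1.
    v = sample_bernoulli(sigmoid(W1 @ h1 + c), rng)
    h2 = sample_bernoulli(sigmoid(W2.T @ h1 + b2), rng)
    return v, h1, h2
```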
DBM

• Probabilistic
• Generative: DBMs have the potential of learning internal representations that become increasingly complex at higher layers
• Powerful
DBM: Learning
Experiments