
Approximate Inference: Sampling Methods (2)

Piyush Rai

Probabilistic Machine Learning (CS772A)

Oct 3, 2017

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 1
Sampling Methods: Recap

Any probability distribution p(z) can be (approximately) represented using a set of samples

Samples can come from p(z), or from some proposal distribution if p(z) is a difficult distribution

Given a set of samples $\{z^{(\ell)}\}_{\ell=1}^{L}$, the sample-based approximation of p(z) can be written as

$$p(z) \approx \frac{1}{L}\sum_{\ell=1}^{L} \mathbb{I}(z = z^{(\ell)}) \qquad \text{or} \qquad p(z) \approx \frac{1}{L}\sum_{\ell=1}^{L} \delta_{z^{(\ell)}}(z)$$
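The approximation above is easy to exercise numerically: once we have samples, probabilities under p(z) become sample fractions and expectations become sample averages. A minimal NumPy sketch (the standard-Gaussian target and the sample size are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
# Draw L samples z^(l) ~ p(z), here p(z) = N(0, 1)
z = rng.standard_normal(L)

# Under the empirical measure (1/L) sum_l delta_{z^(l)},
# probabilities become sample fractions, expectations become sample means.
prob_neg = np.mean(z < 0)          # approximates P(z < 0) = 0.5
mean_est = np.mean(z)              # approximates E[z] = 0
second_moment = np.mean(z ** 2)    # approximates E[z^2] = 1

print(prob_neg, mean_est, second_moment)
```

All three quantities converge to their true values at the usual $O(1/\sqrt{L})$ Monte Carlo rate.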
Sampling Methods: Recap

Looked at some basic methods for generating samples from a probability distribution
Transformation-based methods
Rejection sampling

Also looked at using the samples to approximate difficult-to-compute expectations

Monte Carlo Sampling: $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim p(z)$

Importance Sampling (1): $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})\,\frac{p(z^{(\ell)})}{q(z^{(\ell)})}$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim q(z)$

Importance Sampling (2): $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{Z_q}{Z_p}\,\frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})\,\frac{\tilde{p}(z^{(\ell)})}{\tilde{q}(z^{(\ell)})}$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim q(z)$

[Note: I.S. (1) assumes p(z) can be evaluated at any z; I.S. (2) assumes $p(z) = \tilde{p}(z)/Z_p$ can only be evaluated up to a proportionality constant]
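The two importance-sampling estimators can be sketched as follows. The Gaussian target/proposal pair and $f(z) = z^2$ are illustrative choices; for the unnormalized case, the sketch uses the common practical variant in which the unknown ratio $Z_q/Z_p$ is itself estimated from the same draws by self-normalizing the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 200_000
f = lambda z: z ** 2

# Target p(z) = N(0, 1); proposal q(z) = N(0, 2^2)
z = rng.normal(0.0, 2.0, L)
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-z**2 / 8) / np.sqrt(8 * np.pi)

# I.S. (1): p and q fully evaluable
est1 = np.mean(f(z) * p / q)

# I.S. (2) style: only an unnormalized p_tilde is available; the unknown
# normalization constant cancels when the weights are self-normalized.
p_tilde = np.exp(-z**2 / 2)          # p(z) up to a constant
w = p_tilde / q
est2 = np.sum(w * f(z)) / np.sum(w)

print(est1, est2)  # both should be close to E_p[z^2] = 1
```

The proposal is deliberately wider than the target so the weights $p/q$ stay bounded; a narrower proposal would make the estimator high-variance.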
Limitations of Basic Sampling Methods

Transformation-based methods: Usually limited to drawing from standard distributions

Rejection Sampling and Importance Sampling: Require good proposal distributions

Difficult to find good proposal distributions, especially when z is high-dimensional (e.g., models with many parameters)
In high dimensions, most of the mass of p(z) is concentrated in a tiny region of the z space
Difficult to know a priori where those regions are, thus difficult to come up with a good proposal distribution
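The high-dimensional difficulty can be quantified in a simple special case (a standard back-of-the-envelope calculation, not from the slides). If the target is $\mathcal{N}(0, I_D)$ and the rejection-sampling proposal is $\mathcal{N}(0, \sigma^2 I_D)$ with $\sigma > 1$, the tightest envelope constant is $M = \sup_z p(z)/q(z) = \sigma^D$, so the expected acceptance rate $1/M$ decays exponentially with dimension ($\sigma = 1.2$ below is an arbitrary choice):

```python
# Rejection-sample N(0, I_D) using proposal N(0, sigma^2 I_D), sigma > 1.
# The tightest envelope constant is M = sup_z p(z)/q(z) = sigma^D
# (the supremum is attained at z = 0), so the expected acceptance
# rate 1/M = sigma^(-D) decays exponentially in the dimension D.
sigma = 1.2
rates = {D: sigma ** (-D) for D in (1, 10, 100, 1000)}
print(rates)
```

Even a proposal that is only 20% too wide per dimension is essentially useless by D = 100.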
Markov Chain Monte Carlo (MCMC) Methods

Markov Chain Monte Carlo (MCMC)

Goal: Generate samples from some target distribution $p(z) = \tilde{p}(z)/Z$, where z is high-dimensional
Will again assume that we can evaluate p(z) at least up to a proportionality constant

Basic idea in MCMC: Use a Markov chain to generate samples from p(z)

$$z^{(1)} \rightarrow z^{(2)} \rightarrow \ldots \rightarrow z^{(L)}$$

How: Given a current sample $z^{(\ell)}$ from the chain, generate the next sample $z^{(\ell+1)}$ as follows
Use a proposal distribution $q(z \mid z^{(\ell)})$ to generate a candidate sample $z^\star$
Accept/reject $z^\star$ as the next sample based on an acceptance criterion (will see later)
If accepted, $z^{(\ell+1)} = z^\star$. If rejected, $z^{(\ell+1)} = z^{(\ell)}$

If $q(z \mid z^{(\ell)})$ has certain properties, the Markov chain's stationary distribution will be p(z)
Informally, the stationary distribution is the distribution the chain will eventually reach
Markov Chain

Consider a sequence of random variables $z^{(1)}, \ldots, z^{(L)}$

A first-order Markov chain assumes

$$p(z^{(\ell+1)} \mid z^{(1)}, \ldots, z^{(\ell)}) = p(z^{(\ell+1)} \mid z^{(\ell)}) \quad \forall \ell$$

A first-order Markov chain can be defined by the following

An initial state distribution $p(z^{(0)})$
Transition probabilities $T_\ell(z^{(\ell)}, z^{(\ell+1)}) = p(z^{(\ell+1)} \mid z^{(\ell)})$: a distribution over the possible values of $z^{(\ell+1)}$

Homogeneous Markov Chain: Transition probabilities $T_\ell = T$ (same everywhere along the chain)
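These two ingredients are straightforward to simulate. The sketch below (a hypothetical 3-state chain; the numbers are arbitrary) draws a trajectory from a homogeneous first-order chain using one fixed transition matrix T:

```python
import numpy as np

rng = np.random.default_rng(0)

# A homogeneous first-order Markov chain on K = 3 states:
# one initial distribution p(z^(0)) and one transition matrix T
# (T[i, j] = p(z^(l+1) = j | z^(l) = i)) reused at every step.
p0 = np.array([1.0, 0.0, 0.0])
T = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

state = rng.choice(3, p=p0)
counts = np.zeros(3)
for _ in range(50_000):
    state = rng.choice(3, p=T[state])   # first-order: depends only on the current state
    counts[state] += 1
freq = counts / counts.sum()
print(freq)
```

The empirical state frequencies `freq` approximately satisfy `freq @ T == freq`, anticipating the stationary-distribution discussion that follows.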
Some Properties
Consider the discrete case where z has K possible states (and T is a matrix of size $K \times K$)
Assume the graph representing the possible state transitions is irreducible and aperiodic
Ergodic Property: Under the above assumptions, for any choice of an initial probability vector $v$,

$$v\,T^m \rightarrow p^* \quad \text{as} \quad m \rightarrow \infty$$

where the probability vector $p^*$ represents the invariant or stationary distribution of the chain
Why do we need the graph to be irreducible and aperiodic?
Irreducible: No disjoint sets of nodes; we can reach any state from any state
Aperiodic: No cycles in the graph (otherwise the chain would oscillate forever). Consider this example:

$$v = [1/5,\ 4/5] \qquad T = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

Multiplying v by T repeatedly leads to oscillating values without convergence
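Both behaviors can be checked by repeated multiplication with T. The sketch below uses an arbitrary irreducible, aperiodic 2-state chain to show that $vT^m$ converges to the same $p^*$ from two different starting vectors, and then the periodic example above to show the oscillation:

```python
import numpy as np

# Irreducible, aperiodic chain: v T^m converges to the same p* for any v.
T = np.array([[0.5, 0.5],
              [0.2, 0.8]])
v1 = np.array([1.0, 0.0])
v2 = np.array([0.1, 0.9])
for _ in range(200):
    v1, v2 = v1 @ T, v2 @ T
print(v1, v2)  # both converge to the stationary distribution p* = [2/7, 5/7]

# The periodic counterexample from the slide: T swaps the two states,
# so v T^m oscillates between [1/5, 4/5] and [4/5, 1/5] forever.
T_per = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
v = np.array([0.2, 0.8])
print(v @ T_per, v @ T_per @ T_per)
```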
Some Properties
Note that, for the ergodic case, $p^* T = p^*$. Therefore $p^*$ is the left eigenvector of T with eigenvalue 1
For discrete-valued z with K = 5 possible states, $p^* = [p^*_1, \ldots, p^*_5]$, and we can write

$$\sum_{i=1}^{5} p^*_i\, T_{ij} = p^*_j$$

For the continuous-z case, we can equivalently write (for any two state values z and z')

$$\int p^*(z')\,T(z', z)\,dz' = p^*(z)$$

Suppose a Markov chain with transition probabilities T satisfies

$$p^*(z)\,T(z, z') = p^*(z')\,T(z', z) \qquad \text{(Detailed Balance)}$$

Integrating both sides w.r.t. z' gives $\int p^*(z')\,T(z', z)\,dz' = p^*(z)$ (i.e., the ergodic property)

Thus a Markov chain with detailed balance will always converge to a stationary distribution $p^*(z)$
Homogeneous Markov chains satisfy the detailed balance/ergodic property under mild conditions
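One way to see detailed balance concretely is to construct a kernel that satisfies it by design and verify both properties numerically. The sketch below uses a discrete "propose a state uniformly, accept with probability $\min(1, p^*_j/p^*_i)$" rule (a discrete Metropolis kernel, foreshadowing the MH algorithm later in these slides); $p^* = [0.2, 0.3, 0.5]$ is an arbitrary choice:

```python
import numpy as np

# Build a transition matrix that satisfies detailed balance w.r.t. a
# chosen p*: propose j uniformly, accept with probability min(1, p*_j / p*_i),
# and put the leftover probability mass on staying at i.
p_star = np.array([0.2, 0.3, 0.5])
K = len(p_star)
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = (1.0 / K) * min(1.0, p_star[j] / p_star[i])
    T[i, i] = 1.0 - T[i].sum()

balance = p_star[:, None] * T           # balance[i, j] = p*_i T(i, j)
print(np.allclose(balance, balance.T))  # detailed balance holds
print(p_star @ T)                       # equals p*: stationarity follows
```

The second print confirms the integration argument above in matrix form: summing the detailed-balance identity over the first index recovers $p^* T = p^*$.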
MCMC: The Basic Scheme

Running the MCMC chain infinitely long gives us ONE sample from the target distribution

But we usually require several samples to approximate the distribution. How do we get those?

Start at an initial $z^{(0)}$. Using a proposal distribution $p(z^{(\ell+1)} \mid z^{(\ell)})$, run the chain long enough, say $T_1$ steps
Discard the first $T_1 - 1$ samples (called burn-in samples) and take the last sample $z^{(T_1)}$
Continue from $z^{(T_1)}$ up to $T_2$ steps, discard the intermediate samples, and take the last sample $z^{(T_2)}$
This helps ensure that $z^{(T_1)}$ and $z^{(T_2)}$ are uncorrelated

Repeat the same for a total of S times

In the end, we have S i.i.d. samples from p(z), i.e., $z^{(T_1)}, z^{(T_2)}, \ldots, z^{(T_S)} \sim p(z)$
Note: Good choices for $T_1$ and $T_i - T_{i-1}$ are usually based on heuristics
Note: MCMC is an approximate method because we don't usually know how large $T_1$ needs to be
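The burn-in-then-thin scheme above can be sketched on a small chain whose stationary distribution is known in closed form (the 2-state chain, burn-in length, and gap below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, T):
    """One Markov-chain transition using transition matrix T."""
    return rng.choice(len(T), p=T[state])

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # stationary distribution p* = [2/3, 1/3]

burn_in, gap, S = 1000, 50, 2000
state = 0
for _ in range(burn_in):      # burn-in: discard the first T1 - 1 samples
    state = step(state, T)
samples = []
for _ in range(S):            # keep z^(T1), z^(T2), ... with Ti - Ti-1 = gap
    for _ in range(gap):      # thinning: discard intermediate, correlated samples
        state = step(state, T)
    samples.append(state)

frac = np.mean(np.array(samples) == 0)
print(frac)  # close to p*_0 = 2/3
```

With a gap this large relative to the chain's mixing time, the kept samples are nearly uncorrelated, so the state-0 fraction matches the stationary probability.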
Some MCMC Sampling Algorithms

Metropolis-Hastings (MH) Sampling
Assume a proposal distribution q(z|z ( ) ), e.g., N (z|z ( ) , 2 ID )
In each step, draw z q(z|z ( ) ) and accept the sample z with probability

p(z )q(z ( ) |z )
 
A(z , z ( ) ) = min 1,
p(z ( ) )q(z |z ( ) )
The acceptance probability makes intuitive sense. Note the kind of z would it favor/unfavor:
It favors accepting z if p(z ) has a higher value than p(z ( ) )
Unfavors z if the proposal distribution q unduly favors it (i.e., if q(z |z ( ) ) is large)
Favors z if we can reverse to z ( ) from z (i.e., if q(z ( ) |z ) is large). Needed for good mixing

Transition probability of the Markov chain in MH sampling: T (z, z ( ) ) = A(z , z ( ) )q(z|z ( ) )


Exercise: Show that T (z, z ( ) ) satisfies the detailed balance property

T (z, z ( ) )p(z) = T (z ( ) , z)p(z ( ) )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 13
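The acceptance rule above can be sketched in a few lines of code. The sampler below is an illustrative sketch rather than the lecture's code: it uses a symmetric Gaussian random-walk proposal (so the q-terms in the ratio cancel) and a standard 1-D Gaussian target; the names `mh_sample` and `log_p` are my own.

```python
import math
import random

def mh_sample(log_p, z0, sigma=1.0, num_samples=1000):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    Because the proposal N(z* | z, sigma^2) is symmetric, the q-terms in the
    acceptance ratio cancel, leaving A = min(1, p(z*)/p(z))."""
    z = z0
    samples = []
    for _ in range(num_samples):
        z_star = z + random.gauss(0.0, sigma)        # draw z* ~ q(z | z^(tau))
        # Accept with probability min(1, p(z*)/p(z)), computed in log space
        if random.random() < math.exp(min(0.0, log_p(z_star) - log_p(z))):
            z = z_star
        samples.append(z)                            # on rejection, repeat old z
    return samples

# Toy target: standard 1-D Gaussian, log p(z) = -z^2/2 + const
random.seed(0)
samples = mh_sample(lambda z: -0.5 * z * z, z0=0.0, num_samples=20000)
mean = sum(samples) / len(samples)
```

Note that only an unnormalized density is needed, since the normalizing constant cancels in the ratio, and working with log p avoids numerical underflow.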
MH Sampling in Action: A Toy Example..

Target p(z) = N( [4, 4]^T, [1 2; 3 4] ),  Proposal q(z^(t) | z^(t−1)) = N( z^(t−1), [0.01 0; 0 0.01] )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 14
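A run like this toy example can be sketched as below, keeping the slide's tiny proposal variance of 0.01. Since a covariance matrix must be symmetric positive-definite, the target covariance here is an illustrative stand-in (not the slide's numbers), and all names are my own assumptions.

```python
import math
import random

# Illustrative symmetric target covariance (a valid covariance must be
# symmetric positive-definite; these entries are NOT from the slide):
#   Sigma = [[1.0, 0.8], [0.8, 2.0]],  mean mu = (4, 4)
MU = (4.0, 4.0)
DET = 1.0 * 2.0 - 0.8 * 0.8                  # det(Sigma) = 1.36
INV = (2.0 / DET, -0.8 / DET, 1.0 / DET)     # Sigma^-1 entries [0,0], [0,1], [1,1]

def log_p(z1, z2):
    """Unnormalized log density of N(MU, Sigma)."""
    d1, d2 = z1 - MU[0], z2 - MU[1]
    return -0.5 * (INV[0] * d1 * d1 + 2 * INV[1] * d1 * d2 + INV[2] * d2 * d2)

def mh_2d(step_sd=0.1, num_steps=5000, seed=0):
    """Random-walk MH with proposal N(z^(t-1), step_sd^2 * I).

    Step variance 0.01 (as on the slide) means step_sd = 0.1: the chain takes
    tiny steps and needs many iterations to travel from its start to the mode."""
    random.seed(seed)
    z1, z2 = 0.0, 0.0                        # start far from the mode (4, 4)
    chain, accepts = [], 0
    for _ in range(num_steps):
        p1 = z1 + random.gauss(0.0, step_sd)
        p2 = z2 + random.gauss(0.0, step_sd)
        if random.random() < math.exp(min(0.0, log_p(p1, p2) - log_p(z1, z2))):
            z1, z2, accepts = p1, p2, accepts + 1
        chain.append((z1, z2))
    return chain, accepts / num_steps

chain, acc_rate = mh_2d()
```

With such small steps the acceptance rate is very high, but the chain explores the target slowly via a random walk, which is exactly the convergence issue raised on the next slide.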
MH Sampling: Some Comments

Special case: If the proposal distribution is symmetric, i.e., q(z* | z^(τ)) = q(z^(τ) | z*), we get the Metropolis sampling algorithm with

A(z*, z^(τ)) = min{ 1, p(z*) / p(z^(τ)) }

Limitation: MH can have very slow convergence (due to its random-walk exploration)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 15
Gibbs Sampling (Geman & Geman, 1984)

Suppose we wish to sample from a joint distribution p(z) where z = (z1, z2, ..., zM)

However, suppose we can't sample from p(z) directly but can sample from each conditional p(zi | z−i)
This can be done easily if we have a locally conjugate model (e.g., Gaussian matrix factorization)

Gibbs sampling uses the conditionals p(zi | z−i) as the proposal distribution

Gibbs sampling samples from these conditionals in a cyclic order

Gibbs sampling is equivalent to Metropolis-Hastings sampling with acceptance probability = 1:

A(z*, z) = [p(z*) q(z | z*)] / [p(z) q(z* | z)] = [p(zi* | z*−i) p(z*−i) p(zi | z*−i)] / [p(zi | z−i) p(z−i) p(zi* | z−i)] = 1

where we use the fact that z*−i = z−i

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 16
Gibbs Sampling: Sketch of the Algorithm

M: total number of variables, T: number of Gibbs sampling steps

Note: When sampling each variable from its conditional posterior, we use the most recent values of all the other variables (this is akin to a coordinate-ascent-like procedure)

Note: The order of updating the variables usually doesn't matter (but see "Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much" from NIPS 2016)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 17
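The sketch above can be written as one generic loop. Everything below is an illustrative sketch rather than the lecture's code; in particular, `sample_conditional` is a model-specific placeholder that must return one draw from p(zi | z−i).

```python
import random

def gibbs(sample_conditional, z_init, num_steps):
    """Generic Gibbs sampler over M variables.

    sample_conditional(i, z) must return one draw from p(z_i | z_{-i});
    it is a model-specific placeholder. Each variable is updated using
    the most recent values of all the other variables (the coordinate-
    ascent-like behavior noted on the slide)."""
    z = list(z_init)
    samples = []
    for _ in range(num_steps):        # T Gibbs sweeps
        for i in range(len(z)):       # cycle through the M conditionals
            z[i] = sample_conditional(i, z)
        samples.append(list(z))
    return samples

# Trivial check: with independent N(0, 1) conditionals (which ignore the
# other variables), the sampler just produces i.i.d. Gaussian vectors.
random.seed(0)
samples = gibbs(lambda i, z: random.gauss(0.0, 1.0), z_init=[0.0] * 3, num_steps=100)
```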
Gibbs Sampling: A Simple Example
Can sample from a 2-D Gaussian using 1-D Gaussians (recall that if the joint distribution is a 2-D
Gaussian, conditionals will simply be 1-D Gaussians)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 18
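A minimal sketch of this example, under the illustrative assumption of a zero-mean bivariate Gaussian with unit variances and correlation ρ = 0.8 (the slide's actual parameters are not given in the text): the 1-D conditionals are then N(ρ · z_other, 1 − ρ²).

```python
import random

def gibbs_bivariate_gaussian(rho, num_steps=20000, seed=0):
    """Gibbs sampling from a zero-mean bivariate Gaussian with unit variances
    and correlation rho, using its 1-D Gaussian conditionals:
        z1 | z2 ~ N(rho * z2, 1 - rho^2)
        z2 | z1 ~ N(rho * z1, 1 - rho^2)"""
    random.seed(seed)
    sd = (1.0 - rho * rho) ** 0.5
    z1, z2 = 0.0, 0.0
    samples = []
    for _ in range(num_steps):
        z1 = random.gauss(rho * z2, sd)   # draw z1 from p(z1 | z2)
        z2 = random.gauss(rho * z1, sd)   # draw z2 from p(z2 | z1), using new z1
        samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8)

# Empirical covariance of the draws should be close to rho = 0.8
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
cov = sum((a - m1) * (b - m2) for a, b in samples) / n
```

Note that each conditional draw uses the most recent value of the other variable, exactly as in the algorithm sketch on the previous slide.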
Next Class..

More examples of Gibbs sampling

Random-walk avoiding MCMC methods
Using MCMC: pros and cons
Some recent advances in MCMC

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 19
