
Approximate Inference: Sampling Methods (2)

Piyush Rai

Probabilistic Machine Learning (CS772A)

Oct 3, 2017

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 1
Sampling Methods: Recap

Any probability distribution p(z) can be (approximately) represented using a set of samples

Samples can come from p(z), or from some proposal distribution if p(z) is a difficult distribution

Given a set of samples $\{z^{(\ell)}\}_{\ell=1}^{L}$, the sample-based approximation of p(z) can be written as

$$p(z) \approx \frac{1}{L}\sum_{\ell=1}^{L} \mathbb{I}(z = z^{(\ell)}) \qquad \text{or} \qquad p(z) \approx \frac{1}{L}\sum_{\ell=1}^{L} \delta_{z^{(\ell)}}(z)$$
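The approximation above is easy to exercise numerically: once we have samples, probabilities under p(z) become sample fractions and expectations become sample averages. A minimal NumPy sketch (the standard-Gaussian target and the sample size are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
# Draw L samples z^(l) ~ p(z), here p(z) = N(0, 1)
z = rng.standard_normal(L)

# Under the empirical measure (1/L) sum_l delta_{z^(l)},
# probabilities become sample fractions, expectations become sample means.
prob_neg = np.mean(z < 0)          # approximates P(z < 0) = 0.5
mean_est = np.mean(z)              # approximates E[z] = 0
second_moment = np.mean(z ** 2)    # approximates E[z^2] = 1

print(prob_neg, mean_est, second_moment)
```

All three quantities converge to their true values at the usual $O(1/\sqrt{L})$ Monte Carlo rate.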
Sampling Methods: Recap

Looked at some basic methods for generating samples from a probability distribution
Transformation-based methods
Rejection sampling

Also looked at using the samples to approximate difficult-to-compute expectations

Monte Carlo Sampling: $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim p(z)$

Importance Sampling (1): $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})\,\frac{p(z^{(\ell)})}{q(z^{(\ell)})}$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim q(z)$

Importance Sampling (2): $\mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)\,dz \approx \frac{Z_q}{Z_p}\,\frac{1}{L}\sum_{\ell=1}^{L} f(z^{(\ell)})\,\frac{\tilde{p}(z^{(\ell)})}{\tilde{q}(z^{(\ell)})}$ where $\{z^{(\ell)}\}_{\ell=1}^{L} \sim q(z)$

[Note: I.S. (1) assumes p(z) can be evaluated at any z; I.S. (2) assumes $p(z) = \tilde{p}(z)/Z_p$ can only be evaluated up to a proportionality constant]
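The two importance-sampling estimators can be sketched as follows. The Gaussian target/proposal pair and $f(z) = z^2$ are illustrative choices; for the unnormalized case, the sketch uses the common practical variant in which the unknown ratio $Z_q/Z_p$ is itself estimated from the same draws by self-normalizing the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 200_000
f = lambda z: z ** 2

# Target p(z) = N(0, 1); proposal q(z) = N(0, 2^2)
z = rng.normal(0.0, 2.0, L)
p = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-z**2 / 8) / np.sqrt(8 * np.pi)

# I.S. (1): p and q fully evaluable
est1 = np.mean(f(z) * p / q)

# I.S. (2) style: only an unnormalized p_tilde is available; the unknown
# normalization constant cancels when the weights are self-normalized.
p_tilde = np.exp(-z**2 / 2)          # p(z) up to a constant
w = p_tilde / q
est2 = np.sum(w * f(z)) / np.sum(w)

print(est1, est2)  # both should be close to E_p[z^2] = 1
```

The proposal is deliberately wider than the target so the weights $p/q$ stay bounded; a narrower proposal would make the estimator high-variance.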
Limitations of Basic Sampling Methods

Transformation-based methods: Usually limited to drawing from standard distributions

Rejection Sampling and Importance Sampling: Require good proposal distributions

Difficult to find good proposal distributions, especially when z is high-dimensional (e.g., models with many parameters)
In high dimensions, most of the mass of p(z) is concentrated in a tiny region of the z space
Difficult to know a priori where those regions are, thus difficult to come up with a good proposal distribution
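The high-dimensional difficulty can be quantified in a simple special case (a standard back-of-the-envelope calculation, not from the slides). If the target is $\mathcal{N}(0, I_D)$ and the rejection-sampling proposal is $\mathcal{N}(0, \sigma^2 I_D)$ with $\sigma > 1$, the tightest envelope constant is $M = \sup_z p(z)/q(z) = \sigma^D$, so the expected acceptance rate $1/M$ decays exponentially with dimension ($\sigma = 1.2$ below is an arbitrary choice):

```python
# Rejection-sample N(0, I_D) using proposal N(0, sigma^2 I_D), sigma > 1.
# The tightest envelope constant is M = sup_z p(z)/q(z) = sigma^D
# (the supremum is attained at z = 0), so the expected acceptance
# rate 1/M = sigma^(-D) decays exponentially in the dimension D.
sigma = 1.2
rates = {D: sigma ** (-D) for D in (1, 10, 100, 1000)}
print(rates)
```

Even a proposal that is only 20% too wide per dimension is essentially useless by D = 100.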
Markov Chain Monte Carlo (MCMC) Methods

Markov Chain Monte Carlo (MCMC)

Goal: Generate samples from some target distribution $p(z) = \tilde{p}(z)/Z$, where z is high-dimensional
Will again assume that we can evaluate p(z) at least up to a proportionality constant

Basic idea in MCMC: Use a Markov chain to generate samples from p(z)

$$z^{(1)} \rightarrow z^{(2)} \rightarrow \ldots \rightarrow z^{(L)}$$

How: Given a current sample $z^{(\ell)}$ from the chain, generate the next sample $z^{(\ell+1)}$ as follows
Use a proposal distribution $q(z \mid z^{(\ell)})$ to generate a candidate sample $z^\star$
Accept/reject $z^\star$ as the next sample based on an acceptance criterion (will see later)
If accepted, $z^{(\ell+1)} = z^\star$. If rejected, $z^{(\ell+1)} = z^{(\ell)}$

If $q(z \mid z^{(\ell)})$ has certain properties, the Markov chain's stationary distribution will be p(z)
Informally, the stationary distribution is the distribution the chain will eventually reach
Markov Chain

Consider a sequence of random variables $z^{(1)}, \ldots, z^{(L)}$

A first-order Markov chain assumes

$$p(z^{(\ell+1)} \mid z^{(1)}, \ldots, z^{(\ell)}) = p(z^{(\ell+1)} \mid z^{(\ell)}) \quad \forall \ell$$

A first-order Markov chain can be defined by the following

An initial state distribution $p(z^{(0)})$
Transition probabilities $T_\ell(z^{(\ell)}, z^{(\ell+1)}) = p(z^{(\ell+1)} \mid z^{(\ell)})$: a distribution over the possible values of $z^{(\ell+1)}$

Homogeneous Markov Chain: Transition probabilities $T_\ell = T$ (same everywhere along the chain)
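These two ingredients are straightforward to simulate. The sketch below (a hypothetical 3-state chain; the numbers are arbitrary) draws a trajectory from a homogeneous first-order chain using one fixed transition matrix T:

```python
import numpy as np

rng = np.random.default_rng(0)

# A homogeneous first-order Markov chain on K = 3 states:
# one initial distribution p(z^(0)) and one transition matrix T
# (T[i, j] = p(z^(l+1) = j | z^(l) = i)) reused at every step.
p0 = np.array([1.0, 0.0, 0.0])
T = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

state = rng.choice(3, p=p0)
counts = np.zeros(3)
for _ in range(50_000):
    state = rng.choice(3, p=T[state])   # first-order: depends only on the current state
    counts[state] += 1
freq = counts / counts.sum()
print(freq)
```

The empirical state frequencies `freq` approximately satisfy `freq @ T == freq`, anticipating the stationary-distribution discussion that follows.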
Some Properties
Consider the discrete case where z has K possible states (and T is a matrix of size $K \times K$)
Assume the graph representing the possible state transitions is irreducible and aperiodic
Ergodic Property: Under the above assumptions, for any choice of an initial probability vector $v$,

$$v\,T^m \rightarrow p^* \quad \text{as} \quad m \rightarrow \infty$$

where the probability vector $p^*$ represents the invariant or stationary distribution of the chain
Why do we need the graph to be irreducible and aperiodic?
Irreducible: No disjoint sets of nodes; we can reach any state from any state
Aperiodic: No cycles in the graph (otherwise the chain would oscillate forever). Consider this example:

$$v = [1/5,\ 4/5] \qquad T = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

Multiplying v by T repeatedly leads to oscillating values without convergence
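Both behaviors can be checked by repeated multiplication with T. The sketch below uses an arbitrary irreducible, aperiodic 2-state chain to show that $vT^m$ converges to the same $p^*$ from two different starting vectors, and then the periodic example above to show the oscillation:

```python
import numpy as np

# Irreducible, aperiodic chain: v T^m converges to the same p* for any v.
T = np.array([[0.5, 0.5],
              [0.2, 0.8]])
v1 = np.array([1.0, 0.0])
v2 = np.array([0.1, 0.9])
for _ in range(200):
    v1, v2 = v1 @ T, v2 @ T
print(v1, v2)  # both converge to the stationary distribution p* = [2/7, 5/7]

# The periodic counterexample from the slide: T swaps the two states,
# so v T^m oscillates between [1/5, 4/5] and [4/5, 1/5] forever.
T_per = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
v = np.array([0.2, 0.8])
print(v @ T_per, v @ T_per @ T_per)
```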
Some Properties
Note that, for the ergodic case, $p^* T = p^*$. Therefore $p^*$ is the left eigenvector of T with eigenvalue 1
For discrete-valued z with K = 5 possible states, $p^* = [p^*_1, \ldots, p^*_5]$, and we can write

$$\sum_{i=1}^{5} p^*_i\, T_{ij} = p^*_j$$

For the continuous-z case, we can equivalently write (for any two state values z and z')

$$\int p^*(z')\,T(z', z)\,dz' = p^*(z)$$

Suppose a Markov chain with transition probabilities T satisfies

$$p^*(z)\,T(z, z') = p^*(z')\,T(z', z) \qquad \text{(Detailed Balance)}$$

Integrating both sides w.r.t. z' gives $\int p^*(z')\,T(z', z)\,dz' = p^*(z)$ (i.e., the ergodic property)

Thus a Markov chain with detailed balance will always converge to a stationary distribution $p^*(z)$
Homogeneous Markov chains satisfy the detailed balance/ergodic property under mild conditions
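One way to see detailed balance concretely is to construct a kernel that satisfies it by design and verify both properties numerically. The sketch below uses a discrete "propose a state uniformly, accept with probability $\min(1, p^*_j/p^*_i)$" rule (a discrete Metropolis kernel, foreshadowing the MH algorithm later in these slides); $p^* = [0.2, 0.3, 0.5]$ is an arbitrary choice:

```python
import numpy as np

# Build a transition matrix that satisfies detailed balance w.r.t. a
# chosen p*: propose j uniformly, accept with probability min(1, p*_j / p*_i),
# and put the leftover probability mass on staying at i.
p_star = np.array([0.2, 0.3, 0.5])
K = len(p_star)
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = (1.0 / K) * min(1.0, p_star[j] / p_star[i])
    T[i, i] = 1.0 - T[i].sum()

balance = p_star[:, None] * T           # balance[i, j] = p*_i T(i, j)
print(np.allclose(balance, balance.T))  # detailed balance holds
print(p_star @ T)                       # equals p*: stationarity follows
```

The second print confirms the integration argument above in matrix form: summing the detailed-balance identity over the first index recovers $p^* T = p^*$.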
MCMC: The Basic Scheme

Running the MCMC chain infinitely long gives us ONE sample from the target distribution

But we usually require several samples to approximate the distribution. How do we get those?

Start at an initial $z^{(0)}$. Using a proposal distribution $p(z^{(\ell+1)} \mid z^{(\ell)})$, run the chain long enough, say $T_1$ steps
Discard the first $T_1 - 1$ samples (called burn-in samples) and take the last sample $z^{(T_1)}$
Continue from $z^{(T_1)}$ up to $T_2$ steps, discard the intermediate samples, and take the last sample $z^{(T_2)}$
This helps ensure that $z^{(T_1)}$ and $z^{(T_2)}$ are uncorrelated

Repeat the same for a total of S times

In the end, we have S i.i.d. samples from p(z), i.e., $z^{(T_1)}, z^{(T_2)}, \ldots, z^{(T_S)} \sim p(z)$
Note: Good choices for $T_1$ and $T_i - T_{i-1}$ are usually based on heuristics
Note: MCMC is an approximate method because we don't usually know how large $T_1$ needs to be
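The burn-in-then-thin scheme above can be sketched on a small chain whose stationary distribution is known in closed form (the 2-state chain, burn-in length, and gap below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, T):
    """One Markov-chain transition using transition matrix T."""
    return rng.choice(len(T), p=T[state])

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # stationary distribution p* = [2/3, 1/3]

burn_in, gap, S = 1000, 50, 2000
state = 0
for _ in range(burn_in):      # burn-in: discard the first T1 - 1 samples
    state = step(state, T)
samples = []
for _ in range(S):            # keep z^(T1), z^(T2), ... with Ti - Ti-1 = gap
    for _ in range(gap):      # thinning: discard intermediate, correlated samples
        state = step(state, T)
    samples.append(state)

frac = np.mean(np.array(samples) == 0)
print(frac)  # close to p*_0 = 2/3
```

With a gap this large relative to the chain's mixing time, the kept samples are nearly uncorrelated, so the state-0 fraction matches the stationary probability.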
Some MCMC Sampling Algorithms

Metropolis-Hastings (MH) Sampling
Assume a proposal distribution q(z|z ( ) ), e.g., N (z|z ( ) , 2 ID )
In each step, draw z q(z|z ( ) ) and accept the sample z with probability

p(z )q(z ( ) |z )
 
A(z , z ( ) ) = min 1,
p(z ( ) )q(z |z ( ) )
The acceptance probability makes intuitive sense. Note the kind of z would it favor/unfavor:
It favors accepting z if p(z ) has a higher value than p(z ( ) )
Unfavors z if the proposal distribution q unduly favors it (i.e., if q(z |z ( ) ) is large)
Favors z if we can reverse to z ( ) from z (i.e., if q(z ( ) |z ) is large). Needed for good mixing

Transition probability of the Markov chain in MH sampling: T (z, z ( ) ) = A(z , z ( ) )q(z|z ( ) )


Exercise: Show that T (z, z ( ) ) satisfies the detailed balance property

T (z, z ( ) )p(z) = T (z ( ) , z)p(z ( ) )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 13
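The acceptance rule above can be sketched in a few lines of code. The sampler below is an illustrative sketch rather than the lecture's code: it uses a symmetric Gaussian random-walk proposal (so the q-terms in the ratio cancel) and a standard 1-D Gaussian target; the names `mh_sample` and `log_p` are my own.

```python
import math
import random

def mh_sample(log_p, z0, sigma=1.0, num_samples=1000):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    Because the proposal N(z* | z, sigma^2) is symmetric, the q-terms in the
    acceptance ratio cancel, leaving A = min(1, p(z*)/p(z))."""
    z = z0
    samples = []
    for _ in range(num_samples):
        z_star = z + random.gauss(0.0, sigma)        # draw z* ~ q(z | z^(tau))
        # Accept with probability min(1, p(z*)/p(z)), computed in log space
        if random.random() < math.exp(min(0.0, log_p(z_star) - log_p(z))):
            z = z_star
        samples.append(z)                            # on rejection, repeat old z
    return samples

# Toy target: standard 1-D Gaussian, log p(z) = -z^2/2 + const
random.seed(0)
samples = mh_sample(lambda z: -0.5 * z * z, z0=0.0, num_samples=20000)
mean = sum(samples) / len(samples)
```

Note that only an unnormalized density is needed, since the normalizing constant cancels in the ratio, and working with log p avoids numerical underflow.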
MH Sampling in Action: A Toy Example..

Target p(z) = N( [4, 4]^T, [1 2; 3 4] ),  Proposal q(z^(t) | z^(t−1)) = N( z^(t−1), [0.01 0; 0 0.01] )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 14
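A run like this toy example can be sketched as below, keeping the slide's tiny proposal variance of 0.01. Since a covariance matrix must be symmetric positive-definite, the target covariance here is an illustrative stand-in (not the slide's numbers), and all names are my own assumptions.

```python
import math
import random

# Illustrative symmetric target covariance (a valid covariance must be
# symmetric positive-definite; these entries are NOT from the slide):
#   Sigma = [[1.0, 0.8], [0.8, 2.0]],  mean mu = (4, 4)
MU = (4.0, 4.0)
DET = 1.0 * 2.0 - 0.8 * 0.8                  # det(Sigma) = 1.36
INV = (2.0 / DET, -0.8 / DET, 1.0 / DET)     # Sigma^-1 entries [0,0], [0,1], [1,1]

def log_p(z1, z2):
    """Unnormalized log density of N(MU, Sigma)."""
    d1, d2 = z1 - MU[0], z2 - MU[1]
    return -0.5 * (INV[0] * d1 * d1 + 2 * INV[1] * d1 * d2 + INV[2] * d2 * d2)

def mh_2d(step_sd=0.1, num_steps=5000, seed=0):
    """Random-walk MH with proposal N(z^(t-1), step_sd^2 * I).

    Step variance 0.01 (as on the slide) means step_sd = 0.1: the chain takes
    tiny steps and needs many iterations to travel from its start to the mode."""
    random.seed(seed)
    z1, z2 = 0.0, 0.0                        # start far from the mode (4, 4)
    chain, accepts = [], 0
    for _ in range(num_steps):
        p1 = z1 + random.gauss(0.0, step_sd)
        p2 = z2 + random.gauss(0.0, step_sd)
        if random.random() < math.exp(min(0.0, log_p(p1, p2) - log_p(z1, z2))):
            z1, z2, accepts = p1, p2, accepts + 1
        chain.append((z1, z2))
    return chain, accepts / num_steps

chain, acc_rate = mh_2d()
```

With such small steps the acceptance rate is very high, but the chain explores the target slowly via a random walk, which is exactly the convergence issue raised on the next slide.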
MH Sampling: Some Comments

Special case: If the proposal distribution is symmetric, i.e., q(z* | z^(τ)) = q(z^(τ) | z*), we get the Metropolis sampling algorithm with

A(z*, z^(τ)) = min{ 1, p(z*) / p(z^(τ)) }

Limitation: MH can have very slow convergence (due to its random-walk exploration)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 15
Gibbs Sampling (Geman & Geman, 1984)

Suppose we wish to sample from a joint distribution p(z) where z = (z1, z2, ..., zM)

However, suppose we can't sample from p(z) directly but can sample from each conditional p(zi | z−i)
This can be done easily if we have a locally conjugate model (e.g., Gaussian matrix factorization)

Gibbs sampling uses the conditionals p(zi | z−i) as the proposal distribution

Gibbs sampling samples from these conditionals in a cyclic order

Gibbs sampling is equivalent to Metropolis-Hastings sampling with acceptance probability = 1:

A(z*, z) = [p(z*) q(z | z*)] / [p(z) q(z* | z)] = [p(zi* | z*−i) p(z*−i) p(zi | z*−i)] / [p(zi | z−i) p(z−i) p(zi* | z−i)] = 1

where we use the fact that z*−i = z−i

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 16
Gibbs Sampling: Sketch of the Algorithm

M: total number of variables, T: number of Gibbs sampling steps

Note: When sampling each variable from its conditional posterior, we use the most recent values of all the other variables (this is akin to a coordinate-ascent-like procedure)

Note: The order of updating the variables usually doesn't matter (but see "Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much" from NIPS 2016)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 17
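The sketch above can be written as one generic loop. Everything below is an illustrative sketch rather than the lecture's code; in particular, `sample_conditional` is a model-specific placeholder that must return one draw from p(zi | z−i).

```python
import random

def gibbs(sample_conditional, z_init, num_steps):
    """Generic Gibbs sampler over M variables.

    sample_conditional(i, z) must return one draw from p(z_i | z_{-i});
    it is a model-specific placeholder. Each variable is updated using
    the most recent values of all the other variables (the coordinate-
    ascent-like behavior noted on the slide)."""
    z = list(z_init)
    samples = []
    for _ in range(num_steps):        # T Gibbs sweeps
        for i in range(len(z)):       # cycle through the M conditionals
            z[i] = sample_conditional(i, z)
        samples.append(list(z))
    return samples

# Trivial check: with independent N(0, 1) conditionals (which ignore the
# other variables), the sampler just produces i.i.d. Gaussian vectors.
random.seed(0)
samples = gibbs(lambda i, z: random.gauss(0.0, 1.0), z_init=[0.0] * 3, num_steps=100)
```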
Gibbs Sampling: A Simple Example
Can sample from a 2-D Gaussian using 1-D Gaussians (recall that if the joint distribution is a 2-D
Gaussian, conditionals will simply be 1-D Gaussians)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 18
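A minimal sketch of this example, under the illustrative assumption of a zero-mean bivariate Gaussian with unit variances and correlation ρ = 0.8 (the slide's actual parameters are not given in the text): the 1-D conditionals are then N(ρ · z_other, 1 − ρ²).

```python
import random

def gibbs_bivariate_gaussian(rho, num_steps=20000, seed=0):
    """Gibbs sampling from a zero-mean bivariate Gaussian with unit variances
    and correlation rho, using its 1-D Gaussian conditionals:
        z1 | z2 ~ N(rho * z2, 1 - rho^2)
        z2 | z1 ~ N(rho * z1, 1 - rho^2)"""
    random.seed(seed)
    sd = (1.0 - rho * rho) ** 0.5
    z1, z2 = 0.0, 0.0
    samples = []
    for _ in range(num_steps):
        z1 = random.gauss(rho * z2, sd)   # draw z1 from p(z1 | z2)
        z2 = random.gauss(rho * z1, sd)   # draw z2 from p(z2 | z1), using new z1
        samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8)

# Empirical covariance of the draws should be close to rho = 0.8
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
cov = sum((a - m1) * (b - m2) for a, b in samples) / n
```

Note that each conditional draw uses the most recent value of the other variable, exactly as in the algorithm sketch on the previous slide.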
Next Class..

More examples of Gibbs sampling

Random-walk avoiding MCMC methods
Using MCMC: pros and cons
Some recent advances in MCMC

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Approximate Inference: Sampling Methods (2) 19
