
Lecture 4: Inference, Asymptotics & Monte Carlo

August 11, 2018
Outline

1. Posterior Inference
   • Loss functions, predictive inference

2. Posterior Asymptotics

3. Monte Carlo methods
   • Importance sampling
Summarising Posterior Information

The posterior π(θ | x) is a complete summary of the inference about θ.
In some sense π(θ | x) is the inference.

However, for many applications we wish to summarise this information to make a decision.

1. What is the “best” point estimate θ̂ of θ?
   E.g. posterior mean, median, mode, etc.

2. What decision d ∈ {d1, d2, . . .} = D is the optimal choice to make, given knowledge of θ? E.g.
   • How many loaves of bread to bake per day to maximise profit?
     – balancing baking costs and the number of loaves sold/wasted.
   • How high to build sea walls to minimise cost?
     – balancing construction cost, the chance of a breach and the resulting damages.
   • How much local rail infrastructure to build?
     – balancing construction costs versus improvements to the economy and other benefits.
Loss Functions

How do we define “best” or optimal?

Loss Function
For a decision d ∈ D, a loss function

   ℓ(θ, d)

defines the penalty in taking decision d given (fixed) parameter value θ.

• A negative loss is a gain, and is beneficial.
• Sometimes this is expressed as maximising the utility −ℓ(θ, d).

The premise is then:

• Choose the decision d∗ = argmin_d ℓ(θ, d) that minimises the loss.

However, θ is not known, but rather θ ∼ π(θ | x). So alternatively

   d∗ = argmin_d Eπ[ℓ(θ, d)],

i.e. choose the d which minimises the expected posterior loss.
Loss functions

A full (decision-theoretic) Bayesian setup consists of specifying:

• a prior distribution π(θ)
• a model f(x | θ) (leading to the likelihood L(x | θ))
• a loss function ℓ(θ, d),

although most people only consider the first two of these.
Loss functions for estimating θ

We first consider loss functions for parameter estimation:

Given π(θ | x), what is the optimal point estimate of θ? (d = θ̂)

Consider 4 standard loss functions:

• Quadratic loss:
   ℓ(θ, d) = (θ − d)²

• Absolute error loss:
   ℓ(θ, d) = |θ − d|

• 0-1 loss, for a given ε > 0:
   ℓ(θ, d) = 0 if |d − θ| ≤ ε,  1 if |d − θ| > ε

• Linear loss:
   ℓ(θ, d) = α(d − θ) if d > θ,  β(θ − d) if d < θ,
   for given α, β > 0.
Quadratic Loss

   Eπ[ℓ(θ, d)] = ∫ ℓ(θ, d) π(θ | x) dθ
               = ∫ (θ − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x) + E(θ | x) − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x))² π(θ | x) dθ + ∫ (E(θ | x) − d)² π(θ | x) dθ
                 + 2 ∫ (θ − E(θ | x))(E(θ | x) − d) π(θ | x) dθ
               = Var(θ | x) + (E(θ | x) − d)² + 0.

• This is minimised when d = E(θ | x).
• The posterior mean minimises quadratic loss.
• The expected loss at d = E(θ | x) is the posterior variance.

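As a quick numerical check, the following Python sketch uses a hypothetical Normal(2, 1) posterior (not one from the slides) to verify that the posterior mean minimises the Monte Carlo estimate of the expected quadratic loss, and that the minimum equals the posterior variance.

```python
import random

random.seed(1)
# Hypothetical posterior: 100,000 draws from a Normal(2, 1) posterior for theta.
theta = [random.gauss(2.0, 1.0) for _ in range(100_000)]
post_mean = sum(theta) / len(theta)

def expected_quadratic_loss(d):
    # Monte Carlo estimate of E_pi[(theta - d)^2]
    return sum((t - d) ** 2 for t in theta) / len(theta)

losses = [expected_quadratic_loss(d) for d in (post_mean - 0.5, post_mean, post_mean + 0.5)]
# losses[1] (loss at the posterior mean) is smallest, and approximately
# equals the posterior variance (here 1).
```

By the decomposition above, the loss at post_mean ± 0.5 exceeds the loss at the posterior mean by exactly 0.25.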
Linear Loss

Linear loss:

   ℓ(θ, d) = α(d − θ) if d > θ,  β(θ − d) if d < θ,

for given α, β > 0.

Writing X for a random variable with the posterior distribution, for any d we have

   E ℓ(X, d) = α E(d − X)₊ + β E(X − d)₊
             = α E[d − X; X < d] + β E[X − d; X > d]
             = d(α P[X < d] − β P[X > d]) + β E[X; X > d] − α E[X; X < d]
             = d((α + β) P[X < d] − β) − (α + β) E[X; X < d] + β EX.

Let d∗ be the β/(α + β) quantile, that is,

   P[X < d∗] = β/(α + β).
Linear Loss

Then, for any other d, we have

   E ℓ(X, d) − E ℓ(X, d∗)
      = d((α + β) P[X < d] − β) − (α + β)(E[X; X < d] − E[X; X < d∗])
      = (α + β){ d(P[X < d] − P[X < d∗]) + E[X; X < d∗] − E[X; X < d] }.

Hence, if d∗ > d, then

   (E ℓ(X, d) − E ℓ(X, d∗)) / (α + β)
      = E[X; d < X < d∗] − d P[d < X < d∗]
      = P[d < X < d∗] (E[X | d < X < d∗] − d) ≥ 0,

since X > d on the conditioning event. The case d∗ < d is dealt with similarly, so we obtain the following:

So linear loss is minimised at the β/(α + β) posterior quantile.
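A small simulation can be used to check this result. The standard-normal posterior and the loss weights α = 1, β = 3 below are illustrative choices, not values from the slides; the minimiser should be the β/(α + β) = 0.75 posterior quantile.

```python
import random

random.seed(2)
alpha, beta = 1.0, 3.0                      # illustrative loss weights
theta = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))

# d* = the beta/(alpha+beta) = 0.75 posterior quantile
d_star = theta[int(beta / (alpha + beta) * len(theta))]

def expected_linear_loss(d):
    # Monte Carlo estimate of E[alpha*(d - theta)+ + beta*(theta - d)+]
    return sum(alpha * (d - t) if d > t else beta * (t - d) for t in theta) / len(theta)

loss_at_qtile = expected_linear_loss(d_star)
loss_below = expected_linear_loss(d_star - 0.3)
loss_above = expected_linear_loss(d_star + 0.3)
# loss_at_qtile is smaller than both alternatives
```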
Absolute Error Loss

Absolute error loss:

   ℓ(θ, d) = |θ − d|

Absolute error loss = linear loss with α = β = 1.

When α = β = 1, d∗ is the β/(α + β) = 1/2 quantile, i.e. the median of the posterior distribution.

⇒ The posterior median minimises absolute error loss.
0-1 Loss

0-1 loss, for a given ε > 0:

   ℓ(θ, d) = 0 if |d − θ| ≤ ε,  1 if |d − θ| > ε

Here

   E[ℓ(θ, d)] = P(|θ − d| > ε) = 1 − P(|θ − d| ≤ ε).

• |θ − d| ≤ ε defines an interval [d − ε, d + ε] of length 2ε.
• To minimise the expected loss, choose the interval with highest probability, i.e. a high-density region.
• Then d is the mid-point of the interval with highest probability.
• Choosing ε arbitrarily small will select the posterior mode for d.

⇒ The posterior mode minimises 0-1 loss.
Loss functions for making other decisions

Example: Baking loaves of bread

• c = the cost of baking a loaf of bread
• s > c is the price a loaf sells for
• π(d | x) is the (posterior) distribution of demand for bread
• b ∈ {0, 1, . . .} is the decision (i.e. the number of loaves to bake).

We want to maximise expected profit.

Set up the (obvious) loss function:

• If the baker bakes b loaves with demand d then the profit is:

   Profit = (s − c)d − c(b − d)   for b > d
          = (s − c)d              for b = d
          = (s − c)b              for b < d
Loss functions for making other decisions

• Take the profit relative to selling b = d loaves (i.e. subtract (s − c)d from all terms):

   Profit′ = −c(b − d)                              for b > d
           = 0                                      for b = d
           = (s − c)b − (s − c)d = (s − c)(b − d)   for b < d

• Loss = −Profit′:

   ℓ(b, d) = c(b − d)        for b ≥ d  (cost of baking surplus loaves)
           = (s − c)(d − b)  for b < d  (lost profit from not baking enough loaves)

• This is linear loss with α = c and β = s − c.
• Therefore the optimal decision minimising expected posterior loss is the β/(α + β) = (s − c)/s quantile of π(d | x).
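A sketch of this decision with made-up numbers: cost c = 1, sale price s = 4, and a hypothetical demand posterior approximated by rounded Normal(50, 7) draws (none of these values come from the slides). The optimal b is then the (s − c)/s = 0.75 quantile of the demand distribution.

```python
import random

random.seed(3)
c, s = 1.0, 4.0                              # illustrative cost and sale price
# Hypothetical posterior over demand, represented by Monte Carlo draws.
demand = sorted(max(0, round(random.gauss(50, 7))) for _ in range(20_000))

# Optimal decision: the (s - c)/s quantile of pi(d | x).
b_opt = demand[int((s - c) / s * len(demand))]

def expected_loss(b):
    # c*(b - d) if we over-bake, (s - c)*(d - b) if we under-bake
    return sum(c * (b - d) if b >= d else (s - c) * (d - b) for d in demand) / len(demand)

# b_opt should beat nearby alternatives such as b_opt - 5 and b_opt + 5.
```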
Predictive Inference

The previous focus was on parameter estimation (e.g. via loss functions).

There is also common interest in predictions about future observations.

In predictions there are two forms of uncertainty:

1. Uncertainty over the parameter values, which have been estimated based on data x.
2. Uncertainty due to the fact that any future value is itself a random event.

In classical statistics, one typically predicts with f(y | θ̂):
• θ̂ is the MLE – fixed.
• This only accounts for the second source of uncertainty.
• So predictions are more precise than they should be.
• The problem stems from the classical assumption of a single true value of θ.
Predictive Inference

The Bayesian framework allows for both sources of uncertainty by averaging over the uncertainty of the parameter estimates.

The predictive density function of a future observation Y is

   f(y | x) = ∫ f(y | θ, x) π(θ | x) dθ.

While simple to express, this can sometimes be difficult to compute.

Standard conjugate families give tractable forms for the predictive distribution in some cases.

It is also simple to estimate via Monte Carlo methods.
Predictive Inference

Example: Binomial model.

Suppose we have X ∼ Binomial(n, θ) with conjugate prior θ ∼ Beta(a, b). Then we know that

   θ | X = x ∼ Beta(a + x, b + n − x).

Now suppose we intend to make N further observations. Let Y be the number of successes, so Y | θ ∼ Binomial(N, θ). Hence

   f(y | θ) = (N choose y) θ^y (1 − θ)^(N−y).

So for y = 0, 1, . . . , N,

   f(y | x) = ∫₀¹ (N choose y) θ^y (1 − θ)^(N−y) × θ^(a+x−1) (1 − θ)^(b+n−x−1) / B(a + x, b + n − x) dθ
            = (N choose y) B(y + a + x, N − y + b + n − x) / B(a + x, b + n − x).

This is known as the Beta-Binomial distribution.
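The Beta-Binomial predictive pmf can be evaluated directly from the formula above, using log-Gamma functions for numerical stability. The values a = b = 1, n = 10, x = 7, N = 5 below are toy numbers chosen for illustration:

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    # ln B(a, b) via log-Gamma, numerically stable
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, N, a, b):
    # f(y | x) = C(N, y) B(y + a, N - y + b) / B(a, b),
    # where (a, b) are the posterior parameters (a + x, b + n - x).
    return comb(N, y) * exp(log_beta(y + a, N - y + b) - log_beta(a, b))

a_post, b_post = 1 + 7, 1 + 10 - 7   # posterior Beta(8, 4) from a = b = 1, n = 10, x = 7
N = 5
pmf = [beta_binomial_pmf(y, N, a_post, b_post) for y in range(N + 1)]
# pmf sums to 1; its mean is N * a_post / (a_post + b_post) = 10/3
```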
Predictive Inference

Example: Poisson model.

We have X1, . . . , Xn ∼ Poisson(θ) with conjugate prior θ ∼ Gamma(a, b). Then we know that

   θ | X = x ∼ Gamma(a + n x̄n, b + n).

Now Y ∼ Poisson(θ), so that

   f(y | θ) = exp(−θ) θ^y / y!

Hence the predictive distribution is

   f(y | x) = ∫ exp(−θ) (θ^y / y!) × ((b + n)^(a+n x̄n) / Γ(a + n x̄n)) θ^(a+n x̄n − 1) exp(−(b + n)θ) dθ

   (Tute 1) = (y + a + n x̄n − 1 choose y) (1/(b + n + 1))^y (1 − 1/(b + n + 1))^(a+n x̄n),

which is the pmf of a NegBin(a + n x̄n, 1/(b + n + 1)) distribution.
Monte Carlo posterior predictive distributions

How to generate samples from f(y | x)?

   f(y | x) = ∫Θ f(y | θ, x) π(θ | x) dθ

or, in machine learning notation with training data τ: how to generate samples from f(x | τ)?

   f(x | τ) = ∫Θ f(x | θ, τ) π(θ | τ) dθ

By inspection of the formula:
• Obtain posterior samples θ(i) ∼ π(θ | τ).
• For each θ(i) we can generate X(i) ∼ f(x | θ(i)).
• This gives us joint samples (θ(i), X(i)) ∼ f(x | θ, τ) π(θ | τ).
• To obtain samples from f(x | τ), “integrate out” θ (i.e. discard the θ(i) values) to leave only X(i) ∼ f(x | τ).

Hugely simpler than calculating the exact algebraic expression!
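This recipe can be sketched in Python for the Poisson–Gamma model from the previous slides (the prior values and observed counts below are toy data, not from the slides): draw θ(i) from the Gamma posterior, then X(i) ∼ Poisson(θ(i)), and keep only the X(i). Their distribution matches the NegBin predictive, e.g. its mean (a + n x̄)/(b + n).

```python
import random, math

random.seed(4)
a, b = 2.0, 1.0                       # illustrative Gamma(a, b) prior (rate b)
data = [3, 4, 2, 5, 3, 4]             # toy observed counts
a_post, b_post = a + sum(data), b + len(data)

def sample_poisson(lam):
    # Knuth's method; adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

draws = []
for _ in range(50_000):
    theta = random.gammavariate(a_post, 1.0 / b_post)  # posterior draw (shape, scale)
    draws.append(sample_poisson(theta))                # predictive draw; theta is discarded

pred_mean = sum(draws) / len(draws)
# Predictive (NegBin) mean is (a + n*xbar)/(b + n) = 23/7 here
```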
Posterior Asymptotics

What happens to the posterior distribution as n → ∞?

1. Consistency:
   If the “true” value of θ is θ0, and if π(θ0) ≠ 0 (or is non-zero in a neighbourhood of θ0), then with increasing amounts of data (n → ∞) the posterior probability that θ equals (or lies in a neighbourhood of) θ0 → 1.

   Property akin to the classical notion of ‘consistency’.

2. Asymptotic Normality, for θ ∈ R^d:
   As n → ∞,

      π(θ | x) → N(θ0, [I(θ0)]⁻¹/n).

All the following arguments are heuristic/outline proofs.
Posterior Asymptotics

1. Consistency:
Let X1, . . . , Xn ∼ f(x | θ0) be iid observations, and suppose the prior is such that π(θ0) ≠ 0.

Then the posterior is

   π(θ | X_n) ∝ π(θ) ∏_{i=1}^n f(Xi | θ)
             = π(θ) exp( Σ_{i=1}^n ln f(Xi | θ) )
             = π(θ) exp(−n Dn(θ)) ∏_{i=1}^n f(Xi | θ0),

where Dn(θ) = (1/n) Σ_{i=1}^n ln[ f(Xi | θ0) / f(Xi | θ) ].
Posterior Asymptotics

For fixed θ, Dn(θ) is the average of n iid random variables, so it converges in probability to its expectation (law of large numbers):

   E[Dn(θ)] = ∫ f(x | θ0) ln( f(x | θ0) / f(x | θ) ) dx := D(θ).

The RHS is the Kullback-Leibler distance between f(x | θ0) and f(x | θ).

This distance satisfies D(θ) ≥ 0 for all θ, with equality if and only if f(x | θ0) ≡ f(x | θ), which here we assume is the same as θ = θ0 (this assumption is called the identifiability assumption). Hence, for θ ≠ θ0, we obtain D(θ) > 0 and hence exp(−n D(θ)) → 0 as n ↑ ∞. In other words,

   exp(−n Dn(θ)) →P 0 if θ ≠ θ0,  and →P 1 if θ = θ0,  as n ↑ ∞.

Therefore, the posterior spikes at θ0:

   π(θ | X_n) →P 0 for θ ≠ θ0, as n ↑ ∞, with the mass concentrating at θ0.

Hence, the posterior mode, say θ̂n, converges in probability to θ0.
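The consistency result can be illustrated with a conjugate Bernoulli example (flat Beta(1, 1) prior and true θ0 = 0.3, both chosen for illustration): as n grows, the posterior mean approaches θ0 and the posterior variance shrinks to zero.

```python
import random

random.seed(5)
theta0 = 0.3
posterior = {}
for n in (10, 1_000, 100_000):
    s = sum(random.random() < theta0 for _ in range(n))   # Bernoulli(theta0) successes
    a, b = 1 + s, 1 + n - s                               # Beta posterior under flat prior
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    posterior[n] = (mean, var)
# Posterior concentrates: mean -> theta0, variance -> 0.
```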
Posterior Asymptotics (Section 8.4 of Kroese & Chan)

Recall: Taylor series
For a function f(x) that is infinitely differentiable at a,

   f(x) = f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + f‴(a)(x − a)³/3! + · · ·

2. Asymptotic Normality (univariate and continuous θ). Taylor expanding Dn(θ) around θ0 yields:

   Dn(θ) = Dn(θ0) + (θ − θ0) dDn/dθ(θ0) + ((θ − θ0)²/2) d²Dn/dθ²(θ0) + O((θ − θ0)³).

Now: a) we ignore the negligible residual O((θ − θ0)³); b) we note that Dn(θ0) →P D(θ0) = 0; and c) we note that (verify!)

   dDn/dθ(θ0) →P dD/dθ(θ0) = 0.
Posterior Asymptotics

Further, we have d):

   d²Dn/dθ²(θ0) →P −∫ f(x | θ0) [d² ln f(x | θ)/dθ²]|_{θ=θ0} dx := I(θ0),

the last being the definition of Fisher's information (a matrix, in general).

Therefore, as n ↑ ∞,

   Dn(θ) ≈ I(θ0) (θ − θ0)²/2.

Similarly, ln π(θ) = ln π(θ0) + O(θ − θ0). Using all of these results, the posterior is then proportional to (as n ↑ ∞):

   π(θ | X_n) ∝ exp(−n ((θ − θ0)²/2) I(θ0)).

In other words, the posterior converges to the pdf of the

   N(θ0, 1/(n I(θ0)))

distribution.
Posterior Asymptotics

Example: Normal model.
Let X1, . . . , Xn ∼ N(θ, σ²) where σ² is known. As usual, this gives the log-likelihood

   ln f(x | θ) = − Σ_{i=1}^n (xi − θ)²/(2σ²) + c1,

from which we obtain

   d ln f(x | θ)/dθ = (1/σ²) Σ_{i=1}^n (xi − θ)

and so

   d² ln f(x | θ)/dθ² = −n/σ².

The MLE is θ̂ = X̄n and In(θ) = n I(θ) = n/σ². So asymptotically, as n → ∞,

   θ | X_n ∼ N(X̄n, σ²/n).

This is true for any prior distribution which places non-zero probability around the true value of θ.
Likelihood Asymptotics

Consider again the likelihood model X ∼ Bin(n, θ):

   f(x | θ) = (n choose x) θ^x (1 − θ)^(n−x),  x = 0, . . . , n,

thus

   ln f(x | θ) = x ln θ + (n − x) ln(1 − θ) + const,

so

   d ln f(x | θ)/dθ = x/θ − (n − x)/(1 − θ)

and

   d² ln f(x | θ)/dθ² = −x/θ² − (n − x)/(1 − θ)².

Consequently (using EX = nθ),

   In(θ) = nθ/θ² + n(1 − θ)/(1 − θ)² = n/(θ(1 − θ)).

Thus, as n → ∞, we have

   θ | X →d N(θ, θ(1 − θ)/n).
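This approximation can be checked numerically: with a flat prior the exact posterior is Beta(x + 1, n − x + 1), which for large n should be close to N(θ̂, θ̂(1 − θ̂)/n). The values n = 1000, x = 300 below are illustrative.

```python
from math import lgamma, exp, log, pi, sqrt

n, x = 1_000, 300
theta_hat = x / n
sigma2 = theta_hat * (1 - theta_hat) / n     # theta(1 - theta)/n from the Fisher information

def beta_pdf(t, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return exp((a - 1) * log(t) + (b - 1) * log(1 - t) - log_B)

def normal_pdf(t, m, v):
    return exp(-(t - m) ** 2 / (2 * v)) / sqrt(2 * pi * v)

# Compare the two densities at theta_hat and one posterior-sd either side.
points = [theta_hat + k * sqrt(sigma2) for k in (-1, 0, 1)]
max_rel_err = max(
    abs(beta_pdf(t, x + 1, n - x + 1) - normal_pdf(t, theta_hat, sigma2))
    / normal_pdf(t, theta_hat, sigma2)
    for t in points
)
# max_rel_err is small (a few percent), as the asymptotics predict.
```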
Importance Sampling

Importance Sampling Algorithm

Algorithm 1 Importance Sampling
   for i = 1, . . . , N do
      Draw X(i) ∼ g(x)
      Set W(i) ∝ f(X(i)) / g(X(i))

Notes:
• The pairs (X(1), W(1)), . . . , (X(N), W(N)) are weighted samples from f(x).
• The weight is ∝ f(X)/g(X), not f(X)/(K g(X)) (as for rejection sampling), as K is lost in the proportionality.

How does inference work for weighted samples?
Importance Sampling

How does inference work for weighted samples? (Assume ∫ f(x)dx = 1.)

Unweighted expectation:

   E_g[h(X)] = ∫ h(x) g(x) dx ≈ (1/N) Σ_{i=1}^N h(X(i)),

where X(1), . . . , X(N) are samples from g(x).

Weighted expectation: defining weights w(x) = f(x)/g(x), then

   E_g[w(X)h(X)] = ∫ w(x) h(x) g(x) dx ≈ (1/N) Σ_{i=1}^N W(i) h(X(i))
                 = ∫ h(x) f(x) dx
                 = E_f[h(X)],

where X(1), . . . , X(N) are samples from g(x) and W(i) = w(X(i)).

i.e. weighted expectations under g(x) act as expectations under f(x).
Importance Sampling

What if f(x) is unnormalised? We then have f(x) = f̃(x)/Z, where Z = ∫ f̃(x)dx is unknown.

Weighted expectation: defining weights w̃(X) = f̃(X)/g(X), note that

   E_g[w̃(X)] = ∫ w̃(x) g(x) dx = ∫ f̃(x) dx = Z ≈ (1/N) Σ_{i=1}^N W̃(X(i)).

Then

   E_f[h(X)] = ∫ h(x) f(x) dx = (1/Z) ∫ h(x) f̃(x) dx
             = (1/Z) ∫ w̃(x) h(x) g(x) dx = E_g[w̃(X)h(X)] / E_g[w̃(X)]
             ≈ [ (1/N) Σ_{i=1}^N W̃(X(i)) h(X(i)) ] / [ (1/N) Σ_{i=1}^N W̃(X(i)) ]
             = Σ_{i=1}^N W(X(i)) h(X(i)),

where W(X(i)) = W̃(X(i)) / Σ_{j=1}^N W̃(X(j)) and X(1), . . . , X(N) ∼ g(x).

i.e. normalise the weights (to sum to one), then take expectations.

– Note: this is a biased (though consistent) estimator.
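The self-normalised estimator is easy to check on the slides' running example, f̃(x) = x(1 − x)³ on [0, 1] with uniform proposal, for which Z = B(2, 4) = 1/20 and E_f[X] = 1/3:

```python
import random

random.seed(6)
N = 200_000
xs = [random.random() for _ in range(N)]        # X(i) ~ g = U(0, 1)
w_tilde = [x * (1 - x) ** 3 for x in xs]        # unnormalised weights f~(x)/g(x)

Z_hat = sum(w_tilde) / N                        # estimates Z = B(2, 4) = 1/20
total = sum(w_tilde)
W = [w / total for w in w_tilde]                # normalised weights (sum to 1)
est_mean = sum(Wi * x for Wi, x in zip(W, xs))  # self-normalised estimate of E_f[X]
```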
Importance Sampling

Example:
Simulate from the density

   f(x) = 20x(1 − x)³ for 0 ≤ x ≤ 1,  and 0 otherwise.

Use importance sampling with

   g(x) = 1 for 0 ≤ x ≤ 1,  and 0 otherwise

(that is, use the uniform density on [0, 1]).

Previously (Lecture 3) we used rejection sampling with:
• K = 135/64 (if f(x) is normalised as f(x) = 20x(1 − x)³)
• K = 27/256 (if f(x) ∝ x(1 − x)³).
Importance Sampling

[Figure: histogram of the rejection samples, with the true density (red), a density estimate from the rejection samples (blue), and a weighted density estimate from the importance samples (green); legend: True, Rejection Sampling, Importance Sampling.]

L=10000
K=135/64
x=runif(L)
ind=(runif(L)<(20*x*(1-x)^3/K))
hist(x[ind],probability=T,xlab="x",ylab="Density",main="")
xx=seq(0,1,length=100)
lines(xx,20*xx*(1-xx)^3,lwd=2,col=2)
d=density(x[ind],from=0,to=1)
lines(d,col=4)

y=runif(L)
w=20*y*(1-y)^3
wTilde=y*(1-y)^3
W=wTilde/sum(wTilde)
d=density(y,weights=W,from=0,to=1)  # use normalised weights (density() expects weights summing to 1)
lines(d,col=3)

> mean(x[ind])
[1] 0.3353401
> mean(y)
[1] 0.5042156
> mean(w*y)
[1] 0.3334709
> sum(W*y)
[1] 0.3363843
Importance Sampling

[Figure: the importance weights plotted against the sample values y.]

• This shows the density of f(x) only because g(x) is uniform.
• In general the weight plot shows ∝ f(x)/g(x).
Importance Sampling

Note:
• Variability in the weights means some weighted samples contribute more than others in computations.
• Samples with low weights have a small contribution.
• ⇒ for efficiency, we would like all samples to contribute as equally as possible.

Concept: Variability of weights
• If the weights are highly variable then Var(wi) is high (= bad).
• If the weights have low variability then they are all similar values (Var(wi) is small) (= good).
• For best performance, prefer low-variability weights.
Importance Sampling

The variance of the weights is usually measured through the Effective Sample Size (ESS):

   ESS = [ Σ_{i=1}^n (W(i))² ]⁻¹,

where W(i) = W(X(i)) are the normalised weights. Note that

   1 ≤ ESS ≤ n.

• ESS = 1 when W(1) = 1 and W(2) = · · · = W(n) = 0 (sample depletion).
• ESS = n when W(i) = 1/n for all i.
• Loose interpretation: the equivalent number of equally weighted independent samples.

To maximise the ESS, choose g(x) to closely match f(x).
This is the same idea as when improving the efficiency of rejection sampling.
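The ESS formula is a one-liner; the extremes match the bullet points above (the weight vectors below are hypothetical, for illustration only):

```python
def ess(weights):
    # Effective sample size from (possibly unnormalised) weights.
    total = sum(weights)
    W = [w / total for w in weights]          # normalise to sum to one
    return 1.0 / sum(Wi ** 2 for Wi in W)

equal = ess([1.0] * 100)                      # all weights equal -> ESS = n
depleted = ess([1.0] + [0.0] * 99)            # one dominant weight -> ESS = 1
skewed = ess([1.0, 1.0, 1.0, 1.0, 4.0])       # in between
```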
Importance Sampling

Example: Obtain N samples from Beta(2, 2).

[Figure: panels comparing the two proposals below; red = f(x), blue = Kg(x).]

Top: g(x) = Beta(1, 1), eff = 681/1000, ESS = 4172.82 (84%)
Bottom: g(x) = Beta(1.5, 1.5), eff = 848/1000, ESS = 4634.02 (93%)
Importance Sampling

Monte Carlo integration using g(x): in order to estimate the integral Z = ∫ φ(x)dx we can:

• Reparameterise to an integral over U(0, 1), then compute

   ∫ φ(x)dx = ∫ φ′(u)du ≈ (1/N) Σᵢ φ′(U(i)),

  with U(1), . . . , U(N) ∼ U(0, 1) (as before). (Here φ′ denotes the reparameterised integrand, not a derivative.)

• As above, but with U(1), . . . , U(N) ∼ q(u) on (0, 1):

   ∫ φ(x)dx = ∫ φ′(u)du = ∫ (φ′(u)/q(u)) q(u)du ≈ (1/N) Σᵢ φ′(U(i))/q(U(i)),

  where q = U(0, 1) recovers the first method.

• Integrate without transformation to (0, 1):

   ∫ φ(x)dx = ∫ (φ(x)/g(x)) g(x)dx ≈ (1/N) Σᵢ φ(X(i))/g(X(i)),

  where X(1), . . . , X(N) ∼ g(x).

The aim is to choose q (or g) to minimise the variability of φ(X)/q(X) (etc.).
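The third (untransformed) estimator can be sketched in Python for the toy integral Z = ∫₀^∞ x² e^(−x) dx = 2 with proposal g = Exp(1); both the integrand and the proposal are illustrative choices, not from the slides:

```python
import random
from math import exp

random.seed(7)

def phi(x):
    return x * x * exp(-x)     # integrand; true integral over (0, inf) is Gamma(3) = 2

def g_pdf(x):
    return exp(-x)             # Exp(1) proposal density

N = 100_000
draws = [random.expovariate(1.0) for _ in range(N)]   # X(i) ~ g
Z_hat = sum(phi(x) / g_pdf(x) for x in draws) / N     # (1/N) sum phi(X)/g(X)
```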
