
Lecture 4: Inference, Asymptotics & Monte Carlo

August 11, 2018
Outline

1. Posterior Inference
   • Loss functions, predictive inference

2. Posterior Asymptotics

3. Monte Carlo methods
   • Importance sampling
Summarising Posterior Information

The posterior π(θ | x) is a complete summary of the inference about θ.
In some sense π(θ | x) is the inference.

However, for many applications we wish to summarise this information to make a decision.

1. What is the “best” point estimate θ̂ of θ?
   E.g. posterior mean, median, mode, etc.

2. What decision d ∈ {d1, d2, . . .} = D is the optimal choice to make, given knowledge of θ? E.g.
   • How many loaves of bread to bake per day to maximise profit?
     – balancing baking costs and the number of loaves sold/wasted.
   • How high to build sea walls to minimise cost?
     – balancing construction cost, the chance of a breach and the resulting damages.
   • How much local rail infrastructure to build?
     – balancing construction costs versus improvements to the economy and other benefits.
Loss Functions

How do we define “best” or optimal?

Loss Function
For a decision d ∈ D, a loss function

   ℓ(θ, d)

defines the penalty in taking decision d given (fixed) parameter value θ.

• A negative loss is a gain, and is beneficial.
• Sometimes this is expressed as maximising the utility −ℓ(θ, d).

The premise is then:

• Choose the decision d∗ = argmin_d ℓ(θ, d) that minimises the loss.

However, θ is not known, but rather θ ∼ π(θ | x). So alternatively

   d∗ = argmin_d Eπ[ℓ(θ, d)],

i.e. choose the d which minimises the expected posterior loss.
Loss functions

A full (decision-theoretic) Bayesian setup consists of specifying:

• a prior distribution π(θ)
• a model f(x | θ) (leading to the likelihood L(x | θ))
• a loss function ℓ(θ, d),

although most people only consider the first two of these.
Loss functions for estimating θ

We first consider loss functions for parameter estimation:

Given π(θ | x), what is the optimal point estimate of θ? (d = θ̂)

Consider 4 standard loss functions:

• Quadratic loss:
   ℓ(θ, d) = (θ − d)²

• Absolute error loss:
   ℓ(θ, d) = |θ − d|

• 0-1 loss, for a given ε > 0:
   ℓ(θ, d) = 0 if |d − θ| ≤ ε,  1 if |d − θ| > ε

• Linear loss:
   ℓ(θ, d) = α(d − θ) if d > θ,  β(θ − d) if d < θ,
   for given α, β > 0.
Quadratic Loss

   Eπ[ℓ(θ, d)] = ∫ ℓ(θ, d) π(θ | x) dθ
               = ∫ (θ − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x) + E(θ | x) − d)² π(θ | x) dθ
               = ∫ (θ − E(θ | x))² π(θ | x) dθ + ∫ (E(θ | x) − d)² π(θ | x) dθ
                 + 2 ∫ (θ − E(θ | x))(E(θ | x) − d) π(θ | x) dθ
               = Var(θ | x) + (E(θ | x) − d)² + 0.

• This is minimised when d = E(θ | x).
• The posterior mean minimises quadratic loss.
• The expected loss at d = E(θ | x) is the posterior variance.

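As a quick numerical check, the following Python sketch uses a hypothetical Normal(2, 1) posterior (not one from the slides) to verify that the posterior mean minimises the Monte Carlo estimate of the expected quadratic loss, and that the minimum equals the posterior variance.

```python
import random

random.seed(1)
# Hypothetical posterior: 100,000 draws from a Normal(2, 1) posterior for theta.
theta = [random.gauss(2.0, 1.0) for _ in range(100_000)]
post_mean = sum(theta) / len(theta)

def expected_quadratic_loss(d):
    # Monte Carlo estimate of E_pi[(theta - d)^2]
    return sum((t - d) ** 2 for t in theta) / len(theta)

losses = [expected_quadratic_loss(d) for d in (post_mean - 0.5, post_mean, post_mean + 0.5)]
# losses[1] (loss at the posterior mean) is smallest, and approximately
# equals the posterior variance (here 1).
```

By the decomposition above, the loss at post_mean ± 0.5 exceeds the loss at the posterior mean by exactly 0.25.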
Linear Loss

Linear loss:

   ℓ(θ, d) = α(d − θ) if d > θ,  β(θ − d) if d < θ,

for given α, β > 0.

Writing X for a random variable with the posterior distribution, for any d we have

   E ℓ(X, d) = α E(d − X)₊ + β E(X − d)₊
             = α E[d − X; X < d] + β E[X − d; X > d]
             = d(α P[X < d] − β P[X > d]) + β E[X; X > d] − α E[X; X < d]
             = d((α + β) P[X < d] − β) − (α + β) E[X; X < d] + β EX.

Let d∗ be the β/(α + β) quantile, that is,

   P[X < d∗] = β/(α + β).
Linear Loss

Then, for any other d, we have

   E ℓ(X, d) − E ℓ(X, d∗)
      = d((α + β) P[X < d] − β) − (α + β)(E[X; X < d] − E[X; X < d∗])
      = (α + β){ d(P[X < d] − P[X < d∗]) + E[X; X < d∗] − E[X; X < d] }.

Hence, if d∗ > d, then

   (E ℓ(X, d) − E ℓ(X, d∗)) / (α + β)
      = E[X; d < X < d∗] − d P[d < X < d∗]
      = P[d < X < d∗] (E[X | d < X < d∗] − d) ≥ 0,

since X > d on the conditioning event. The case d∗ < d is dealt with similarly, so we obtain the following:

So linear loss is minimised at the β/(α + β) posterior quantile.
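A small simulation can be used to check this result. The standard-normal posterior and the loss weights α = 1, β = 3 below are illustrative choices, not values from the slides; the minimiser should be the β/(α + β) = 0.75 posterior quantile.

```python
import random

random.seed(2)
alpha, beta = 1.0, 3.0                      # illustrative loss weights
theta = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))

# d* = the beta/(alpha+beta) = 0.75 posterior quantile
d_star = theta[int(beta / (alpha + beta) * len(theta))]

def expected_linear_loss(d):
    # Monte Carlo estimate of E[alpha*(d - theta)+ + beta*(theta - d)+]
    return sum(alpha * (d - t) if d > t else beta * (t - d) for t in theta) / len(theta)

loss_at_qtile = expected_linear_loss(d_star)
loss_below = expected_linear_loss(d_star - 0.3)
loss_above = expected_linear_loss(d_star + 0.3)
# loss_at_qtile is smaller than both alternatives
```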
Absolute Error Loss

Absolute error loss:

   ℓ(θ, d) = |θ − d|

Absolute error loss = linear loss with α = β = 1.

When α = β = 1, d∗ is the β/(α + β) = 1/2 quantile, i.e. the median of the posterior distribution.

⇒ The posterior median minimises absolute error loss.
0-1 Loss

0-1 loss, for a given ε > 0:

   ℓ(θ, d) = 0 if |d − θ| ≤ ε,  1 if |d − θ| > ε

Here

   E[ℓ(θ, d)] = P(|θ − d| > ε) = 1 − P(|θ − d| ≤ ε).

• |θ − d| ≤ ε defines an interval [d − ε, d + ε] of length 2ε.
• To minimise the expected loss, choose the interval with highest probability, i.e. a high-density region.
• Then d is the mid-point of the interval with highest probability.
• Choosing ε arbitrarily small will select the posterior mode for d.

⇒ The posterior mode minimises 0-1 loss.
Loss functions for making other decisions

Example: Baking loaves of bread

• c = the cost of baking a loaf of bread
• s > c is the price a loaf sells for
• π(d | x) is the (posterior) distribution of demand for bread
• b ∈ {0, 1, . . .} is the decision (i.e. the number of loaves to bake).

We want to maximise expected profit.

Set up the (obvious) loss function:

• If the baker bakes b loaves with demand d then the profit is:

   Profit = (s − c)d − c(b − d)   for b > d
          = (s − c)d              for b = d
          = (s − c)b              for b < d
Loss functions for making other decisions

• Take the profit relative to selling b = d loaves (i.e. subtract (s − c)d from all terms):

   Profit′ = −c(b − d)                              for b > d
           = 0                                      for b = d
           = (s − c)b − (s − c)d = (s − c)(b − d)   for b < d

• Loss = −Profit′:

   ℓ(b, d) = c(b − d)        for b ≥ d  (cost of baking surplus loaves)
           = (s − c)(d − b)  for b < d  (lost profit from not baking enough loaves)

• This is linear loss with α = c and β = s − c.
• Therefore the optimal decision minimising expected posterior loss is the β/(α + β) = (s − c)/s quantile of π(d | x).
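A sketch of this decision with made-up numbers: cost c = 1, sale price s = 4, and a hypothetical demand posterior approximated by rounded Normal(50, 7) draws (none of these values come from the slides). The optimal b is then the (s − c)/s = 0.75 quantile of the demand distribution.

```python
import random

random.seed(3)
c, s = 1.0, 4.0                              # illustrative cost and sale price
# Hypothetical posterior over demand, represented by Monte Carlo draws.
demand = sorted(max(0, round(random.gauss(50, 7))) for _ in range(20_000))

# Optimal decision: the (s - c)/s quantile of pi(d | x).
b_opt = demand[int((s - c) / s * len(demand))]

def expected_loss(b):
    # c*(b - d) if we over-bake, (s - c)*(d - b) if we under-bake
    return sum(c * (b - d) if b >= d else (s - c) * (d - b) for d in demand) / len(demand)

# b_opt should beat nearby alternatives such as b_opt - 5 and b_opt + 5.
```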
Predictive Inference

The previous focus was on parameter estimation (e.g. via loss functions).

There is also common interest in predictions about future observations.

In predictions there are two forms of uncertainty:

1. Uncertainty over the parameter values, which have been estimated based on data x.
2. Uncertainty due to the fact that any future value is itself a random event.

In classical statistics, one typically predicts with f(y | θ̂):
• θ̂ is the MLE – fixed.
• This only accounts for the second source of uncertainty.
• So predictions are more precise than they should be.
• The problem stems from the classical assumption of a single true value of θ.
Predictive Inference

The Bayesian framework allows for both sources of uncertainty by averaging over the uncertainty of the parameter estimates.

The predictive density function of a future observation Y is

   f(y | x) = ∫ f(y | θ, x) π(θ | x) dθ.

While simple to express, this can sometimes be difficult to compute.

Standard conjugate families give tractable forms for the predictive distribution in some cases.

It is also simple to estimate via Monte Carlo methods.
Predictive Inference

Example: Binomial model.

Suppose we have X ∼ Binomial(n, θ) with conjugate prior θ ∼ Beta(a, b). Then we know that

   θ | X = x ∼ Beta(a + x, b + n − x).

Now suppose we intend to make N further observations. Let Y be the number of successes, so Y | θ ∼ Binomial(N, θ). Hence

   f(y | θ) = (N choose y) θ^y (1 − θ)^(N−y).

So for y = 0, 1, . . . , N,

   f(y | x) = ∫₀¹ (N choose y) θ^y (1 − θ)^(N−y) × θ^(a+x−1) (1 − θ)^(b+n−x−1) / B(a + x, b + n − x) dθ
            = (N choose y) B(y + a + x, N − y + b + n − x) / B(a + x, b + n − x).

This is known as the Beta-Binomial distribution.
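The Beta-Binomial predictive pmf can be evaluated directly from the formula above, using log-Gamma functions for numerical stability. The values a = b = 1, n = 10, x = 7, N = 5 below are toy numbers chosen for illustration:

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    # ln B(a, b) via log-Gamma, numerically stable
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, N, a, b):
    # f(y | x) = C(N, y) B(y + a, N - y + b) / B(a, b),
    # where (a, b) are the posterior parameters (a + x, b + n - x).
    return comb(N, y) * exp(log_beta(y + a, N - y + b) - log_beta(a, b))

a_post, b_post = 1 + 7, 1 + 10 - 7   # posterior Beta(8, 4) from a = b = 1, n = 10, x = 7
N = 5
pmf = [beta_binomial_pmf(y, N, a_post, b_post) for y in range(N + 1)]
# pmf sums to 1; its mean is N * a_post / (a_post + b_post) = 10/3
```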
Predictive Inference

Example: Poisson model.

We have X1, . . . , Xn ∼ Poisson(θ) with conjugate prior θ ∼ Gamma(a, b). Then we know that

   θ | X = x ∼ Gamma(a + n x̄n, b + n).

Now Y ∼ Poisson(θ), so that

   f(y | θ) = exp(−θ) θ^y / y!

Hence the predictive distribution is

   f(y | x) = ∫ exp(−θ) (θ^y / y!) × ((b + n)^(a+n x̄n) / Γ(a + n x̄n)) θ^(a+n x̄n − 1) exp(−(b + n)θ) dθ

   (Tute 1) = (y + a + n x̄n − 1 choose y) (1/(b + n + 1))^y (1 − 1/(b + n + 1))^(a+n x̄n),

which is the pmf of a NegBin(a + n x̄n, 1/(b + n + 1)) distribution.
Monte Carlo posterior predictive distributions

How to generate samples from f(y | x)?

   f(y | x) = ∫Θ f(y | θ, x) π(θ | x) dθ

or, in machine learning notation with training data τ: how to generate samples from f(x | τ)?

   f(x | τ) = ∫Θ f(x | θ, τ) π(θ | τ) dθ

By inspection of the formula:
• Obtain posterior samples θ(i) ∼ π(θ | τ).
• For each θ(i) we can generate X(i) ∼ f(x | θ(i)).
• This gives us joint samples (θ(i), X(i)) ∼ f(x | θ, τ) π(θ | τ).
• To obtain samples from f(x | τ), “integrate out” θ (i.e. discard the θ(i) values) to leave only X(i) ∼ f(x | τ).

Hugely simpler than calculating the exact algebraic expression!
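This recipe can be sketched in Python for the Poisson–Gamma model from the previous slides (the prior values and observed counts below are toy data, not from the slides): draw θ(i) from the Gamma posterior, then X(i) ∼ Poisson(θ(i)), and keep only the X(i). Their distribution matches the NegBin predictive, e.g. its mean (a + n x̄)/(b + n).

```python
import random, math

random.seed(4)
a, b = 2.0, 1.0                       # illustrative Gamma(a, b) prior (rate b)
data = [3, 4, 2, 5, 3, 4]             # toy observed counts
a_post, b_post = a + sum(data), b + len(data)

def sample_poisson(lam):
    # Knuth's method; adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

draws = []
for _ in range(50_000):
    theta = random.gammavariate(a_post, 1.0 / b_post)  # posterior draw (shape, scale)
    draws.append(sample_poisson(theta))                # predictive draw; theta is discarded

pred_mean = sum(draws) / len(draws)
# Predictive (NegBin) mean is (a + n*xbar)/(b + n) = 23/7 here
```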
Posterior Asymptotics

What happens to the posterior distribution as n → ∞?

1. Consistency:
   If the “true” value of θ is θ0, and if π(θ0) ≠ 0 (or is non-zero in a neighbourhood of θ0), then with increasing amounts of data (n → ∞) the posterior probability that θ equals (or lies in a neighbourhood of) θ0 → 1.

   Property akin to the classical notion of ‘consistency’.

2. Asymptotic Normality, for θ ∈ R^d:
   As n → ∞,

      π(θ | x) → N(θ0, [I(θ0)]⁻¹/n).

All the following arguments are heuristic/outline proofs.
Posterior Asymptotics

1. Consistency:
Let X1, . . . , Xn ∼ f(x | θ0) be iid observations, and suppose the prior is such that π(θ0) ≠ 0.

Then the posterior is

   π(θ | X_n) ∝ π(θ) ∏_{i=1}^n f(Xi | θ)
             = π(θ) exp( Σ_{i=1}^n ln f(Xi | θ) )
             = π(θ) exp(−n Dn(θ)) ∏_{i=1}^n f(Xi | θ0),

where Dn(θ) = (1/n) Σ_{i=1}^n ln[ f(Xi | θ0) / f(Xi | θ) ].
Posterior Asymptotics

For fixed θ, Dn(θ) is the average of n iid random variables, so it converges in probability to its expectation (law of large numbers):

   E[Dn(θ)] = ∫ f(x | θ0) ln( f(x | θ0) / f(x | θ) ) dx := D(θ).

The RHS is the Kullback-Leibler distance between f(x | θ0) and f(x | θ).

This distance satisfies D(θ) ≥ 0 for all θ, with equality if and only if f(x | θ0) ≡ f(x | θ), which here we assume is the same as θ = θ0 (this assumption is called the identifiability assumption). Hence, for θ ≠ θ0, we obtain D(θ) > 0 and hence exp(−n D(θ)) → 0 as n ↑ ∞. In other words,

   exp(−n Dn(θ)) →P 0 if θ ≠ θ0,  and →P 1 if θ = θ0,  as n ↑ ∞.

Therefore, the posterior spikes at θ0:

   π(θ | X_n) →P 0 for θ ≠ θ0, as n ↑ ∞, with the mass concentrating at θ0.

Hence, the posterior mode, say θ̂n, converges in probability to θ0.
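The consistency result can be illustrated with a conjugate Bernoulli example (flat Beta(1, 1) prior and true θ0 = 0.3, both chosen for illustration): as n grows, the posterior mean approaches θ0 and the posterior variance shrinks to zero.

```python
import random

random.seed(5)
theta0 = 0.3
posterior = {}
for n in (10, 1_000, 100_000):
    s = sum(random.random() < theta0 for _ in range(n))   # Bernoulli(theta0) successes
    a, b = 1 + s, 1 + n - s                               # Beta posterior under flat prior
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    posterior[n] = (mean, var)
# Posterior concentrates: mean -> theta0, variance -> 0.
```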
Posterior Asymptotics (Section 8.4 of Kroese & Chan)

Recall: Taylor series
For a function f(x) that is infinitely differentiable at a,

   f(x) = f(a) + f′(a)(x − a)/1! + f″(a)(x − a)²/2! + f‴(a)(x − a)³/3! + · · ·

2. Asymptotic Normality (univariate and continuous θ). Taylor expanding Dn(θ) around θ0 yields:

   Dn(θ) = Dn(θ0) + (θ − θ0) dDn/dθ(θ0) + ((θ − θ0)²/2) d²Dn/dθ²(θ0) + O((θ − θ0)³).

Now: a) we ignore the negligible residual O((θ − θ0)³); b) we note that Dn(θ0) →P D(θ0) = 0; and c) we note that (verify!)

   dDn/dθ(θ0) →P dD/dθ(θ0) = 0.
Posterior Asymptotics

Further, we have d):

   d²Dn/dθ²(θ0) →P −∫ f(x | θ0) [d² ln f(x | θ)/dθ²]|_{θ=θ0} dx := I(θ0),

the last being the definition of Fisher's information (a matrix, in general).

Therefore, as n ↑ ∞,

   Dn(θ) ≈ I(θ0) (θ − θ0)²/2.

Similarly, ln π(θ) = ln π(θ0) + O(θ − θ0). Using all of these results, the posterior is then proportional to (as n ↑ ∞):

   π(θ | X_n) ∝ exp(−n ((θ − θ0)²/2) I(θ0)).

In other words, the posterior converges to the pdf of the

   N(θ0, 1/(n I(θ0)))

distribution.
Posterior Asymptotics

Example: Normal model.
Let X1, . . . , Xn ∼ N(θ, σ²) where σ² is known. As usual, this gives the log-likelihood

   ln f(x | θ) = − Σ_{i=1}^n (xi − θ)²/(2σ²) + c1,

from which we obtain

   d ln f(x | θ)/dθ = (1/σ²) Σ_{i=1}^n (xi − θ)

and so

   d² ln f(x | θ)/dθ² = −n/σ².

The MLE is θ̂ = X̄n and In(θ) = n I(θ) = n/σ². So asymptotically, as n → ∞,

   θ | X_n ∼ N(X̄n, σ²/n).

This is true for any prior distribution which places non-zero probability around the true value of θ.
Likelihood Asymptotics

Consider again the likelihood model X ∼ Bin(n, θ):

   f(x | θ) = (n choose x) θ^x (1 − θ)^(n−x),  x = 0, . . . , n,

thus

   ln f(x | θ) = x ln θ + (n − x) ln(1 − θ) + const,

so

   d ln f(x | θ)/dθ = x/θ − (n − x)/(1 − θ)

and

   d² ln f(x | θ)/dθ² = −x/θ² − (n − x)/(1 − θ)².

Consequently (using EX = nθ),

   In(θ) = nθ/θ² + n(1 − θ)/(1 − θ)² = n/(θ(1 − θ)).

Thus, as n → ∞, we have

   θ | X →d N(θ, θ(1 − θ)/n).
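This approximation can be checked numerically: with a flat prior the exact posterior is Beta(x + 1, n − x + 1), which for large n should be close to N(θ̂, θ̂(1 − θ̂)/n). The values n = 1000, x = 300 below are illustrative.

```python
from math import lgamma, exp, log, pi, sqrt

n, x = 1_000, 300
theta_hat = x / n
sigma2 = theta_hat * (1 - theta_hat) / n     # theta(1 - theta)/n from the Fisher information

def beta_pdf(t, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return exp((a - 1) * log(t) + (b - 1) * log(1 - t) - log_B)

def normal_pdf(t, m, v):
    return exp(-(t - m) ** 2 / (2 * v)) / sqrt(2 * pi * v)

# Compare the two densities at theta_hat and one posterior-sd either side.
points = [theta_hat + k * sqrt(sigma2) for k in (-1, 0, 1)]
max_rel_err = max(
    abs(beta_pdf(t, x + 1, n - x + 1) - normal_pdf(t, theta_hat, sigma2))
    / normal_pdf(t, theta_hat, sigma2)
    for t in points
)
# max_rel_err is small (a few percent), as the asymptotics predict.
```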
Importance Sampling

Importance Sampling Algorithm

Algorithm 1 Importance Sampling
   for i = 1, . . . , N do
      Draw X(i) ∼ g(x)
      Set W(i) ∝ f(X(i)) / g(X(i))

Notes:
• The pairs (X(1), W(1)), . . . , (X(N), W(N)) are weighted samples from f(x).
• The weight is ∝ f(X)/g(X), not f(X)/(K g(X)) (as for rejection sampling), as K is lost in the proportionality.

How does inference work for weighted samples?
Importance Sampling

How does inference work for weighted samples? (Assume ∫ f(x)dx = 1.)

Unweighted expectation:

   E_g[h(X)] = ∫ h(x) g(x) dx ≈ (1/N) Σ_{i=1}^N h(X(i)),

where X(1), . . . , X(N) are samples from g(x).

Weighted expectation: defining weights w(x) = f(x)/g(x), then

   E_g[w(X)h(X)] = ∫ w(x) h(x) g(x) dx ≈ (1/N) Σ_{i=1}^N W(i) h(X(i))
                 = ∫ h(x) f(x) dx
                 = E_f[h(X)],

where X(1), . . . , X(N) are samples from g(x) and W(i) = w(X(i)).

i.e. weighted expectations under g(x) act as expectations under f(x).
Importance Sampling

What if f(x) is unnormalised? We then have f(x) = f̃(x)/Z, where Z = ∫ f̃(x)dx is unknown.

Weighted expectation: defining weights w̃(X) = f̃(X)/g(X), note that

   E_g[w̃(X)] = ∫ w̃(x) g(x) dx = ∫ f̃(x) dx = Z ≈ (1/N) Σ_{i=1}^N W̃(X(i)).

Then

   E_f[h(X)] = ∫ h(x) f(x) dx = (1/Z) ∫ h(x) f̃(x) dx
             = (1/Z) ∫ w̃(x) h(x) g(x) dx = E_g[w̃(X)h(X)] / E_g[w̃(X)]
             ≈ [ (1/N) Σ_{i=1}^N W̃(X(i)) h(X(i)) ] / [ (1/N) Σ_{i=1}^N W̃(X(i)) ]
             = Σ_{i=1}^N W(X(i)) h(X(i)),

where W(X(i)) = W̃(X(i)) / Σ_{j=1}^N W̃(X(j)) and X(1), . . . , X(N) ∼ g(x).

i.e. normalise the weights (to sum to one), then take expectations.

– Note: this is a biased (though consistent) estimator.
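The self-normalised estimator is easy to check on the slides' running example, f̃(x) = x(1 − x)³ on [0, 1] with uniform proposal, for which Z = B(2, 4) = 1/20 and E_f[X] = 1/3:

```python
import random

random.seed(6)
N = 200_000
xs = [random.random() for _ in range(N)]        # X(i) ~ g = U(0, 1)
w_tilde = [x * (1 - x) ** 3 for x in xs]        # unnormalised weights f~(x)/g(x)

Z_hat = sum(w_tilde) / N                        # estimates Z = B(2, 4) = 1/20
total = sum(w_tilde)
W = [w / total for w in w_tilde]                # normalised weights (sum to 1)
est_mean = sum(Wi * x for Wi, x in zip(W, xs))  # self-normalised estimate of E_f[X]
```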
Importance Sampling

Example:
Simulate from the density

   f(x) = 20x(1 − x)³ for 0 ≤ x ≤ 1,  and 0 otherwise.

Use importance sampling with

   g(x) = 1 for 0 ≤ x ≤ 1,  and 0 otherwise

(that is, use the uniform density on [0, 1]).

Previously (Lecture 3) we used rejection sampling with:
• K = 135/64 (if f(x) is normalised as f(x) = 20x(1 − x)³)
• K = 27/256 (if f(x) ∝ x(1 − x)³).
Importance Sampling

[Figure: histogram of the rejection samples, with the true density (red), a density estimate from the rejection samples (blue), and a weighted density estimate from the importance samples (green); legend: True, Rejection Sampling, Importance Sampling.]

L=10000
K=135/64
x=runif(L)
ind=(runif(L)<(20*x*(1-x)^3/K))
hist(x[ind],probability=T,xlab="x",ylab="Density",main="")
xx=seq(0,1,length=100)
lines(xx,20*xx*(1-xx)^3,lwd=2,col=2)
d=density(x[ind],from=0,to=1)
lines(d,col=4)

y=runif(L)
w=20*y*(1-y)^3
wTilde=y*(1-y)^3
W=wTilde/sum(wTilde)
d=density(y,weights=W,from=0,to=1)  # use normalised weights (density() expects weights summing to 1)
lines(d,col=3)

> mean(x[ind])
[1] 0.3353401
> mean(y)
[1] 0.5042156
> mean(w*y)
[1] 0.3334709
> sum(W*y)
[1] 0.3363843
Importance Sampling

[Figure: the importance weights plotted against the sample values y.]

• This shows the density of f(x) only because g(x) is uniform.
• In general the weight plot shows ∝ f(x)/g(x).
Importance Sampling

Note:
• Variability in the weights means some weighted samples contribute more than others in computations.
• Samples with low weights have a small contribution.
• ⇒ for efficiency, we would like all samples to contribute as equally as possible.

Concept: Variability of weights
• If the weights are highly variable then Var(wi) is high (= bad).
• If the weights have low variability then they are all similar values (Var(wi) is small) (= good).
• For best performance, prefer low-variability weights.
Importance Sampling

The variance of the weights is usually measured through the Effective Sample Size (ESS):

   ESS = [ Σ_{i=1}^n (W(i))² ]⁻¹,

where W(i) = W(X(i)) are the normalised weights. Note that

   1 ≤ ESS ≤ n.

• ESS = 1 when W(1) = 1 and W(2) = · · · = W(n) = 0 (sample depletion).
• ESS = n when W(i) = 1/n for all i.
• Loose interpretation: the equivalent number of equally weighted independent samples.

To maximise the ESS, choose g(x) to closely match f(x).
This is the same idea as when improving the efficiency of rejection sampling.
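The ESS formula is a one-liner; the extremes match the bullet points above (the weight vectors below are hypothetical, for illustration only):

```python
def ess(weights):
    # Effective sample size from (possibly unnormalised) weights.
    total = sum(weights)
    W = [w / total for w in weights]          # normalise to sum to one
    return 1.0 / sum(Wi ** 2 for Wi in W)

equal = ess([1.0] * 100)                      # all weights equal -> ESS = n
depleted = ess([1.0] + [0.0] * 99)            # one dominant weight -> ESS = 1
skewed = ess([1.0, 1.0, 1.0, 1.0, 4.0])       # in between
```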
Importance Sampling

Example: Obtain N samples from Beta(2, 2).

[Figure: panels comparing the two proposals below; red = f(x), blue = Kg(x).]

Top: g(x) = Beta(1, 1), eff = 681/1000, ESS = 4172.82 (84%)
Bottom: g(x) = Beta(1.5, 1.5), eff = 848/1000, ESS = 4634.02 (93%)
Importance Sampling

Monte Carlo integration using g(x): in order to estimate the integral Z = ∫ φ(x)dx we can:

• Reparameterise to an integral over U(0, 1), then compute

   ∫ φ(x)dx = ∫ φ′(u)du ≈ (1/N) Σᵢ φ′(U(i)),

  with U(1), . . . , U(N) ∼ U(0, 1) (as before). (Here φ′ denotes the reparameterised integrand, not a derivative.)

• As above, but with U(1), . . . , U(N) ∼ q(u) on (0, 1):

   ∫ φ(x)dx = ∫ φ′(u)du = ∫ (φ′(u)/q(u)) q(u)du ≈ (1/N) Σᵢ φ′(U(i))/q(U(i)),

  where q = U(0, 1) recovers the first method.

• Integrate without transformation to (0, 1):

   ∫ φ(x)dx = ∫ (φ(x)/g(x)) g(x)dx ≈ (1/N) Σᵢ φ(X(i))/g(X(i)),

  where X(1), . . . , X(N) ∼ g(x).

The aim is to choose q (or g) to minimise the variability of φ(X)/q(X) (etc.).
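The third (untransformed) estimator can be sketched in Python for the toy integral Z = ∫₀^∞ x² e^(−x) dx = 2 with proposal g = Exp(1); both the integrand and the proposal are illustrative choices, not from the slides:

```python
import random
from math import exp

random.seed(7)

def phi(x):
    return x * x * exp(-x)     # integrand; true integral over (0, inf) is Gamma(3) = 2

def g_pdf(x):
    return exp(-x)             # Exp(1) proposal density

N = 100_000
draws = [random.expovariate(1.0) for _ in range(N)]   # X(i) ~ g
Z_hat = sum(phi(x) / g_pdf(x) for x in draws) / N     # (1/N) sum phi(X)/g(X)
```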
