
An Introduction to

Objective Bayesian Statistics


José M. Bernardo
Universitat de València, Spain
<jose.m.bernardo@uv.es>
http://www.uv.es/bernardo
Université de Neuchâtel, Switzerland
March 10th to March 18th, 2004
Summary
1. Concept of Probability
Introduction. Notation. Statistical models.
Intrinsic discrepancy. Intrinsic convergence of distributions.
Foundations. Probability as a rational degree of belief.
2. Basics of Bayesian Analysis
Parametric inference. The learning process.
Reference analysis. No relevant initial information.
Inference summaries. Point and interval estimation.
Prediction. Regression.
Hierarchical models. Exchangeability.
3. Decision Making
Structure of a decision problem. Intrinsic Loss functions.
Formal point estimation. Intrinsic estimation.
Hypothesis testing. Bayesian reference criterion (BRC).
1. Concept of Probability
1.1. Introduction
Tentatively accept a formal statistical model
Typically suggested by informal descriptive evaluation
Conclusions conditional on the assumption that the model is correct
Bayesian approach firmly based on axiomatic foundations
Mathematical need to describe by probabilities all uncertainties
Parameters must have a (prior) distribution describing available
information about their values
Not a description of their variability (fixed unknown quantities),
but a description of the uncertainty about their true values.
Important particular case: no relevant (or subjective) initial information
Prior only based on model assumptions and well-documented data
Objective Bayesian Statistics:
Scientific and industrial reporting, public decision making
Notation
Under conditions C, p(x | C), π(ω | C) are, respectively, probability
densities (or mass) functions of observables x and parameters ω:
  p(x | C) ≥ 0,  ∫_X p(x | C) dx = 1,  E[x | C] = ∫_X x p(x | C) dx,
  π(ω | C) ≥ 0,  ∫_Ω π(ω | C) dω = 1,  E[ω | C] = ∫_Ω ω π(ω | C) dω.
Special densities (or mass) functions use specific notation, as
N(x | μ, σ²), Bi(x | n, θ), or Pn(x | λ). Other examples:
Beta  {Be(x | α, β), 0 < x < 1, α > 0, β > 0}
  Be(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1)
Gamma  {Ga(x | α, β), x > 0, α > 0, β > 0}
  Ga(x | α, β) = [β^α/Γ(α)] x^(α−1) e^(−βx)
Student  {St(x | μ, σ², ν), x ∈ ℝ, μ ∈ ℝ, σ > 0, ν > 0}
  St(x | μ, σ², ν) = [Γ{(ν+1)/2}/Γ(ν/2)] (1/(σ√(νπ))) [1 + (1/ν)((x − μ)/σ)²]^(−(ν+1)/2)
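As a quick cross-check of this notation, the minimal sketch below (not part of the original slides) maps each density onto scipy.stats calls; note that in St(x | μ, σ², ν) the σ is a scale parameter, not a standard deviation, and that Ga(x | α, β) uses β as a rate.

```python
# A minimal sketch: the special densities above expressed with scipy.stats,
# matching the parameterizations N(x | mu, sigma^2), Be(x | alpha, beta),
# Ga(x | alpha, beta) and St(x | mu, sigma^2, nu) used in the slides.
from scipy import stats

def N(x, mu, sigma2):          # N(x | mu, sigma^2)
    return stats.norm.pdf(x, loc=mu, scale=sigma2 ** 0.5)

def Be(x, a, b):               # Be(x | alpha, beta)
    return stats.beta.pdf(x, a, b)

def Ga(x, a, b):               # Ga(x | alpha, beta); beta is a rate parameter
    return stats.gamma.pdf(x, a, scale=1.0 / b)

def St(x, mu, sigma2, nu):     # St(x | mu, sigma^2, nu); sigma is a scale, not the sd
    return stats.t.pdf(x, nu, loc=mu, scale=sigma2 ** 0.5)

def Bi(r, n, theta):           # Bi(r | n, theta)
    return stats.binom.pmf(r, n, theta)

def Pn(r, lam):                # Pn(r | lambda)
    return stats.poisson.pmf(r, lam)
```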
Statistical Models
Statistical model generating x ∈ X:  {p(x | θ), x ∈ X, θ ∈ Θ}
Parameter vector θ = {θ_1, ..., θ_k} ∈ Θ. Parameter space Θ ⊂ ℝ^k.
Data set x ∈ X. Sampling space X, of arbitrary structure.
Likelihood function of x, l(θ | x):
  l(θ | x) = p(x | θ), as a function of θ ∈ Θ.
Maximum likelihood estimator (mle) of θ:  θ̂ = θ̂(x) = arg sup_θ l(θ | x)
Data x = {x_1, ..., x_n} random sample (iid) from model if
  p(x | θ) = ∏_{j=1}^n p(x_j | θ),  x_j ∈ X,  X = X^n
Behaviour under repeated sampling (general, not iid data):
Considering {x_1, x_2, ...}, a (possibly infinite) sequence
of possible replications of the complete data set x.
Denote by x^(m) = {x_1, ..., x_m} a finite set of m such replications.
Asymptotic results obtained as m → ∞
1.2. Intrinsic Divergence
Logarithmic divergences
The logarithmic divergence (Kullback-Leibler) k{p̂ | p} of a density p̂(x)
from its true density p(x) is
  k{p̂ | p} = ∫_X p(x) log [p(x)/p̂(x)] dx,  (provided this exists)
The functional k{p̂ | p} is non-negative (zero iff p̂(x) = p(x) a.e.) and
invariant under one-to-one transformations of x.
But k{p_1 | p_2} is not symmetric and diverges if, strictly, X_2 ⊄ X_1.
Intrinsic discrepancy between distributions
  δ{p, q} = min { ∫_X p(x) log [p(x)/q(x)] dx,  ∫_X q(x) log [q(x)/p(x)] dx }
The intrinsic discrepancy δ{p, q} is non-negative (zero iff p = q a.e.),
invariant under one-to-one transformations of x,
defined if X_2 ⊂ X_1 or X_1 ⊂ X_2, and it has an operative interpretation as the
minimum amount of information (in nits) required to discriminate between p and q.
Interpretation and calibration of the intrinsic discrepancy
Let {p_1(x | θ_1), θ_1 ∈ Θ_1} or {p_2(x | θ_2), θ_2 ∈ Θ_2} be two alternative
statistical models for x ∈ X, one of which is assumed to be true. The
intrinsic discrepancy δ{θ_1, θ_2} = δ{p_1, p_2} is then the minimum expected
log-likelihood ratio in favour of the true model.
Indeed, if p_1(x | θ_1) is the true model, the expected log-likelihood ratio in its
favour is E_1[log{p_1(x | θ_1)/p_2(x | θ_2)}] = k{p_2 | p_1}. If the true model
is p_2(x | θ_2), the expected log-likelihood ratio in favour of the true model
is k{p_1 | p_2}. But δ{p_1, p_2} = min[k{p_2 | p_1}, k{p_1 | p_2}].
Calibration. δ = log[100] ≈ 4.6 nits: likelihood ratios for the true model
larger than 100, making discrimination very easy.
δ = log(1 + ε) ≈ ε nits: likelihood ratios for the true model may be about
1 + ε, making discrimination very hard.

  Intrinsic discrepancy δ:                  0.01   0.69   2.3   4.6    6.9
  Average likelihood ratio
  for the true model, exp[δ]:               1.01   2      10    100    1000
Example. Conventional Poisson approximation Pn(r | nθ) of Binomial
probabilities Bi(r | n, θ):
  δ(Bi, Pn) = δ(n, θ) = k(Pn | Bi) = Σ_{r=0}^n Bi(r | n, θ) log [Bi(r | n, θ)/Pn(r | nθ)]
[Figure: δ{Bi(r | n, θ), Pn(r | nθ)} as a function of θ (0.05 to 0.20), for n = 1, 2, 5.]
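The following is a minimal sketch (mine, not the slides') of the computation behind this example: δ(n, θ) = k(Pn | Bi), the Kullback-Leibler divergence of the Poisson approximation from the exact Binomial, evaluated on the grid of θ values shown in the figure.

```python
# Sketch of the Binomial-Poisson example: delta(n, theta) = k(Pn | Bi),
# the KL divergence of Pn(r | n*theta) from Bi(r | n, theta).
import numpy as np
from scipy import stats

def delta_bi_pn(n, theta):
    r = np.arange(0, n + 1)
    bi = stats.binom.pmf(r, n, theta)
    pn = stats.poisson.pmf(r, n * theta)
    return float(np.sum(bi * np.log(bi / pn)))

for n in (1, 2, 5):
    print(n, [round(delta_bi_pn(n, th), 5) for th in (0.05, 0.10, 0.15, 0.20)])
```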
Intrinsic Convergence of Distributions
Intrinsic Convergence. A sequence of probability densities (or mass)
functions {p_i(x)}_{i=1}^∞ converges intrinsically to p(x) if (and only if) the
intrinsic divergence between p_i(x) and p(x) converges to zero, i.e., iff
lim_{i→∞} δ(p_i, p) = 0.
Example. Normal approximation to a Student distribution.
  δ(ν) = δ{St(x | 0, 1, ν), N(x | 0, 1)} = ∫_ℝ N(x | 0, 1) log [N(x | 0, 1)/St(x | 0, 1, ν)] dx
[Figure: δ(ν) as a function of the degrees of freedom ν.]
The function δ(ν) converges rapidly to zero. δ(18) = 0.004.
1.3. Foundations
Foundations of Statistics
Axiomatic foundations on rational description of uncertainty imply that
the uncertainty about all unknown quantities should be measured with
probability distributions {π(ω | C), ω ∈ Ω} describing the plausibility
of their values given available conditions C.
Axioms have a strong intuitive appeal; examples include
Transitivity of plausibility:
  If E_1 > E_2 | C, and E_2 > E_3 | C, then E_1 > E_3 | C.
The sure-thing principle:
  If E_1 > E_2 | A, C and E_1 > E_2 | Ā, C, then E_1 > E_2 | C.
Axioms are not a description of actual human activity, but a normative
set of principles for those aspiring to rational behaviour.
Absolute probabilities do not exist. Typical applications produce
Pr(E | x, A, K), a measure of rational belief in the occurrence of the
event E, given data x, assumptions A and available knowledge K.
Probability as a Measure of Conditional Uncertainty
Axiomatic foundations imply that Pr(E | C), the probability of an event
E given C, is always a conditional measure of the (presumably rational)
uncertainty, on a [0, 1] scale, about the occurrence of E in conditions C.
Probabilistic diagnosis. V is the event that a person carries a virus
and + a positive test result. All related probabilities, e.g.,
  Pr(+ | V) = 0.98,  Pr(+ | V̄) = 0.01,  Pr(V | K) = 0.002,
  Pr(+ | K) = Pr(+ | V) Pr(V | K) + Pr(+ | V̄) Pr(V̄ | K) = 0.012,
  Pr(V | +, A, K) = Pr(+ | V) Pr(V | K) / Pr(+ | K) = 0.164  (Bayes' theorem),
are conditional uncertainty measures (and proportion estimates).
Estimation of a proportion. Survey conducted to estimate
the proportion θ of positive individuals in a population.
Random sample of size n with r positive.
Pr(a < θ < b | r, n, A, K), a conditional measure of the uncertainty
about the event that θ belongs to [a, b] given assumptions A,
initial knowledge K and data {r, n}.
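A short sketch of the diagnosis computation above; all numbers are taken from the slide.

```python
# Sketch of the probabilistic diagnosis computation (Bayes' theorem).
p_pos_given_V    = 0.98    # Pr(+ | V)
p_pos_given_notV = 0.01    # Pr(+ | not V)
p_V              = 0.002   # Pr(V | K)

p_pos = p_pos_given_V * p_V + p_pos_given_notV * (1 - p_V)   # Pr(+ | K), about 0.012
p_V_given_pos = p_pos_given_V * p_V / p_pos                   # Pr(V | +, A, K), about 0.164
print(round(p_pos, 3), round(p_V_given_pos, 3))
```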
Measurement of a physical constant. Measuring the unknown value of a
physical constant μ, with data x = {x_1, ..., x_n}, considered to be
measurements of μ subject to error. Desired to find
Pr(a < μ < b | x_1, ..., x_n, A, K), the probability that the unknown
value of μ (fixed in nature, but unknown to the scientists)
belongs to [a, b] given the information provided by the data x,
assumptions A made, and available knowledge K.
The statistical model may include nuisance parameters, unknown quantities
λ, which have to be eliminated in the statement of the final results.
For instance, the precision of the measurements described by the unknown
standard deviation σ in a N(x | μ, σ) normal model.
Relevant scientific information may impose restrictions on the admissible
values of the quantities of interest. These must be taken into account.
For instance, in measuring the value of the gravitational field g in a
laboratory, it is known that it must lie between 9.7803 m/sec² (average
value at the Equator) and 9.8322 m/sec² (average value at the poles).
Future discrete observations. Experiment counting the number r
of times that an event E takes place in each of n replications.
Desired to forecast the number of times r that E will take place
in a future, similar situation, Pr(r | r_1, ..., r_n, A, K).
For instance, no accidents in each of n = 10 consecutive months
may yield Pr(r = 0 | x, A, K) = 0.953.
Future continuous observations. Data x = {y_1, ..., y_n}. Desired
to forecast the value of a future observation y, p(y | x, A, K).
For instance, from breaking strengths x = {y_1, ..., y_n} of n
randomly chosen safety belt webbings, the engineer may find
Pr(y > y* | x, A, K) = 0.9987.
Regression. Data set consists of pairs x = {(y_1, v_1), ..., (y_n, v_n)}
of quantity y_j observed in conditions v_j.
Desired to forecast the value of y in conditions v, p(y | v, x, A, K).
For instance, y contamination levels, v wind speed from source;
environment authorities interested in Pr(y > y* | v, x, A, K).
2. Basics of Bayesian Analysis
2.1. Parametric Inference
Bayes' Theorem
Let M = {p(x | θ), x ∈ X, θ ∈ Θ} be a statistical model, let π(θ | K)
be a probability density for θ given prior knowledge K and let x be some
available data.
  π(θ | x, M, K) = p(x | θ) π(θ | K) / ∫_Θ p(x | θ) π(θ | K) dθ
encapsulates all information about θ given data and prior knowledge.
Simplifying notation, Bayes' theorem may be expressed as
  π(θ | x) ∝ p(x | θ) π(θ):
The posterior is proportional to the likelihood times the prior. The
missing proportionality constant [∫_Θ p(x | θ) π(θ) dθ]^(−1) may be
deduced from the fact that π(θ | x) must integrate to one. To identify a
posterior distribution it suffices to identify a kernel k(θ, x) such that
π(θ | x) = c(x) k(θ, x). This is a very common technique.
Bayesian Inference with a Finite Parameter Space
Model {p(x | θ_i), x ∈ X, θ_i ∈ Θ}, with Θ = {θ_1, ..., θ_m}, so that θ
may only take a finite number m of different values. Using the finite
form of Bayes' theorem,
  Pr(θ_i | x) = p(x | θ_i) Pr(θ_i) / Σ_{j=1}^m p(x | θ_j) Pr(θ_j),  i = 1, ..., m.
Example: Probabilistic diagnosis. A test to detect a virus is known
from laboratory research to give a positive result in 98% of the infected
people and in 1% of the non-infected. The posterior probability that a
person who tested positive is infected is
  Pr(V | +) = 0.98 p / (0.98 p + 0.01 (1 − p)),  as a function of p = Pr(V).
[Figure: Pr(V | +) as a function of the prior p = Pr(V).]
Notice the sensitivity of the posterior Pr(V | +) to changes
in the prior p = Pr(V).
Example: Inference about a binomial parameter
Let data x be n Bernoulli observations with parameter θ
which contain r positives, so that p(x | θ, n) = θ^r (1 − θ)^(n−r).
If π(θ) = Be(θ | α, β), then
  π(θ | x) ∝ θ^(r+α−1) (1 − θ)^(n−r+β−1),
the kernel of Be(θ | r + α, n − r + β).
Prior information (K): P(0.4 < θ < 0.6) = 0.95,
and symmetric, yields α = β = 47;
No prior information: α = β = 1/2.
n = 1500, r = 720:
  P(θ < 0.5 | x, K) = 0.933
  P(θ < 0.5 | x) = 0.934
n = 100, r = 0:
  P(θ < 0.01 | x) = 0.844
Notice: θ̂ = 0, but Me[θ | x] = 0.0023
[Figures: posterior densities of θ for the two data sets.]
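A sketch reproducing the posterior probabilities quoted above with scipy.stats; the Beta parameters follow directly from Be(θ | r + α, n − r + β).

```python
# Sketch of the binomial-parameter example: Beta posteriors and the
# posterior probabilities quoted on the slide.
from scipy import stats

# n = 1500, r = 720, informative prior Be(theta | 47, 47)
print(stats.beta.cdf(0.5, 720 + 47, 1500 - 720 + 47))   # P(theta < 0.5 | x, K), ~0.933
# Same data, reference prior Be(theta | 1/2, 1/2)
print(stats.beta.cdf(0.5, 720.5, 780.5))                 # P(theta < 0.5 | x), ~0.934
# n = 100, r = 0, reference prior
post = stats.beta(0.5, 100.5)
print(post.cdf(0.01), post.median())                     # ~0.844 and ~0.0023
```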
Sufficiency
Given a model p(x | θ), a function of the data t = t(x) is a sufficient
statistic if it encapsulates all information about θ available in x.
Formally, t = t(x) is sufficient if (and only if), for any prior π(θ),
π(θ | x) = π(θ | t). Hence, π(θ | x) = π(θ | t) ∝ p(t | θ) π(θ).
This is equivalent to the frequentist definition; thus t = t(x) is sufficient
iff p(x | θ) = f(θ, t) g(x).
A sufficient statistic always exists, for t(x) = x is obviously sufficient.
A much simpler sufficient statistic, with fixed dimensionality
independent of the sample size, often exists.
This is the case whenever the statistical model belongs to the
generalized exponential family, which includes many of the
more frequently used statistical models.
In contrast to frequentist statistics, Bayesian methods are independent
of the possible existence of a sufficient statistic of fixed dimensionality.
For instance, if data come from a Student distribution, there is no
sufficient statistic of fixed dimensionality: all data are needed.
Example: Inference from Cauchy observations
Data x = {x_1, ..., x_n} random from Ca(x | μ, 1) = St(x | μ, 1, 1).
Objective reference prior for the location parameter μ is π(μ) = 1.
By Bayes' theorem,
  π(μ | x) ∝ ∏_{j=1}^n Ca(x_j | μ, 1) π(μ) ∝ ∏_{j=1}^n 1 / [1 + (x_j − μ)²].
Proportionality constant easily obtained by numerical integration.
Five samples of size n = 2 simulated from Ca(x | 5, 1):
  x_1       x_2
  4.034     4.054
  21.220    5.831
  5.272     6.475
  4.776     5.317
  7.409     4.743
[Figure: the five posterior densities π(μ | x).]
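A minimal sketch of the numerical normalization mentioned above, applied to the first simulated sample; the helper name cauchy_posterior is mine.

```python
# Sketch of the Cauchy example: pi(mu | x) is proportional to
# prod_j 1 / (1 + (x_j - mu)^2); the normalizing constant is obtained numerically.
import numpy as np
from scipy.integrate import quad

def cauchy_posterior(data):
    kernel = lambda mu: np.prod(1.0 / (1.0 + (np.asarray(data) - mu) ** 2))
    const, _ = quad(kernel, -np.inf, np.inf)
    return lambda mu: kernel(mu) / const

post = cauchy_posterior([4.034, 4.054])    # first simulated sample of size n = 2
print(post(4.0), post(5.0))                # posterior density at two mu values
```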
Improper prior functions
Objective Bayesian methods often use functions which play the role of
prior distributions but are not probability distributions.
An improper prior function is a non-negative function π(θ) such that
∫_Θ π(θ) dθ is not finite.
The Cauchy example uses the improper prior function π(μ) = 1, μ ∈ ℝ.
If π(θ) is an improper prior function, {Θ_i}_{i=1}^∞ an increasing sequence
approximating Θ, such that ∫_{Θ_i} π(θ) dθ < ∞, and {π_i(θ)}_{i=1}^∞ the proper
priors obtained by renormalizing π(θ) within the Θ_i's, then,
for any data x with likelihood p(x | θ), the sequence of posteriors
π_i(θ | x) converges intrinsically to π(θ | x) ∝ p(x | θ) π(θ).
Example: Normal data, σ known, π(μ) = 1.
  π(μ | x) ∝ p(x | μ, σ) π(μ) ∝ exp[−(n/(2σ²)) (x̄ − μ)²],
  π(μ | x) = N(μ | x̄, σ²/n)
Example: n = 9, x̄ = 2.11, σ = 4
[Figure: proper posteriors π_i(μ | x) and their limit π(μ | x).]
Sequential updating
Prior and posterior are terms relative to a set of data.
If data x = {x_1, ..., x_n} are sequentially presented, the final result will
be the same whether data are globally or sequentially processed.
  π(θ | x_1, ..., x_{i+1}) ∝ p(x_{i+1} | θ) π(θ | x_1, ..., x_i).
The posterior at a given stage becomes the prior at the next.
Typically (but not always), the new posterior, π(θ | x_1, ..., x_{i+1}), is
more concentrated around the true value than π(θ | x_1, ..., x_i).
Posteriors π(λ | x_1, ..., x_i) from increasingly large
simulated data sets from a Poisson Pn(x | λ), with λ = 3:
  π(λ | x_1, ..., x_i) = Ga(λ | r_i + 1/2, i),  r_i = Σ_{j=1}^i x_j
[Figure: posteriors for sample sizes n = 5, 10, 20, 50, 100.]
Nuisance parameters
In general the vector of interest is not the whole parameter vector θ, but
some function φ = φ(θ) of possibly lower dimension.
By Bayes' theorem π(θ | x) ∝ p(x | θ) π(θ). Let ω = ω(θ) ∈ Ω be
another function of θ such that ψ = {φ, ω} is a bijection of θ, and let
J(ψ) = (∂θ/∂ψ) be the Jacobian of the inverse function θ = θ(ψ).
From probability theory, π(ψ | x) = |J(ψ)| [π(θ | x)]_{θ=θ(ψ)}
and π(φ | x) = ∫_Ω π(φ, ω | x) dω.
Any valid conclusion on φ will be contained in π(φ | x).
Particular case: marginal posteriors
Often the model is directly expressed in terms of the vector of interest φ
and a vector of nuisance parameters ω, p(x | θ) = p(x | φ, ω).
Specify the prior π(θ) = π(φ) π(ω | φ)
Get the joint posterior π(φ, ω | x) ∝ p(x | φ, ω) π(ω | φ) π(φ)
Integrate out ω:  π(φ | x) ∝ π(φ) ∫_Ω p(x | φ, ω) π(ω | φ) dω
Example: Inferences about a Normal mean
Data x = {x_1, ..., x_n} random from N(x | μ, σ²). Likelihood function
  p(x | μ, σ) ∝ σ^(−n) exp[−n{s² + (x̄ − μ)²}/(2σ²)],
with n x̄ = Σ_i x_i, and n s² = Σ_i (x_i − x̄)².
Objective prior is uniform in both μ and log(σ), i.e., π(μ, σ) = σ^(−1).
Joint posterior
  π(μ, σ | x) ∝ σ^(−(n+1)) exp[−n{s² + (x̄ − μ)²}/(2σ²)].
Marginal posterior
  π(μ | x) ∝ ∫_0^∞ π(μ, σ | x) dσ ∝ [s² + (x̄ − μ)²]^(−n/2),
kernel of the Student density St(μ | x̄, s²/(n − 1), n − 1).
Classroom experiment to measure gravity g yields
x̄ = 9.8087, s = 0.0428 with n = 20 measures.
  π(g | x̄, s, n) = St(g | 9.8087, 0.0001, 19)
  Pr(9.788 < g < 9.829 | x) = 0.95 (shaded area)
[Figure: posterior density of g.]
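A sketch of the gravity example above, using scipy.stats.t for the reference posterior St(μ | x̄, s²/(n − 1), n − 1); the second line anticipates the restricted-range computation on the next slide.

```python
# Sketch of the gravity example: reference posterior St(mu | xbar, s^2/(n-1), n-1)
# and the 0.95 equal-tailed credible interval.
from scipy import stats

n, xbar, s = 20, 9.8087, 0.0428
post = stats.t(df=n - 1, loc=xbar, scale=s / (n - 1) ** 0.5)
print(post.interval(0.95))                  # approximately (9.788, 9.829)
print(post.cdf(9.8322) - post.cdf(9.7803))  # posterior probability of the restricted range
```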
Restricted parameter space
Range of values of θ restricted by contextual considerations.
If θ is known to belong to Θ_c ⊂ Θ, then π(θ) > 0 iff θ ∈ Θ_c.
By Bayes' theorem,
  π(θ | x, θ ∈ Θ_c) = π(θ | x) / ∫_{Θ_c} π(θ | x) dθ  if θ ∈ Θ_c,
  and 0 otherwise.
To incorporate a restriction, it suffices to renormalize the unrestricted
posterior distribution to the set Θ_c of admissible parameter values.
Classroom experiment to measure gravity g with the restriction that g
lies between g_0 = 9.7803 (equator) and g_1 = 9.8322 (poles).
  Pr(9.7803 < g < 9.8322 | x) = 0.95 (shaded area)
[Figure: restricted posterior density of g.]
Asymptotic behaviour, discrete case
If the parameter space Θ = {θ_1, θ_2, ...} is countable and
the true parameter value θ_t is distinguishable from the others, i.e.,
  δ{p(x | θ_t), p(x | θ_i)} > 0,  i ≠ t,
then
  lim_{n→∞} π(θ_t | x_1, ..., x_n) = 1,
  lim_{n→∞} π(θ_i | x_1, ..., x_n) = 0,  i ≠ t.
To prove this, take logarithms in Bayes' theorem,
define z_i = log[p(x | θ_i)/p(x | θ_t)],
and use the strong law of large numbers on the n
i.i.d. random variables z_1, ..., z_n.
For instance, in probabilistic diagnosis the posterior probability of the
true disease converges to one as new relevant information accumulates,
provided the model distinguishes the probabilistic behaviour of data under
the true disease from its behaviour under the other alternatives.
Asymptotic behaviour, continuous case
If the parameter θ is one-dimensional and continuous, so that Θ ⊂ ℝ,
and the model {p(x | θ), x ∈ X} is regular: basically,
  X does not depend on θ,
  p(x | θ) is twice differentiable with respect to θ,
then, as n → ∞, π(θ | x_1, ..., x_n) converges intrinsically
to a normal distribution with mean at the mle estimator θ̂,
and with variance v(x_1, ..., x_n, θ̂), where
  v^(−1)(x_1, ..., x_n, θ̂) = −Σ_{j=1}^n (∂²/∂θ²) log p(x_j | θ) |_{θ=θ̂}
To prove this, express Bayes' theorem as
  π(θ | x_1, ..., x_n) ∝ exp[log π(θ) + Σ_{j=1}^n log p(x_j | θ)],
and expand Σ_{j=1}^n log p(x_j | θ) about its maximum, the mle θ̂.
The result is easily generalized to the case θ = {θ_1, ..., θ_k}, to obtain
a limiting multivariate normal N_k(θ | θ̂, V(x_1, ..., x_n, θ̂)).
Asymptotic behaviour, continuous case. Simpler form
Using the strong law of large numbers on the sums above, a simpler, less
precise approximation is obtained:
If the parameter θ = {θ_1, ..., θ_k} is continuous, so that Θ ⊂ ℝ^k,
and the model {p(x | θ), x ∈ X} is regular; basically:
  X does not depend on θ,
  p(x | θ) is twice differentiable with respect to each of the θ_i's,
then, as n → ∞, π(θ | x_1, ..., x_n) converges intrinsically to a multivariate
normal distribution N_k{θ | θ̂, n^(−1) F^(−1)(θ̂)} with mean the mle θ̂ and
precision (inverse of variance) matrix n F(θ̂), where F(θ) is Fisher's
information matrix, of general element
  F_ij(θ) = −E_{x|θ}[ ∂²/(∂θ_i ∂θ_j) log p(x | θ) ]
From this result, the properties of the multivariate normal immediately
yield the asymptotic forms for the marginal and the conditional posterior
distributions of any subgroup of the θ_j's.
Example: Asymptotic approximation with Poisson data
Data x = {x_1, ..., x_n} random from Pn(x | λ) ∝ e^(−λ) λ^x;
hence, p(x | λ) ∝ e^(−nλ) λ^r,  r = Σ_j x_j, and λ̂ = r/n.
Fisher's function is F(λ) = −E_{x|λ}[ ∂²/∂λ² log Pn(x | λ) ] = 1/λ.
The objective prior function is π(λ) = F(λ)^(1/2) = λ^(−1/2).
Hence π(λ | x) ∝ e^(−nλ) λ^(r−1/2), the kernel of Ga(λ | r + 1/2, n).
The Normal approximation is
  π(λ | x) ≈ N{λ | λ̂, n^(−1) F^(−1)(λ̂)} = N{λ | r/n, r/n²}.
Samples of sizes n = 5 and n = 25, simulated from a Poisson with λ = 3,
yielded r = 19 and r = 82.
[Figure: exact posteriors π(λ | x) and their normal approximations.]
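A sketch comparing the exact reference posterior Ga(λ | r + 1/2, n) with the asymptotic approximation N(λ | r/n, r/n²) for the two simulated samples above; the evaluation grid is arbitrary.

```python
# Sketch of the Poisson example: exact reference posterior Ga(lambda | r + 1/2, n)
# against its asymptotic normal approximation N(lambda | r/n, r/n^2).
import numpy as np
from scipy import stats

def compare(r, n, grid=np.linspace(0.5, 6, 5)):
    exact = stats.gamma.pdf(grid, r + 0.5, scale=1.0 / n)
    approx = stats.norm.pdf(grid, loc=r / n, scale=np.sqrt(r) / n)
    return np.round(np.c_[grid, exact, approx], 3)

print(compare(19, 5))     # n = 5,  r = 19 (simulated from Poisson with lambda = 3)
print(compare(82, 25))    # n = 25, r = 82
```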
2.2. Reference Analysis
No Relevant Initial Information
Identify the mathematical form of a noninformative prior: one with
minimal effect, relative to the data, on the posterior distribution of the
quantity of interest.
Intuitive basis:
Use information theory to measure the amount of information about the
quantity of interest to be expected from data. This depends on prior
knowledge: the more that is known, the less the amount of information the
data may be expected to provide.
Define the missing information about the quantity of interest as that
which infinite independent replications of the experiment could possibly
provide.
Define the reference prior as that which maximizes the missing information
about the quantity of interest.
Expected information from the data
Given model {p(x | θ), x ∈ X, θ ∈ Θ}, the amount of information
I^θ{X, π(θ)} which may be expected to be provided by x about the
value of θ is defined by
  I^θ{X, π(θ)} = E_x[ ∫_Θ π(θ | x) log [π(θ | x)/π(θ)] dθ ],
the expected logarithmic divergence between prior and posterior.
Consider I^θ{X^k, π(θ)}, the information about θ which may be expected
from k conditionally independent replications of the original setup.
As k → ∞, this would provide any missing information about θ. Hence,
as k → ∞, the functional I^θ{X^k, π(θ)} will approach the missing
information about θ associated with the prior π(θ).
Let π_k(θ) be the prior which maximizes I^θ{X^k, π(θ)} in the class P of
strictly positive prior distributions compatible with accepted assumptions
on the value of θ (which may be the class of all strictly positive priors).
The reference prior π*(θ) is the limit as k → ∞ (in a sense to be made
precise) of the sequence of priors {π_k(θ), k = 1, 2, ...}.
Reference priors in the finite case
If θ may only take a finite number m of different values {θ_1, ..., θ_m}
and π(θ) = {p_1, ..., p_m}, with p_i = Pr(θ = θ_i), then
  lim_{k→∞} I^θ{X^k, π(θ)} = H(p_1, ..., p_m) = −Σ_{i=1}^m p_i log(p_i),
that is, the entropy of the prior distribution {p_1, ..., p_m}.
In the finite case, the reference prior is that with maximum entropy within
the class P of priors compatible with accepted assumptions.
(cf. Statistical Physics)
If, in particular, P contains all priors over {θ_1, ..., θ_m}, the reference
prior is the uniform prior, π(θ) = {1/m, ..., 1/m}.
(cf. Bayes-Laplace postulate of insufficient reason)
Example. Prior {p_1, p_2, p_3, p_4} in a genetics problem where p_1 = 2p_2.
The reference prior is {0.324, 0.162, 0.257, 0.257}.
[Figure: the entropy surface H(p_2, p_3) over the restricted simplex.]
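A sketch (my own numerical check, not the slides' derivation) of the genetics example: maximizing the entropy subject to p_1 = 2p_2 recovers the reference prior quoted above.

```python
# Sketch: numerical maximum-entropy reference prior for the genetics example,
# maximizing H(p) subject to p1 = 2*p2 and sum(p) = 1.
import numpy as np
from scipy.optimize import minimize

def neg_entropy(q):                      # q = (p2, p3, p4); p1 = 2*p2
    p = np.array([2 * q[0], q[0], q[1], q[2]])
    return np.sum(p * np.log(p))

cons = {"type": "eq", "fun": lambda q: 3 * q[0] + q[1] + q[2] - 1.0}
res = minimize(neg_entropy, x0=[0.2, 0.2, 0.2], constraints=cons,
               bounds=[(1e-6, 1.0)] * 3)
p2, p3, p4 = res.x
print(np.round([2 * p2, p2, p3, p4], 3))   # approximately [0.324, 0.162, 0.257, 0.257]
```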
Reference priors in the one-dimensional continuous case
Let π_k(θ) be the prior which maximizes I^θ{X^k, π(θ)} in the class P of
acceptable priors.
For any data x ∈ X, let π_k(θ | x) ∝ p(x | θ) π_k(θ) be
the corresponding posterior.
The reference posterior density π*(θ | x) is defined to be the intrinsic
limit of the sequence {π_k(θ | x), k = 1, 2, ...}.
A reference prior function π*(θ) is any positive function such that,
for all x ∈ X, π*(θ | x) ∝ p(x | θ) π*(θ).
This is defined up to an (irrelevant) arbitrary constant.
Let x^(k) ∈ X^k be the result of k independent replications of x ∈ X.
With calculus of variations, the exact expression for π_k(θ) is found to be
  π_k(θ) = exp{ E_{x^(k)|θ}[ log π_k(θ | x^(k)) ] }
For large k, this allows a numerical derivation of the reference prior by
repeated simulation from p(x | θ) for different θ values.
Reference priors under regularity conditions
Let θ̃_k = θ̃(x^(k)) be a consistent, asymptotically sufficient estimator
of θ. In regular problems this is often the case with the mle estimator θ̂.
The exact expression for π_k(θ) then becomes, for large k,
  π_k(θ) ∝ exp[ E_{θ̃_k|θ}{ log π_k(θ | θ̃_k) } ]
As k → ∞ this converges to π_k(θ | θ̃_k)|_{θ̃_k = θ}.
Let θ̃_k = θ̃(x^(k)) be a consistent, asymptotically sufficient estimator
of θ, and let π(θ | θ̃_k) be any asymptotic approximation to π(θ | x^(k)), the
posterior distribution of θ.
Hence, π*(θ) = π(θ | θ̃_k)|_{θ̃_k = θ}.
Under regularity conditions, the posterior distribution of θ
is asymptotically Normal, N(θ | θ̂, n^(−1) F^(−1)(θ̂)), where
F(θ) = −E[∂² log p(x | θ)/∂θ²] is Fisher's information function.
Hence, π*(θ) = F(θ)^(1/2)  (cf. Jeffreys' rule).
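A sketch illustrating π*(θ) = F(θ)^(1/2) on a Bernoulli model (an illustrative choice of model, not taken from this slide): the numerically computed Fisher information matches 1/(θ(1 − θ)), so the reference prior is proportional to θ^(−1/2)(1 − θ)^(−1/2), the Be(θ | 1/2, 1/2) kernel used later in the slides.

```python
# Sketch: Fisher information of a Bernoulli model computed numerically,
# to illustrate pi*(theta) = F(theta)^(1/2) (Jeffreys' rule).
import numpy as np

def fisher_bernoulli(theta, eps=1e-5):
    # numerical -E[ d^2/dtheta^2 log p(x | theta) ] over x in {0, 1}
    def loglik(x, th):
        return x * np.log(th) + (1 - x) * np.log(1 - th)
    total = 0.0
    for x, px in ((0, 1 - theta), (1, theta)):
        d2 = (loglik(x, theta + eps) - 2 * loglik(x, theta)
              + loglik(x, theta - eps)) / eps ** 2
        total += -px * d2
    return total

for th in (0.1, 0.3, 0.5):
    print(round(fisher_bernoulli(th), 3), round(1 / (th * (1 - th)), 3))
```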
One nuisance parameter
Two parameters: reduce the problem to a sequential application of the
one-parameter case. The probability model is {p(x | θ, λ), θ ∈ Θ, λ ∈ Λ} and
a θ-reference prior π*_θ(θ, λ) is required. Two steps:
(i) Conditional on θ, p(x | θ, λ) only depends on λ, and it is possible to
obtain the conditional reference prior π*(λ | θ).
(ii) If π*(λ | θ) is proper, integrate out λ to get the one-parameter model
p(x | θ) = ∫_Λ p(x | θ, λ) π*(λ | θ) dλ, and use the one-parameter solution
to obtain π*(θ).
The θ-reference prior is then π*_θ(θ, λ) = π*(λ | θ) π*(θ).
The required reference posterior is π*(θ | x) ∝ p(x | θ) π*(θ).
If π*(λ | θ) is an improper prior function, proceed within an increasing
sequence {Λ_i} over which π*(λ | θ) is integrable and, for given data x,
obtain the corresponding sequence of reference posteriors {π*_i(θ | x)}.
The required reference posterior π*(θ | x) is their intrinsic limit.
A θ-reference prior is any positive function such that, for any data x,
  π*(θ | x) ∝ ∫_Λ p(x | θ, λ) π*_θ(θ, λ) dλ.
The regular two-parameter continuous case
Model p(x | θ, λ). If the joint posterior of (θ, λ) is asymptotically normal,
the θ-reference prior may be derived in terms of the corresponding
Fisher's information matrix, F(θ, λ).
  F(θ, λ) = ( F_θθ(θ, λ)  F_θλ(θ, λ) ; F_θλ(θ, λ)  F_λλ(θ, λ) ),   S(θ, λ) = F^(−1)(θ, λ)
The θ-reference prior is π*_θ(θ, λ) = π*(λ | θ) π*(θ), where
  π*(λ | θ) ∝ F_λλ^(1/2)(θ, λ),  λ ∈ Λ,  and, if π*(λ | θ) is proper,
  π*(θ) ∝ exp{ ∫_Λ π*(λ | θ) log[S_θθ^(−1/2)(θ, λ)] dλ },  θ ∈ Θ.
If π*(λ | θ) is not proper, integrations are performed within an approximating
sequence {Λ_i} to obtain a sequence {π*_i(λ | θ) π*_i(θ)}, and the
θ-reference prior π*_θ(θ, λ) is defined as its intrinsic limit.
Even if π*(λ | θ) is improper, if θ and λ are variation independent and
  S_θθ^(−1/2)(θ, λ) ∝ f_θ(θ) g_θ(λ),  and  F_λλ^(1/2)(θ, λ) ∝ f_λ(θ) g_λ(λ),
then π*_θ(θ, λ) = f_θ(θ) g_λ(λ).
Examples: Inference on normal parameters
The information matrix for the normal model N(x | μ, σ) is
  F(μ, σ) = ( σ^(−2)  0 ; 0  2σ^(−2) ),   S(μ, σ) = F^(−1)(μ, σ) = ( σ²  0 ; 0  σ²/2 );
Since μ and σ are variation independent, and both F_σσ and S_μμ factorize,
  π*(σ | μ) ∝ F_σσ^(1/2) ∝ σ^(−1),   π*(μ) ∝ S_μμ^(−1/2) ∝ 1.
The μ-reference prior, as anticipated, is
  π*_μ(μ, σ) = π*(σ | μ) π*(μ) = σ^(−1),
i.e., uniform on both μ and log σ.
Since F(μ, σ) is diagonal, the σ-reference prior is
  π*_σ(μ, σ) = π*(μ | σ) π*(σ) = σ^(−1),
the same as π*_μ(μ, σ) = π*_σ(μ, σ).
In fact, it may be shown that, for location-scale models,
  p(x | μ, σ) = (1/σ) f((x − μ)/σ),
the reference prior for the location and scale parameters is always
  π*_μ(μ, σ) = π*_σ(μ, σ) = σ^(−1).
Within any given model p(x | θ) the φ-reference prior π*_φ(θ) maximizes
the missing information about φ = φ(θ) and, in multiparameter problems,
that prior may change with the quantity of interest φ.
For instance, within a normal N(x | μ, σ) model, let the standardized
mean φ = μ/σ be the quantity of interest.
Fisher's information matrix in terms of the parameters φ and σ is
F(φ, σ) = J^t F(μ, σ) J, where J = (∂(μ, σ)/∂(φ, σ)) is the Jacobian
of the inverse transformation; this yields
  F(φ, σ) = ( 1  φ/σ ; φ/σ  (2 + φ²)/σ² ),
with F_σσ^(1/2) ∝ σ^(−1), and S_φφ^(−1/2) ∝ (1 + φ²/2)^(−1/2).
The φ-reference prior is π*_φ(φ, σ) = (1 + φ²/2)^(−1/2) σ^(−1). Or, in the
original parametrization, π*_φ(μ, σ) = (1 + (μ/σ)²/2)^(−1/2) σ^(−2),
which is different from π*_μ(μ, σ) = π*_σ(μ, σ).
This prior is shown to lead to a reference posterior for φ with consistent
marginalization properties.
Many parameters
The reference algorithm generalizes to any number of parameters.
If the model is p(x | θ) = p(x | θ_1, ..., θ_m), a joint reference prior
  π*(φ_m | φ_{m−1}, ..., φ_1) × ... × π*(φ_2 | φ_1) × π*(φ_1)
may sequentially be obtained for each ordered parametrization {φ_1(θ), ..., φ_m(θ)}.
Reference priors are invariant under reparametrization of the φ_i(θ)'s.
The choice of the ordered parametrization {φ_1, ..., φ_m} describes the
particular prior required, namely that which sequentially
maximizes the missing information about each of the φ_i's,
conditional on {φ_1, ..., φ_{i−1}}, for i = m, m−1, ..., 1.
Example: Stein's paradox. Data random from an m-variate normal
N_m(x | μ, I). The reference prior function for any permutation of
the μ_i's is uniform, and leads to appropriate posterior distributions for
any of the μ_i's, but cannot be used if the quantity of interest is θ = Σ_i μ_i²,
the distance of μ to the origin.
The reference prior for the ordered parametrization {θ, λ_1, ..., λ_{m−1}} produces,
for any choice of the nuisance parameters λ_i, an appropriate reference posterior for θ.
2.3. Inference Summaries
Summarizing the posterior distribution
The Bayesian final outcome of a problem of inference about any unknown
quantity θ is precisely the posterior density π(θ | x, C).
Bayesian inference may be described as the problem of stating a probability
distribution for the quantity of interest encapsulating all available
information about its value.
In one or two dimensions, a graph of the posterior probability density
of the quantity of interest conveys an intuitive summary of the main
conclusions. This is greatly appreciated by users, and is an important
asset of Bayesian methods.
However, graphical methods do not easily extend to more than two
dimensions, and elementary quantitative conclusions are often required.
The simplest forms to summarize the information contained in the posterior
distribution are closely related to the conventional concepts of point
estimation and interval estimation.
Point Estimation: Posterior mean and posterior mode
It is often required to provide point estimates of relevant quantities.
Bayesian point estimation is best described as a decision problem where
one has to choose a particular value θ̃ as an approximate proxy for the
actual, unknown value of θ.
Intuitively, any location measure of the posterior density π(θ | x)
may be used as a point estimator. When they exist, either
  E[θ | x] = ∫_Θ θ π(θ | x) dθ  (posterior mean), or
  Mo[θ | x] = arg sup_θ π(θ | x)  (posterior mode)
are often regarded as natural choices.
Lack of invariance. Neither the posterior mean nor the posterior mode are
invariant under reparametrization. The point estimator φ̃ of a bijection
φ = φ(θ) of θ will generally not be equal to φ(θ̃).
In pure inferential applications, where one is requested to provide a
point estimate of the vector of interest without a specific application in
mind, it is difficult to justify a non-invariant solution.
Point Estimation: Posterior median
A summary of a multivariate density π(θ | x), where θ = {θ_1, ..., θ_k},
should contain summaries of:
(i) each of the marginal densities π(θ_i | x),
(ii) the densities π(φ | x) of other functions of interest φ = φ(θ).
In one-dimensional continuous problems the posterior median
is easily defined and computed as
  Me[θ | x] = q  such that  ∫_{θ ≤ q} π(θ | x) dθ = 1/2.
The one-dimensional posterior median has many attractive properties:
(i) it is invariant under bijections, Me[φ(θ) | x] = φ(Me[θ | x]);
(ii) it exists and is unique under very wide conditions;
(iii) it is rather robust under moderate perturbations of the data.
The posterior median is often considered to be the best automatic
Bayesian point estimator in one-dimensional continuous problems.
The posterior median is not easily extended to a multivariate setting.
The natural extension of its definition produces surfaces (not points).
General invariant multivariate definitions of point estimators are possible
using Bayesian decision theory.
General Credible Regions
To describe π(θ | x) it is often convenient to quote regions Θ_p ⊂ Θ of
given probability content p under π(θ | x). This is the intuitive basis of
graphical representations like boxplots.
A subset Θ_p of the parameter space such that
  ∫_{Θ_p} π(θ | x) dθ = p,  so that  Pr(θ ∈ Θ_p | x) = p,
is a posterior p-credible region for θ.
A credible region is invariant under reparametrization:
If Θ_p is p-credible for θ, φ(Θ_p) is a p-credible region for φ = φ(θ).
For any given p there are generally infinitely many credible regions.
Credible regions may be selected to have minimum size (length, area,
volume), resulting in highest probability density (HPD) regions,
where all points in the region have larger probability density
than all points outside.
HPD regions are not invariant: the image φ(Θ_p) of an HPD region Θ_p
will be a credible region for φ, but will not generally be HPD.
There is no reason to restrict attention to HPD credible regions.
Credible Intervals
In one-dimensional continuous problems, posterior quantiles are often
used to derive credible intervals.
If θ_q = Q_q[θ | x] is the q-quantile of the posterior distribution of θ,
the interval Θ_p = {θ; θ ≤ θ_p} is a p-credible region,
and it is invariant under reparametrization.
Equal-tailed p-credible intervals of the form
  Θ_p = {θ; θ_{(1−p)/2} ≤ θ ≤ θ_{(1+p)/2}}
are typically unique, and they are invariant under reparametrization.
Example: Model N(x | μ, σ). Credible intervals for the normal mean.
The reference posterior for μ is π(μ | x) = St(μ | x̄, s²/(n − 1), n − 1).
Hence the reference posterior distribution of τ = √(n − 1) (μ − x̄)/s,
a function of μ, is π(τ | x̄, s, n) = St(τ | 0, 1, n − 1).
Thus, the equal-tailed p-credible intervals for μ are
  {μ;  x̄ ∓ q_{n−1}^{(1−p)/2} s/√(n − 1)},
where q_{n−1}^{(1−p)/2} is the (1 − p)/2 quantile of a standard Student density
with n − 1 degrees of freedom.
Calibration
In the normal example above, the expression t = √(n − 1) (μ − x̄)/s
may also be analyzed, for fixed μ, as a function of the data.
The fact that the sampling distribution of the statistic t = t(x̄, s | μ, n)
is also a standard Student, p(t | μ, n) = St(t | 0, 1, n − 1), with the same
degrees of freedom implies that, in this example, objective Bayesian
credible intervals are also exact frequentist confidence intervals.
Exact numerical agreement between Bayesian credible intervals and
frequentist confidence intervals is the exception, not the norm.
For large samples, convergence to normality implies approximate
numerical agreement. This provides a frequentist calibration to
objective Bayesian methods.
Exact numerical agreement is obviously impossible when the data are
discrete: precise (non-randomized) frequentist confidence intervals do
not exist in that case for most confidence levels.
The computation of Bayesian credible regions for continuous parameters
is however precisely the same whether the data are discrete or continuous.
2.4. Prediction
Posterior predictive distributions
Data x = {x_1, ..., x_n}, x_i ∈ X, a set of homogeneous observations.
Desired to predict the value of a future observation x ∈ X generated by
the same mechanism.
From the foundations arguments the solution must be a probability
distribution p(x | x, K) describing the uncertainty on the value that x will
take, given data x and any other available knowledge K. This is called
the (posterior) predictive density of x.
To derive p(x | x, K) it is necessary to specify the precise sense in which
the x_i's are judged to be homogeneous.
It is often directly assumed that the data x = {x_1, ..., x_n} consist of a
random sample from some specified model, {p(x | θ), x ∈ X, θ ∈ Θ},
so that p(x | θ) = p(x_1, ..., x_n | θ) = ∏_{j=1}^n p(x_j | θ).
If this is the case, the solution to the prediction problem is immediate
once a prior distribution π(θ) has been specified.
Posterior predictive distributions from random samples
Let x = {x_1, ..., x_n}, x_i ∈ X, be a random sample of size n from the
statistical model {p(x | θ), x ∈ X, θ ∈ Θ},
and let π(θ) be a prior distribution describing available knowledge (if any)
about the value of the parameter vector θ.
The posterior predictive distribution is
  p(x | x) = p(x | x_1, ..., x_n) = ∫_Θ p(x | θ) π(θ | x) dθ
This encapsulates all available information about the outcome of any
future observation x ∈ X from the same model.
To prove this, make use of the total probability theorem, to have
  p(x | x) = ∫_Θ p(x | θ, x) π(θ | x) dθ
and notice that the new observation x has been assumed to be conditionally
independent of the observed data x, so that p(x | θ, x) = p(x | θ).
The observable values x ∈ X may be either discrete or continuous
random quantities. In the discrete case the predictive distribution will be
described by its probability mass function; in the continuous case, by its
probability density function. Both are denoted p(x | x).
Prediction in a Poisson process
Data x = {r_1, ..., r_n} random from Pn(r | λ). The reference posterior
density of λ is π*(λ | x) = Ga(λ | t + 1/2, n), where t = Σ_j r_j.
The (reference) posterior predictive distribution is
  p(r | x) = Pr[r | t, n] = ∫_0^∞ Pn(r | λ) Ga(λ | t + 1/2, n) dλ
           = [n^(t+1/2)/Γ(t + 1/2)] (1/r!) Γ(r + t + 1/2)/(1 + n)^(r+t+1/2),
an example of a Poisson-Gamma probability mass function.
For example, no flash floods have been recorded at a particular location
in 10 consecutive years. Local authorities are interested in forecasting
possible future flash floods. Using a Poisson model, and assuming that
meteorological conditions remain similar, the probabilities that r flash
floods will occur next year at that location are given by the Poisson-Gamma
mass function above, with t = 0 and n = 10. This yields
  Pr[0 | t, n] = 0.953,  Pr[1 | t, n] = 0.043,  Pr[2 | t, n] = 0.003.
Many other situations may be described with the same model.
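A sketch of the flash-flood computation above; the log-scale evaluation via gammaln is an implementation choice.

```python
# Sketch of the flash-flood example: the Poisson-Gamma predictive mass function
# Pr[r | t, n] with t = 0 and n = 10.
import numpy as np
from scipy.special import gammaln

def poisson_gamma(r, t, n):
    a = t + 0.5
    log_p = (a * np.log(n) - gammaln(a) - gammaln(r + 1)
             + gammaln(r + a) - (r + a) * np.log(1 + n))
    return np.exp(log_p)

print([round(poisson_gamma(r, t=0, n=10), 3) for r in range(3)])   # [0.953, 0.043, 0.003]
```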
Prediction of Normal measurements
Data x = {x_1, ..., x_n} random from N(x | μ, σ²). Reference prior
π*(μ, σ) = σ^(−1) or, in terms of the precision λ = σ^(−2), π*(μ, λ) = λ^(−1).
The joint reference posterior, π*(μ, λ | x) ∝ p(x | μ, λ) π*(μ, λ), is
  π*(μ, λ | x) = N(μ | x̄, (nλ)^(−1)) Ga(λ | (n − 1)/2, ns²/2).
The predictive distribution is
  π*(x | x) = ∫_0^∞ ∫_ℝ N(x | μ, λ^(−1)) π*(μ, λ | x) dμ dλ
            ∝ {(1 + n)s² + (x̄ − x)²}^(−n/2),
which is a kernel of the Student density
  p(x | x) = St(x | x̄, s² (n + 1)/(n − 1), n − 1).
Example. Production of safety belts. Observed breaking strengths of 10
randomly chosen webbings have mean x̄ = 28.011 kN and standard
deviation s = 0.443 kN. Specification requires x > 26 kN.
Reference posterior predictive p(x | x) = St(x | 28.011, 0.240, 9).
  Pr(x > 26 | x) = ∫_26^∞ St(x | 28.011, 0.240, 9) dx = 0.9987.
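A sketch of the safety-belt example above, evaluating the posterior predictive Student density with scipy.stats.

```python
# Sketch of the safety-belt example: posterior predictive
# St(x | xbar, s^2 (n+1)/(n-1), n-1) and Pr(x > 26 | data).
from scipy import stats

n, xbar, s = 10, 28.011, 0.443
scale2 = s ** 2 * (n + 1) / (n - 1)                 # approximately 0.240
pred = stats.t(df=n - 1, loc=xbar, scale=scale2 ** 0.5)
print(round(scale2, 3), round(pred.sf(26.0), 4))    # 0.24 and approximately 0.9987
```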
Regression
Often there is additional information from relevant covariates. Data structure:
a set of pairs x = {(y_1, v_1), ..., (y_n, v_n)}; y_i, v_i both vectors. Given a
new observation, with v known, predict the corresponding value of y.
Formally, compute p{y | v, (y_1, v_1), ..., (y_n, v_n)}.
Need a model {p(y | v, θ), y ∈ Y, θ ∈ Θ} which makes precise the
probabilistic relationship between y and v. The simplest option assumes
a linear dependency of the form p(y | v, θ) = N(y | V β, Σ), but far
more complex structures are common in applications.
Univariate linear regression on k covariates. y ∈ ℝ, v = {v_1, ..., v_k}.
p(y | v, β, σ) = N(y | vβ, σ²), β = {β_1, ..., β_k}^t. Data x = {y, V},
y = {y_1, ..., y_n}^t, and V is the n × k matrix with the v_i's as rows.
p(y | V, β, σ) = N_n(y | Vβ, σ² I_n); reference prior π*(β, σ) = σ^(−1).
The predictive posterior is the Student density
  p(y | v, y, V) = St(y | v β̂, f(v, V) ns²/(n − k), n − k),
  β̂ = (V^t V)^(−1) V^t y,   ns² = (y − V β̂)^t (y − V β̂),
  f(v, V) = 1 + v (V^t V)^(−1) v^t.
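A minimal sketch of these regression formulas on synthetic data (the design matrix, coefficients and noise level below are illustrative assumptions, not from the slides).

```python
# Sketch of the univariate linear regression predictive on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 30, 2
V = np.c_[np.ones(n), rng.uniform(0, 5, n)]          # constant term plus one covariate
y = V @ np.array([10.0, 2.5]) + rng.normal(0, 1.0, n)

beta_hat = np.linalg.solve(V.T @ V, V.T @ y)         # (V'V)^{-1} V'y
ns2 = float((y - V @ beta_hat) @ (y - V @ beta_hat))

def predictive(v):
    v = np.asarray(v, dtype=float)
    f = 1.0 + v @ np.linalg.solve(V.T @ V, v)        # f(v, V) = 1 + v (V'V)^{-1} v'
    scale2 = f * ns2 / (n - k)
    return stats.t(df=n - k, loc=float(v @ beta_hat), scale=scale2 ** 0.5)

p = predictive([1.0, 3.0])                           # predictive density of y at v = (1, 3)
print(p.mean(), p.interval(0.95))
```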
Example: Simple linear regression
One covariate and a constant term; p(y | v, β, σ) = N(y | β_1 + β_2 v, σ²).
The sufficient statistic is t = {v̄, ȳ, s_vy, s_vv}, with n v̄ = Σ v_j, n ȳ = Σ y_j,
  s_yv = Σ v_j y_j / n − v̄ ȳ,   s_vv = Σ v_j² / n − v̄².
  p(y | v, t) = St(y | β̂_1 + β̂_2 v, f(v, t) ns²/(n − 2), n − 2)
  β̂_1 = ȳ − β̂_2 v̄,   β̂_2 = s_vy/s_vv,   ns² = Σ_{j=1}^n (y_j − β̂_1 − β̂_2 v_j)²
  f(v, t) = 1 + (1/n) [(v − v̄)² + s_vv]/s_vv
Pollution density (μgr/m³), and wind speed from source (m/s):
  y_j:  1212  836  850  446  1164  601  1074  284  352  1064  712  976
  v_j:  4.8   3.3  3.1  1.7  4.7   2.1  3.9   0.9  1.4  4.3   2.9  3.4
  Pr[y > 50 | v = 0, x] = 0.66
[Figures: posterior predictive density p(y | v = 0, x), and predictive bands of y against v.]
2.5. Hierarchical Models
Exchangeability
Random quantities are often homogeneous in the precise sense that
only their values matter, not the order in which they appear. Formally,
this is captured by the notion of exchangeability. The set of random vectors
{x_1, ..., x_n} is exchangeable if their joint distribution is invariant
under permutations. An infinite sequence {x_j} of random vectors is
exchangeable if all its finite subsequences are exchangeable.
Any random sample from any model is exchangeable. The representation
theorem establishes that if observations {x_1, ..., x_n} are exchangeable,
they are a random sample from some model {p(x | θ), θ ∈ Θ}, labeled
by a parameter vector θ, defined as the limit (as n → ∞) of some function
of the x_i's. Information about θ in prevailing conditions C is necessarily
described by some probability distribution π(θ | C).
Formally, the joint density of any finite set of exchangeable observations
{x_1, ..., x_n} has an integral representation of the form
  p(x_1, ..., x_n | C) = ∫_Θ ∏_{i=1}^n p(x_i | θ) π(θ | C) dθ.
Structured Models
Complex data structures may often be usefully described by partial
exchangeability assumptions.
Example: Public opinion. Sample k different regions in the country.
Sample n_i citizens in region i and record whether or not (y_ij = 1 or
y_ij = 0) citizen j would vote A. Assuming exchangeable citizens
within each region implies
  p(y_i1, ..., y_in_i) = ∏_{j=1}^{n_i} p(y_ij | θ_i) = θ_i^(r_i) (1 − θ_i)^(n_i − r_i),
where θ_i is the (unknown) proportion of citizens in region i voting A and
r_i = Σ_j y_ij the number of citizens voting A in region i.
Assuming regions exchangeable within the country similarly leads to
  p(θ_1, ..., θ_k) = ∏_{i=1}^k π(θ_i | φ)
for some probability distribution π(θ | φ) describing the political variation
within the regions. Often choose π(θ | φ) = Be(θ | α, β).
The resulting two-stage hierarchical Binomial-Beta model,
  x = {y_1, ..., y_k},  y_i = {y_i1, ..., y_in_i},  random from Bi(y | θ_i),
  {θ_1, ..., θ_k},  random from Be(θ | α, β),
provides a far richer model than (unrealistic) simple binomial sampling.
Example: Biological response. Sample k different animals of the same
species in a specific environment. Observe animal i on n_i occasions and
record its responses {y_i1, ..., y_in_i} to prevailing conditions. Assuming
exchangeable observations within each animal implies
  p(y_i1, ..., y_in_i) = ∏_{j=1}^{n_i} p(y_ij | θ_i).
Often choose p(y_ij | θ_i) = N_r(y | θ_i, Σ_1), where r is the number of
biological responses measured.
Assuming exchangeable animals within the environment leads to
  p(θ_1, ..., θ_k) = ∏_{i=1}^k π(θ_i | φ)
for some probability distribution π(θ | φ) describing the biological variation
within the species. Often choose π(θ | φ) = N_r(θ | μ_0, Σ_2).
The two-stage hierarchical multivariate Normal-Normal model,
  x = {y_1, ..., y_k},  y_i = {y_i1, ..., y_in_i},  random from N_r(y | θ_i, Σ_1),
  {θ_1, ..., θ_k},  random from N_r(θ | μ_0, Σ_2),
provides a far richer model than (unrealistic) simple multivariate normal
sampling.
Finer subdivisions, e.g., subspecies within each species, similarly lead
to hierarchical models with more stages.
Bayesian analysis of hierarchical models
A two-stage hierarchical model has the general form
  x = {y_1, ..., y_k},  y_i = {z_i1, ..., z_in_i},
  y_i random sample of size n_i from p(z | θ_i), θ_i ∈ Θ,
  {θ_1, ..., θ_k}, random of size k from π(θ | φ), φ ∈ Φ.
Specify a prior distribution (or a reference prior function)
π(φ) for the hyperparameter vector φ.
Use standard probability theory to compute all desired
posterior distributions:
  π(φ | x) for inferences about the hyperparameters,
  π(θ_i | x) for inferences about the parameters,
  π(ψ | x) for inferences about any function ψ = ψ(θ_1, ..., θ_k)
    of the parameters,
  π(y | x) for predictions on future observations,
  π(t | x) for predictions on any function t = t(y_1, ..., y_m)
    of m future observations.
Markov chain Monte Carlo based software is available for the necessary
computations.
3. Decision Making
3.1 Structure of a Decision Problem
Alternatives, consequences, relevant events
A decision problem exists if there are two or more possible courses of action;
A is the class of possible actions.
For each a ∈ A, Θ_a is the set of relevant events, those which may affect the
result of choosing a.
Each pair {a, θ}, θ ∈ Θ_a, produces a consequence c(a, θ) ∈ C_a. In this
context, θ is often referred to as the parameter of interest.
The class of pairs {(Θ_a, C_a), a ∈ A} describes the structure of the
decision problem. Without loss of generality, it may be assumed that the
possible actions are mutually exclusive, for otherwise the appropriate
Cartesian product may be used.
In many problems the class of relevant events Θ_a is the same for all
a ∈ A. Even if this is not the case, a comprehensive parameter space
may be defined as the union of all the Θ_a.
Foundations of decision theory
Different sets of principles capture a minimum collection of logical rules
required for rational decision-making.
These are axioms with strong intuitive appeal.
Their basic structure consists of:
Transitivity of preferences:
  If a_1 > a_2 given C, and a_2 > a_3 given C,
  then a_1 > a_3 given C.
The sure-thing principle:
  If a_1 > a_2 given C and E, and a_1 > a_2 given C and not E,
  then a_1 > a_2 given C.
The existence of standard events:
  There are events of known plausibility.
  These may be used as a unit of measurement, and
  have the properties of a probability measure.
These axioms are not a description of human decision-making,
but a normative set of principles defining coherent decision-making.
Decision making
Many different axiom sets.
All lead basically to the same set of conclusions, namely:
The consequences of wrong actions should be evaluated in terms of a
real-valued loss function L(a, θ) which specifies, on a numerical scale,
their undesirability.
The uncertainty about the parameter of interest θ should be measured
with a probability distribution π(θ | C),
  π(θ | C) ≥ 0, θ ∈ Θ,   ∫_Θ π(θ | C) dθ = 1,
describing all available knowledge about its value, given the conditions C
under which the decision must be taken.
The relative undesirability of available actions a ∈ A is measured by
their expected loss
  L[a | C] = ∫_Θ L(a, θ) π(θ | C) dθ,  a ∈ A.
Intrinsic loss functions: Intrinsic discrepancy
The loss function is typically context dependent.
In mathematical statistics, intrinsic loss functions are used to measure
the distance between statistical models.
They measure the divergence between the models {p_1(x | θ_1), x ∈ X}
and {p_2(x | θ_2), x ∈ X} as some non-negative function of the form
L[p_1(x | θ_1), p_2(x | θ_2)] which is zero if (and only if) the two
distributions are equal almost everywhere.
The intrinsic discrepancy between two statistical models is simply the
intrinsic discrepancy between their sampling distributions, i.e.,
  δ{p_1, p_2} = δ{θ_1, θ_2}
  = min { ∫_X p_1(x | θ_1) log [p_1(x | θ_1)/p_2(x | θ_2)] dx,
          ∫_X p_2(x | θ_2) log [p_2(x | θ_2)/p_1(x | θ_1)] dx }
The intrinsic discrepancy is an information-based, symmetric, invariant
intrinsic loss.
3.2 Formal Point Estimation
The decision problem
Given the statistical model {p(x | ω), x ∈ X, ω ∈ Ω}, quantity of interest
θ = θ(ω) ∈ Θ. A point estimator θ̃ = θ̃(x) of θ is some function of
the data to be regarded as a proxy for the unknown value of θ.
To choose a point estimate for θ is a decision problem, where the action
space is A = Θ.
Given a loss function L(θ̃, θ), the posterior expected loss is
  L[θ̃ | x] = ∫_Θ L(θ̃, θ) π(θ | x) dθ.
The corresponding Bayes estimator is that function of the data,
  θ*(x) = arg inf_{θ̃ ∈ Θ} L[θ̃ | x],
which minimizes that expectation.
Conventional estimators
The posterior mean and the posterior mode are the Bayes estimators
which respectively correspond to a quadratic and a zero-one loss function.
If L(θ̃, θ) = (θ̃ − θ)^t (θ̃ − θ), then, assuming that the mean exists, the
Bayes estimator is the posterior mean E[θ | x].
If the loss function is a zero-one function, so that L(θ̃, θ) = 0 if θ̃
belongs to a ball of radius ε centered at θ and L(θ̃, θ) = 1 otherwise,
then, assuming that a unique mode exists, the Bayes estimator converges
to the posterior mode Mo[θ | x] as the ball radius ε tends to zero.
If θ is univariate and continuous, and the loss function is linear,
  L(θ̃, θ) = c_1 (θ̃ − θ)  if θ̃ ≥ θ,
  L(θ̃, θ) = c_2 (θ − θ̃)  if θ̃ < θ,
then the Bayes estimator is the posterior quantile of order c_2/(c_1 + c_2),
so that Pr[θ < θ*] = c_2/(c_1 + c_2).
In particular, if c_1 = c_2, the Bayes estimator is the posterior median.
Any value may be optimal: it all depends on the loss function.
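A sketch (with an illustrative Beta posterior, not taken from the slides) showing how the three conventional losses lead to different point estimates of a binomial parameter.

```python
# Sketch: Bayes estimates under different loss functions for the posterior
# Be(theta | r + 1/2, n - r + 1/2) with n = 10, r = 2 (illustrative numbers).
from scipy import stats

post = stats.beta(2.5, 8.5)
post_mean   = post.mean()                  # quadratic loss
post_median = post.median()                # linear loss with c1 = c2
post_mode   = (2.5 - 1) / (2.5 + 8.5 - 2)  # zero-one loss (Beta mode, alpha, beta > 1)
print(round(post_mean, 3), round(post_median, 3), round(post_mode, 3))
```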
Intrinsic estimation
Given the statistical model {p(x | θ), x ∈ X, θ ∈ Θ}, the intrinsic
discrepancy δ(θ_1, θ_2) between two parameter values θ_1 and θ_2 is the
intrinsic discrepancy δ{p(x | θ_1), p(x | θ_2)} between the corresponding
probability models.
This is symmetric, non-negative (and zero iff θ_1 = θ_2), invariant under
reparametrization and invariant under bijections of x.
The intrinsic estimator is the reference Bayes estimator which
corresponds to the loss defined by the intrinsic discrepancy:
The expected loss with respect to the reference posterior distribution
  d(θ̃ | x) = ∫_Θ δ{θ̃, θ} π*(θ | x) dθ
is an objective measure, in information units, of the expected discrepancy
between the model p(x | θ̃) and the true (unknown) model p(x | θ).
The intrinsic estimator θ*(x) is the value which minimizes such
expected discrepancy,
  θ* = arg inf_{θ̃ ∈ Θ} d(θ̃ | x).
Example: Intrinsic estimation of the Binomial parameter
Data x = {x_1, ..., x_n}, random from p(x | θ) = θ^x (1 − θ)^(1−x), r = Σ x_j.
Intrinsic discrepancy δ(θ̃, θ) = n min{k(θ̃ | θ), k(θ | θ̃)},
  k(θ_1 | θ_2) = θ_2 log (θ_2/θ_1) + (1 − θ_2) log [(1 − θ_2)/(1 − θ_1)],
  π*(θ) = Be(θ | 1/2, 1/2),  π*(θ | r, n) = Be(θ | r + 1/2, n − r + 1/2).
Expected reference discrepancy
  d(θ̃, r, n) = ∫_0^1 δ(θ̃, θ) π*(θ | r, n) dθ
Intrinsic estimator
  θ*(r, n) = arg min_{0<θ̃<1} d(θ̃, r, n)
From invariance, for any bijection φ = φ(θ), φ* = φ(θ*).
Analytic approximation:  θ*(r, n) ≈ (r + 1/3)/(n + 2/3),  n > 2
n = 12, r = 0:  θ*(0, 12) = 0.026;  Me[θ | x] = 0.018, E[θ | x] = 0.038
[Figures: θ*(0, n) as a function of n, and the reference posterior π*(θ | 0, 12).]
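A sketch of the numerical minimization behind θ*(r, n), using Monte Carlo draws from the reference posterior to evaluate d(θ̃ | r, n); the sample size and optimization bounds are implementation choices.

```python
# Sketch: Monte Carlo evaluation of d(theta~ | r, n) under the reference posterior
# Be(theta | r + 1/2, n - r + 1/2), minimized numerically.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def k(t1, t2):   # KL divergence of Bernoulli(t1) from Bernoulli(t2), per observation
    return t2 * np.log(t2 / t1) + (1 - t2) * np.log((1 - t2) / (1 - t1))

r, n = 0, 12
theta = stats.beta(r + 0.5, n - r + 0.5).rvs(size=200_000, random_state=1)

def d(t_tilde):  # expected intrinsic discrepancy between theta~ and the true theta
    return np.mean(n * np.minimum(k(t_tilde, theta), k(theta, t_tilde)))

res = minimize_scalar(d, bounds=(1e-3, 0.5), method="bounded")
print(round(res.x, 3), round((r + 1/3) / (n + 2/3), 3))    # ~0.026 and 0.026
```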
3.3 Hypothesis Testing
Precise hypothesis testing as a decision problem
The posterior π(θ | x) conveys intuitive information on the values of θ
which are compatible with the observed data x: those with a relatively
high probability density.
Often a particular value θ_0 is suggested for special consideration:
  Because θ = θ_0 would greatly simplify the model;
  Because there are context-specific arguments suggesting that θ = θ_0.
More generally, one may analyze the restriction of the parameter space Θ
to a subset Θ_0 which may contain more than one value.
Formally, testing the hypothesis H_0 ≡ {θ = θ_0} is a decision problem
with just two possible actions:
  a_0: to accept H_0 and work with p(x | θ_0);
  a_1: to reject H_0 and keep the general model p(x | θ).
To proceed, a loss function L(a_i, θ), θ ∈ Θ, describing the possible
consequences of both actions, must be specified.
Structure of the loss function
Given data x, the optimal action is to reject H_0 (action a_1) iff the expected
posterior loss of accepting, ∫_Θ L(a_0, θ) π(θ | x) dθ, is larger than the
expected posterior loss of rejecting, ∫_Θ L(a_1, θ) π(θ | x) dθ, i.e., iff
  ∫_Θ [L(a_0, θ) − L(a_1, θ)] π(θ | x) dθ = ∫_Θ ΔL(θ) π(θ | x) dθ > 0.
Therefore, only the loss difference ΔL(θ) = L(a_0, θ) − L(a_1, θ), which
measures the advantage of rejecting H_0 as a function of θ, has to be
specified: the hypothesis should be rejected whenever the expected
advantage of rejecting is positive.
The advantage ΔL(θ) of rejecting H_0 as a function of θ should be of
the form ΔL(θ) = l(θ_0, θ) − l*, for some l* > 0, where
  l(θ_0, θ) measures the discrepancy between p(x | θ_0) and p(x | θ), and
  l* is a positive utility constant which measures the advantage of working
  with the simpler model when it is true.
The Bayes criterion will then be: Reject H_0 if (and only if)
  ∫_Θ l(θ_0, θ) π(θ | x) dθ > l*,
that is, if (and only if) the expected discrepancy between p(x | θ_0) and
p(x | θ) is too large.
Bayesian Reference Criterion
A good choice for the function l(θ_0, θ) is the intrinsic discrepancy,
  δ(θ_0, θ) = min {k(θ_0 | θ), k(θ | θ_0)},
where k(θ_0 | θ) = ∫_X p(x | θ) log{p(x | θ)/p(x | θ_0)} dx.
If x = {x_1, ..., x_n} ∈ X^n is a random sample from p(x | θ), then
  k(θ_0 | θ) = n ∫_X p(x | θ) log [p(x | θ)/p(x | θ_0)] dx.
For objective results, exclusively based on model assumptions and data,
the reference posterior distribution π*(θ | x) should be used.
Hence, reject if (and only if) the expected reference posterior intrinsic
discrepancy d(θ_0 | x) is too large,
  d(θ_0 | x) = ∫_Θ δ(θ_0, θ) π*(θ | x) dθ > d*,  for some d* > 0.
This is the Bayesian reference criterion (BRC).
The reference test statistic d(θ_0 | x) is nonnegative, it is invariant both
under reparametrization and under sufficient transformation of the data,
and it is a measure, in natural information units (nits), of the expected
discrepancy between p(x | θ_0) and the true model.
Calibration of the BRC
The reference test statistic d(θ_0 | x) is the posterior expectation of the
intrinsic discrepancy between p(x | θ_0) and p(x | θ). Hence,
A reference test statistic value d(θ_0 | x) of, say, log(10) ≈ 2.303 nits
implies that, given data x, the average value of the likelihood ratio
against the hypothesis, p(x | θ)/p(x | θ_0), is expected to be about 10,
suggesting some mild evidence against θ_0.
Similarly, a value d(θ_0 | x) of log(100) ≈ 4.605 indicates an average
value of the likelihood ratio against θ_0 of about 100, indicating rather
strong evidence against the hypothesis, and log(1000) ≈ 6.908, a rather
conclusive likelihood ratio against the hypothesis of about 1000.
As expected, there are strong connections between the BRC criterion for
precise hypothesis testing and intrinsic estimation:
The intrinsic estimator is the value of θ which minimizes the reference
test statistic:  θ* = arg inf_θ d(θ | x).
The regions defined by {θ; d(θ | x) ≤ d*} are invariant reference
posterior q(d*)-credible regions for θ. For regular problems and large
samples, q(log(10)) ≈ 0.95 and q(log(100)) ≈ 0.995.
A canonical example: Testing a value for the Normal mean
In the simplest case where the variance σ² is known,
  δ(μ_0, μ) = n(μ − μ_0)²/(2σ²),   π*(μ | x) = N(μ | x̄, σ²/n),
  d(μ_0 | x) = (1/2)(1 + z²),   z = (x̄ − μ_0)/(σ/√n).
Thus rejecting μ = μ_0 if d(μ_0 | x) > d* is equivalent to rejecting if
|z| > √(2d* − 1) and, hence, to a conventional two-sided frequentist test
with significance level α = 2(1 − Φ(|z|)).

  d*            |z|       α
  log(10)       1.8987    0.0576
  log(100)      2.8654    0.0042
  log(1000)     3.5799    0.0003

The expected value of d(μ_0 | x) if the hypothesis is true is
  ∫_ℝ (1/2)(1 + z²) N(z | 0, 1) dz = 1.
[Figure: d(μ_0 | x) = (1 + z²)/2 as a function of z.]
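A sketch reproducing the calibration table above from the relations |z| = √(2d* − 1) and α = 2(1 − Φ(|z|)).

```python
# Sketch of the calibration table: |z| thresholds and two-sided significance
# levels corresponding to d* = log(10), log(100), log(1000).
import numpy as np
from scipy import stats

for d_star in np.log([10, 100, 1000]):
    z = np.sqrt(2 * d_star - 1)                 # reject mu = mu0 when |z| exceeds this
    alpha = 2 * (1 - stats.norm.cdf(z))         # equivalent frequentist significance level
    print(round(d_star, 3), round(z, 4), round(alpha, 4))
```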
Fisher's tea-tasting lady
Data x = {x_1, ..., x_n}, random from p(x | θ) = θ^x (1 − θ)^(1−x), r = Σ x_j.
Intrinsic discrepancy δ(θ_0, θ) = n min{k(θ_0 | θ), k(θ | θ_0)},
  k(θ_1 | θ_2) = θ_2 log (θ_2/θ_1) + (1 − θ_2) log [(1 − θ_2)/(1 − θ_1)],
  π*(θ) = Be(θ | 1/2, 1/2).
Intrinsic test statistic
  d(θ_0 | x) = ∫_0^1 δ(θ_0, θ) π*(θ | x) dθ
Fisher's example: x = {10, 10} (r = 10 successes in n = 10 trials).
Test θ_0 = 1/2;  θ*(x) = 0.9686.

  d(θ̃ | x)     θ̃       Pr[θ̃ < θ | x]
  log[10]      0.711    0.99185
  log[100]     0.547    0.99957
  log[1000]    0.425    0.99997

Using d* = log[100] = 4.61, the value θ = 0.5 is rejected;
Pr[θ < 0.5 | x] = 0.00016.
[Figures: reference posterior π*(θ | 10, 10) and intrinsic test statistic d(θ_0 | 10, 10).]
Asymptotic approximation
For large samples, the posterior approaches N(θ | θ̂, n^(−1) F^(−1)(θ̂)), where
F(θ) is Fisher's information function. Changing variables, the
posterior distribution of φ = φ(θ) = ∫ F^(1/2)(θ) dθ = 2 arcsin √θ is
approximately normal N(φ | φ̂, n^(−1)). Since d(θ, x) is invariant,
  d(θ_0 | x) ≈ (1/2)[1 + n{φ(θ_0) − φ(θ̂)}²].
Testing for a majority (θ_0 = 0.5)
x = {720, 1500} (r = 720 in a sample of n = 1500),  θ*(x) = 0.4800

  d(θ̃ | x)     R = (θ̃_0, θ̃_1)     Pr[θ ∈ R | x]
  log[10]      (0.456, 0.505)      0.9427
  log[100]     (0.443, 0.517)      0.9959
  log[1000]    (0.434, 0.526)      0.9997

Very mild evidence against θ = 0.5:
  d(0.5 | 720, 1500) = 1.67
  Pr(θ < 0.5 | 720, 1500) = 0.9393
[Figures: reference posterior π*(θ | x) and intrinsic test statistic d(θ_0 | x).]
Basic References
Bernardo, J. M. (2003). Bayesian Statistics. Encyclopedia of Life Support Systems
  (EOLSS). Paris: UNESCO. (in press) Online: http://www.uv.es/bernardo/
Gelman, A., Carlin, J. B., Stern, H. and Rubin, D. B. (1995). Bayesian Data Analysis.
  London: Chapman and Hall.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Chichester: Wiley.
Bernardo, J. M. and Ramón, J. M. (1998). An introduction to Bayesian reference
  analysis: Inference on the ratio of multinomial parameters. The Statistician 47, 135.
Bernardo, J. M. and Rueda, R. (2002). Bayesian hypothesis testing: A reference
  approach. Internat. Statist. Rev. 70, 351-372.
Bernardo, J. M. and Juárez, M. (2003). Intrinsic estimation. Bayesian Statistics 7
  (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman,
  A. F. M. Smith and M. West, eds.). Oxford: University Press, 465-476.