
Week 1: Models, Estimation and Confidence
Richard McElreath
2011
Statistical Rethinking

Statistical Rituals

Angst
angst |äNG(k)st, aNG(k)st|
noun
a feeling of deep anxiety or dread, typically an unfocused one about the human
condition or the state of the world in general: adolescent angst.
informal a feeling of persistent worry about something trivial: my hair causes me angst.
Statistical Thinking

Frequentistown: Procedures, Tests, Falsification
Bayesville: Models, Probabilities, Support
Course Goals

Introduce model-based inferential statistics

Bayesian inference

Generalized linear multilevel models

Formal model comparison

Use R
Tyrannies and Frontiers

Three tyrannies:
(1) Popperism, (2) Fisher, (3) Gauss

Three frontiers:
(1) Bayes and MCMC,
(2) Generalized linear multilevel models,
(3) Formal model comparison
Bayesian inference and MCMC

Based on axioms of probability

Computationally difficult

Markov Chain Monte Carlo (MCMC) to the rescue

Used to be controversial
Laplace (1749-1827)
Multilevel models

Generalized linear models with multiple levels of stochasticity

Allow for more satisfying hypotheses

Conceptually tied to Bayes
Formal model comparison

Basic problem is overfitting

Solution: Penalize complex models

In practice: Criteria like AIC/DIC and BIC

All of these are Bayesian
R is the right tool

R Environment for Statistical Computing

Free

Not user friendly

The most powerful tool available
Course procedures

Notes on the course website

Lectures recorded each week, so you don't have to attend lecture

Homework: Assigned Thurs, due next Thurs

Should work together on homework

Final exam: Take-home, work alone


Goals this week

Meet Bayes' theorem

Estimate posterior

Simulate predictions
Ten tosses of the globe:
D = { W L W W L W L L W W }
Hypotheses and Models

Hypothesis: Conjectural explanation

Model: Mathematical version of an hypothesis; typically many models for any one hypothesis
Hypotheses and Models

Hypothesis: Proportion of water on Earth is 0.7.

Model: Tosses are independent, and each has chance 0.7 of being W. Sample size independent of outcomes.
Many models in play

A different model for each possible proportion of water on the planet

Call proportion of water p_W, a parameter

Each value of p_W a different model, M_pW

p_W = 0.23

p_W = 0.97
A measure of confidence

Want to quantify strength of evidential support for each model

Measure should be continuous

Only relative values will matter

Standardized confidence value will be a probability:
For now, the important point is that there are an infinite number of these models, one for each possible value p_W. Furthermore, there are many other assumptions about how the data arise that would change the model but leave the hypothesis unchanged.

Measuring confidence in models. So we have some models, now. What can we do with them? What we'd like to be able to do is say what the true proportion of water is. That is, we'd like to pick out one of these models and hold it up as true.

But clearly the evidence is not sufficient in this case, nor even in most cases, to select a single value of p_W to label TRUE, while labeling all of the others FALSE. In the sample at hand, for example, 6 out of 10 samples were W, indicating water. Only the most overconfident analyst would then conclude that 6/10 = 0.6 is the true value of p_W, while all others are demonstrably false. This is because it would be quite easy for some other value, say p_W = 0.7, to generate the same observation. Moreover, we know a priori that all models are false, so it makes little sense to talk about true models and false models.

Instead, we require some intermediate measure of confidence in each model, in its usefulness. A sensible measure that turns out to be very productive in practice is the probability that each model is the best approximation, compared only to the other candidates, to the true underlying process. For a model with p_W = 0.6, call this probability:

Pr(M_0.6 | D),

where M_0.6 is a particular model with p_W = 0.6, and D is the evidence, the data. Note that the model is definitely not only p_W = 0.6, because there are other assumptions in the model. Still, the only assumption we're going to vary among the models in this example is the value of p_W.

Read the expression Pr(M_0.6 | D) as the probability of M_0.6 conditional on D. This is intended as the probability that M_0.6 is the best approximating model, not the probability that M_0.6 is "true". In later chapters, the difference this makes will become important (Chapter 5 and up). For now, it'll be alright to proceed with just a mental bookmark, a brain itch to remind you that this business about models being approximations does sometimes matter.

Now, if we have a model with p_W = 0.6, then the confidence in this model is measured by Pr(M_0.6 | D). There are an infinite number of other models with other values of p_W. Each of them also has a measure of confidence, Pr(M_pW | D). The entire ensemble of all of these models comprises a probability distribution, because if you add up all of the individual model confidence measures, Pr(M_pW | D), they will add up to one.
A measure of confidence

Probability p_W is best model in set, conditional on the evidence, D.

Not probability p_W is true. All models are false.

These probabilities are posterior probabilities.
Figure 2.1. Example posterior probability distributions Pr(M_pW | D); each panel plots confidence against p_W. These distributions may be wide or narrow, reflecting differences in uncertainty across models. They may also be skewed or contain more than one peak.

Computing confidence in models. Okay, so how do you compute these probabilities, Pr(M_pW | D)? Well, since we are talking about a probability, you can just use the axioms of probability theory to define it. In this definition, there will be pieces that tell us what information we require in order to complete the analysis.
Bayes' theorem

How to compute Pr(M_pW | D)?

Use axioms of probability to find out

Thomas Bayes (1701-1761)
Bayes' theorem
Here's the quickest way to derive a formula for Pr(M_pW | D). You really should pay attention to this derivation, because working through it yourself will help you remember the formula for Pr(M_pW | D). And if you ever forget the exact formula, you can just find it again, because the derivation really is that simple. Knowing how simple it is to produce this formula will also serve to impress upon you just how basic it is to probability theory. It's not a fancy result at all, despite its importance.

The probability Pr(M_pW | D) is a conditional probability. It says that the probability of M_pW is conditioned on D. The definition of conditional probability is embedded in a common probability rule:

Pr(M_pW, D) = Pr(M_pW | D) Pr(D).    (2.1)

Read Pr(M_pW, D) as the probability of both M_pW and D. The above is true, no matter what M_pW and D reference. So it's just as true that the inverse probability, Pr(D | M_pW), is defined by:

Pr(M_pW, D) = Pr(D | M_pW) Pr(M_pW).    (2.2)

Now since both (2.1) and (2.2) contain Pr(M_pW, D), it must be true that:

Pr(M_pW | D) Pr(D) = Pr(M_pW, D) = Pr(D | M_pW) Pr(M_pW).    (2.3)

Taking out the middle man:

Pr(M_pW | D) Pr(D) = Pr(D | M_pW) Pr(M_pW).    (2.4)

All that remains is to use a little secondary school algebra and solve Equation 2.4 for Pr(M_pW | D). Doing this, you arrive at the formula for computing the degree of confidence in model M_pW, conditioned on the data, D:

Pr(M_pW | D) = Pr(D | M_pW) Pr(M_pW) / Pr(D).    (2.5)

Using the same formula for each model corresponding to each value of p_W, you arrive at the posterior distribution.

2.2.5. Bayes' theorem: A conditioning engine. The result in Equation 2.5 is usually referred to as Bayes' theorem. It tells us that in order to compute our measure of confidence in a model M_pW, conditioned on the evidence D, we need to compute:

(1) The likelihood, Pr(D | M_pW), which is the probability of observing D, conditioned on M_pW generating the data.

(2) The probability of the evidence, Pr(D), which is the probability of observing the data D, averaged over all models we'd like to consider.
(3) The prior probability, Pr(M), which is the degree of confidence in M ignoring the evidence.

So re-expressed in this way, Bayes' theorem says that:

Posterior = (Likelihood / Evidence) × Prior.

Again, the degree of confidence Pr(M_pW | D) is usually known as the posterior probability. Bayes' theorem is just a formula for computing this posterior probability, which provides one very useful way to compare models.

It is helpful to think of Bayes' theorem as an engine for conditioning one probability distribution, the prior, on some evidence. The result is a new probability distribution, the posterior. The posterior depends upon the prior, the evidence, and the assumptions embodied in how you calculate likelihoods of evidence. The engine requires the user, you, to provide these inputs, in order to do its work. It needs the prior, of course, and you feed this into it. The engine itself does not tell us what the prior should be, only that it must be a probability distribution. The engine also needs the likelihood of each model, in order to compute the likelihood, Pr(D | M), and the probability of the evidence, Pr(D). As you'll see a bit later, if you know how to calculate Pr(D | M), then you can at least define Pr(D) and perhaps calculate it too. These likelihoods contain lots of modeling assumptions, such as how sampling works and whether or not samples are independent of one another. This seems like a lot, I know, but there will be lots of examples to work with in this book. So hang on.

Given all of these things, the resulting posterior probability will be logically correct. But if you feed garbage into it, then it will provide only logical garbage out the other end. So like with all mathematics, its internal coherence is not going to save us from our external incoherence. As I argued in Chapter 1, statistical analysis about particular models lives in a small world, while science lives in a large world. We always work back and forth between these worlds, each teaching us about the other. I emphasize this point again and again, because it is tempting to think that the logical small world of probability thinking can somehow encompass all of scientific inference. This is not possible. Bayes' theorem, for example, is as deep and true a logical statement as you will find in any branch of mathematics. But the theorem itself tells us nothing about what it should be used for, nor how to build assumptions into calculations of likelihoods, nor any number of other considerations.
The conditioning engine

Prior → (Likelihood) → Posterior

Posterior is prior conditioned on evidence.
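As a concrete illustration of the engine (my own sketch, not the lecture's code), here it is run in R on just three candidate models for p_W, with an arbitrary prior and the globe data of 6 W in 10 tosses:

R code
# three candidate models and a (freely chosen, illustrative) prior over them
p_W   <- c(0.25, 0.50, 0.75)
prior <- c(0.2, 0.5, 0.3)

# likelihood of the evidence D = 6 W in 10 tosses under each model
likelihood <- dbinom(6, size = 10, prob = p_W)

# probability of the evidence: likelihoods averaged over models, weighted by the prior
pr_D <- sum(likelihood * prior)

# Bayes' theorem: condition the prior on the evidence
posterior <- likelihood * prior / pr_D
posterior        # a new distribution over the same three models
sum(posterior)   # 1

Feeding in a different prior changes only the first lines; the engine itself stays the same.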
Prior

Priors are essentially arbitrary, just like other parts of models

Three major strategies for choosing them:
1. Subjective
2. Objective
3. Tuning
Objective priors

Objective: Based on some argument, not on personal opinion

Previous data: posterior becomes prior

Prior → Posterior → Posterior → Posterior

D = { W L W W L W L L W W }
Figure: the posterior updated one toss at a time, for n = 0 through n = 10 observations from D = { W L W W L W L L W W }; each panel plots probability against probability of water.
One-at-a-time or all-at-once, you get the same posterior in the end (see the sketch below).

Every posterior is a prior.

Every prior is something's posterior.
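Here is a minimal R sketch of that claim (the 101-point grid and flat starting prior are illustrative assumptions, not part of the lecture): updating one toss at a time, with each posterior becoming the next prior, lands on the same posterior as conditioning on all ten tosses at once.

R code
tosses <- c("W","L","W","W","L","W","L","L","W","W")
p_W    <- seq(0, 1, length.out = 101)   # grid of candidate models
prior  <- rep(1, 101) / 101             # flat prior to start

# one at a time: each posterior is the prior for the next toss
posterior <- prior
for (toss in tosses) {
    like <- if (toss == "W") p_W else 1 - p_W   # likelihood of a single toss
    posterior <- like * posterior
    posterior <- posterior / sum(posterior)
}

# all at once: condition on 6 W in 10 tosses in one step
posterior_batch <- dbinom(6, size = 10, prob = p_W) * prior
posterior_batch <- posterior_batch / sum(posterior_batch)

all.equal(posterior, posterior_batch)   # TRUE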
Objective priors

Objective: Based on some argument, not on personal opinion

Non-informative priors (complicated)

Maximum entropy: partially informative

A Jeffreys prior, a famous non-informative prior
Likelihood

Probabilities* of data, assuming a model

Likelihood function is part of the model


* Not important right now, but really derivatives of cumulative probabilities
Likelihood
(1) Each toss of the globe is independent of all others, in the sense that the outcomes do not influence one another.
(2) The only possible results of a toss are "water" or "land."
(3) The probability each toss results in a "water" rather than a "land" is constant across tosses and equal to p_W.

These assumptions imply a familiar probability density, the binomial. Let n_W be the observed count of W's ("water") in the sample and n be the total number of tosses of the globe (the size of the sample). These two values summarize the data, at least as far as the assumptions above require. If we were to assume that tosses were not independent, but instead influenced one another in order, then we'd need to know the exact sequence of W's and L's. But sticking with the simple binomial model, the probability of obtaining n_W out of n tosses is given by:

Pr(n_W | n, p_W) = L(p_W | n, n_W) = [n! / (n_W! (n - n_W)!)] p_W^(n_W) (1 - p_W)^(n - n_W).

This expression is known as a likelihood function. Likelihood functions are often represented by notation like L(p_W | n, n_W), and most people speak of "the likelihood of the model" or "the likelihood of the parameter." But keep in mind that likelihoods are not probabilities of models or parameters, Pr(M | D). Instead, they are probabilities of data, conditioned on models, Pr(D | M). It is conventional to write L(M | D) or, using a fancy "L," to write the same thing, but this doesn't change the fact about what the probability really references: observations.

To use the likelihood formula, you have to assume values of the parameters, like p_W, and then plug in the data. For example, using the sample from earlier in the chapter, if we toss the globe 10 times and observe 6 waters and 4 lands, the formula becomes:

Pr(6 | 10, p_W) = [10! / (6! 4!)] p_W^6 (1 - p_W)^4.

Now suppose that p_W = 0.7, then we arrive at:

Pr(6 | 10, 0.7) = [10! / (6! 4!)] 0.7^6 (1 - 0.7)^4 ≈ 0.2.

Read the left side as "the probability of observing 6 W and 4 L, given that the probability of water is 0.7." The right side is just the binomial density, with the values plugged in. If you want to evaluate this formula, it is built in to R, so just execute:

R code 2.1
dbinom( 6 , size=10 , prob=0.7 )
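As a quick sanity check of the formula (my own addition, not from the text), the same number falls out of computing the binomial terms by hand with choose():

R code
choose(10, 6) * 0.7^6 * (1 - 0.7)^4   # 10!/(6!4!) * 0.7^6 * 0.3^4
# [1] 0.2001209
dbinom(6, size = 10, prob = 0.7)      # identical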
In globe tossing context:
n_W: D, number of W's observed
n: number of tosses
Likelihood
Focus on p_W = 0.7:
Figure: The probability of observing 6 waters and 4 lands in 10 samples from our globe (vertical axis), for every possible true proportion of water (horizontal axis).

[1] 0.2001209

Read that line of code as, "compute the probability of observing 6 W in 10 tosses, where each toss has a probability of 0.7 of being W."

One of the virtues of a statistical environment like R is that it makes it easy for you to calculate every likelihood over the range of p_W from 0 to 1, holding the data n_W, n constant. This code will plot them all:

R code 2.2
curve( dbinom( 6 , size=10 , prob=x ) , from=0 , to=1 )

I reproduce this plot in the figure above. The reader really should execute this code in R. Do this now. You don't have to understand the code yet. But you'll get a better feel for what we are calculating, if you type the above line of code into R.

What does the figure tell us? Every possible value of p_W implies a probability of the data, 6 W's and 4 L's. The height of the curve at each point is that probability, or likelihood. First, realize that each value of p_W corresponds to a different statistical model. Each model corresponds to a different hypothesis about the true proportion of water covering the planet. Each model can be used as above to help us decide how likely our data are, assuming the model is true. Thus each model produces a likelihood, and these are what are plotted in the figure. If the observed data are very unlikely, for some value of p_W, then the posterior probability will also be lower for that value of p_W. In this case, as in many cases, there is a unique value of p_W that maximizes the likelihood. This value is p_W = 0.6, which is the same as the sample proportion of water,
Probability of data

Pr(D) is weighted average likelihood:

Prior does the weighting.

Job is to standardize posterior so it sums to one (quick check below).
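A small R check of this (my own example, assuming a flat prior dunif over p_W and the 6-of-10 data): integrating the likelihood against the prior gives Pr(D), and dividing by it makes the posterior integrate to one.

R code
like <- function(p) dbinom(6, size = 10, prob = p)   # likelihood as a function of p_W

# Pr(D): weighted average likelihood, weights from the flat prior dunif(p) = 1 on [0,1]
pr_D <- integrate(function(p) like(p) * dunif(p), lower = 0, upper = 1)$value
pr_D   # about 0.0909

# the standardized posterior density integrates to one
integrate(function(p) like(p) * dunif(p) / pr_D, lower = 0, upper = 1)$value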
6/10 = 0.6. This value of p_W is an example of a maximum likelihood estimate. There will be much more to say about such estimates a bit later in this chapter.

Second, these likelihoods do not add up to one, unlike the prior and posterior densities. This is because likelihoods reference the data, which does not change along the horizontal axis in the figure. This means the area under the curve need not and usually will not add up (integrate) to one. This is just like saying that the proportion of women who wear glasses, Pr(glasses | woman), and the proportion of men who wear glasses, Pr(glasses | man), need not add up to one. Likelihoods vary what the probability is conditioned on. This is unlike the posterior probabilities, which vary what is referenced, what is on the left side of the "pipe" symbol, |.

2.2.7.2. Average likelihood of evidence. The next important thing to do with likelihood is to figure out the probability of the evidence, Pr(D). This is just the average likelihood, where the averaging is over models. Since a likelihood Pr(D | M_pW) is a probability of the data, if you compute a weighted average of these likelihoods, you get the unconditional probability of the data, Pr(D). This probability is still conditional on the models, in the sense that it is averaged only over those models you explicitly include in the analysis. So assumptions about the world still intrude here, as they always will. But Pr(D) is not conditional on a single model, like a likelihood Pr(D | M_pW) is.

So Pr(D) is a weighted average likelihood. But where do we get the weights of each model, needed to compute this weighted average? From the prior probability density. The probability of the evidence is given by:

Pr(D) = ∫ Pr(D | M_pW) Pr(M_pW) dp_W.

All this really says is to multiply each likelihood by its corresponding prior and then add up all of these products. This results in a weighted average likelihood. This sort of probability is often called a marginal likelihood, and we'll meet it again in Chapter 5 and explain why at that point.

More important for now is to appreciate the job it does. The role of this probability of evidence inside the conditioning engine, Bayes' theorem, is to normalize the posterior, so that it always sums to one. This ensures that the posterior is a coherent probability density, just like the prior. The relative magnitudes of the different models under consideration will not change, whatever value you assign to Pr(D). This is because
Estimating the posterior
1. Analytical approach (often impossible)
2. Grid approximation (very intensive)
3. Markov Chain Monte Carlo (less intensive)
4. Maximum likelihood and quadratic approximation (approximate)
Grid approximation

The posterior is merely: the standardized product of the likelihood and prior.

Grid approximation uses a finite grid of models instead of an infinite space of models.
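A minimal grid-approximation sketch in R (the 20-point grid and flat prior are my illustrative choices; the book's own code may differ):

R code
p_grid <- seq(0, 1, length.out = 20)               # finite grid of models
prior  <- rep(1, length(p_grid))                   # flat prior over the grid
likelihood <- dbinom(6, size = 10, prob = p_grid)  # likelihood of 6 W in 10 tosses

unstd <- likelihood * prior                        # product of likelihood and prior
posterior <- unstd / sum(unstd)                    # standardize so it sums to one

plot(p_grid, posterior, type = "b",
     xlab = "proportion of water", ylab = "posterior probability")

Raising length.out from a handful of points toward 1000 traces the progression in the panels that follow.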
Figure: grid approximations of the posterior with 3, 5, 10, 20, and 1000 models; each panel plots posterior probability against proportion of water.
This implies assigning a prior probability of zero (or something very close to zero) to all models with p_W > 0.75. Maybe the most transparent way

Figure 2.3. Posterior densities produced by grid approximation (each row shows prior × likelihood = posterior, over p_W from 0 to 1). In each row, a specific prior is combined with the likelihoods of observing 6 waters and 4 lands to produce a posterior density across values of p_W. Top row: A flat prior has no effect on the posterior. Middle row: A truncated prior that assigns probability zero to all models with p_W > 0.75. Bottom row: A peaked prior leads to a peaked posterior.
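To reproduce the middle and bottom rows of Figure 2.3 concretely, here is an illustrative sketch (my own construction, not the book's code) that swaps the flat prior in the grid code above for a truncated and then a peaked prior; everything else is unchanged.

R code
p_grid <- seq(0, 1, length.out = 1000)
likelihood <- dbinom(6, size = 10, prob = p_grid)

prior_trunc <- ifelse(p_grid > 0.75, 0, 1)     # zero confidence above p_W = 0.75
prior_peak  <- exp(-5 * abs(p_grid - 0.5))     # one arbitrary way to peak the prior near 0.5

post_trunc <- likelihood * prior_trunc
post_trunc <- post_trunc / sum(post_trunc)

post_peak <- likelihood * prior_peak
post_peak <- post_peak / sum(post_peak)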
Maximum likelihood

Most published papers report only:
1. maximum likelihood estimate (MLE)
2. standard error (SE)
Figure: a likelihood curve over p_W, annotated with the MLE (at the peak) and the SE.
Maximum likelihood

MLE and SE are crude approximations of posterior:

MLE is a crypto-Bayesian special case (sketch below)
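A minimal sketch of that connection in base R (my own illustration, not the lecture's code): maximize the log-likelihood for the globe data to get the MLE, and use the curvature at the peak, a quadratic approximation, for the SE.

R code
nll <- function(p) -dbinom(6, size = 10, prob = p, log = TRUE)   # negative log-likelihood

fit <- optim(par = 0.5, fn = nll, method = "Brent",
             lower = 1e-4, upper = 1 - 1e-4, hessian = TRUE)

mle <- fit$par                 # about 0.6, the sample proportion of water
se  <- sqrt(1 / fit$hessian)   # about 0.155, from the curvature at the maximum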
More important for now is to appreciate the job it does. The role of this probability of evidence inside the conditioning engine, Bayes' theorem, is to normalize the posterior, so that it always sums to one. This ensures that the posterior is a coherent probability density, just like the prior. The relative magnitudes of the different models under consideration will not change, whatever value you assign to Pr(D). This is because a posterior probability is proportional to the product of the likelihood and the prior:

Pr(M_pW | D) ∝ Pr(D | M_pW) Pr(M_pW).

In many cases, these products are all you actually need, because given these products, you can still compare the relative posterior probabilities of the models, even though you can't say what the precise probability is. If you add up all of these products and divide each product by that sum, you standardize the ensemble of products so that they comprise a probability density, a proper posterior.

We set out here to talk about likelihood, but somehow we ended up again at priors. The prior directly influences the posterior by being the original distribution that is modified inside the conditioning engine, Bayes' theorem. As a consequence, the prior also determines which models have stronger weight in the calculation of Pr(D).

Prior probabilities matter a lot. Except when they don't. It turns out that the vast majority of Bayesian analyses consciously adopt prior probability densities designed to have little to no influence on the posterior. And that is where we turn next, to explaining maximum likelihood estimation as a special case of Bayesian inference from posterior probability.

Justifying likelihood. Bayes' theorem provides a principled way to relate the evidence to our models. But you probably already know that most of applied statistical inference ignores Bayes' theorem in favor of either (1) P-values or (2) maximizing likelihood. Often these two methods are blended in illogical ways. Neither of these approaches typically mentions Bayes' theorem or posterior probability. How do these "non-Bayesian" approaches relate to the posterior probabilities that provide relative measures of confidence in models?

As you'll see in the next chapter, the use of P-values in the testing of null hypotheses, the Tyranny of Fisher, cannot be easily justified. Still, it can be understood as a kind of Bayesian analysis that prefers a null model. Maximum likelihood estimation has an even stronger connection to the posterior density. Indeed, Gauss and Laplace essentially
because I want to plant this distinction in your mind, where it will incubate and later emerge into practice. We will always need something more than posterior probabilities.
Sampling from the naive posterior. Before you start summarizing the shape of the posterior with intervals, it is helpful to learn an approach for describing the entire posterior that is very flexible and usually much more intuitive than mathematical statements about coverage etc. This approach is to treat the posterior like the probability density it is and sample from it. When you sample from the posterior, the samples are models. For now, these models are just parameter values. Once you have samples from the posterior, you can treat them like data, using them to compute various kinds of confidence interval or even to simulate predictions.
2.5.2.1. Simulating samples from the posterior. Here's the general notion. Once you have an estimate of the posterior density, you have a probability distribution of models. This distribution defines a sampling space for generating predictions from the model. Each unique model, say p_W = 0.2, predicts a unique pattern of data. You already know the naive posterior in the case of the proportion of water problem. The posterior probability of each value of p_W is simply proportional to the likelihood:

Pr(M_pW | D) ∝ Pr(D | M_pW).

In this case, the likelihood function is pretty simple, and if you are fluent in the necessary mathematics, it isn't hard to compute all kinds of useful things about this posterior. But most natural and social scientists are far from fluent in the necessary mathematics. Most know what an integral is, but when asked to actually manipulate one, a kind of terror sets in. An empowering approach that allows for most of the same useful calculations, but which requires less integral calculus, is to sample from the posterior and then perform "empirical" calculations on the samples. As a bonus, when you eventually learn how Markov Chain Monte Carlo (MCMC) estimation works (Chapter 11), you'll see that it provides nothing other than such samples. So everything you learn here will be immediately applicable to making inferences from MCMC estimations. Here's the recipe.
(1) Estimate the posterior density, as a list of probabilities corresponding to models (parameter values).
Justifying maximum likelihood
1. Bury the prior in data.
2. Use a weak prior.
3. Summarize what the evidence says.
1. Bury the prior
• As the amount of data increases, the prior becomes less and less important.
• This is why maximum likelihood is consistent for infinite sample size; that's when it's the same as the posterior (see the sketch below).
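The sketch below is assumed, not code from the text: the same prior, weighted toward small values, is combined with 6/10 and then 60/100 waters. With the larger sample the posterior sits almost entirely where the likelihood is.

p.grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- exp( -5 * p.grid )                                # illustrative prior favoring small values
post10  <- dbinom( 6 , size=10 , prob=p.grid ) * prior     # n = 10
post100 <- dbinom( 60 , size=100 , prob=p.grid ) * prior   # n = 100, same 0.6 proportion of water
plot( p.grid , post100/sum(post100) , type="l" )           # nearly centered on 0.6
lines( p.grid , post10/sum(post10) , lty=2 )               # pulled toward the prior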
Figure. How to bury your prior in data. In each panel, the three curves represent the prior (dashed), the likelihood (blue), and the posterior (black). Each panel shows the posterior for different amounts of data, n, from 10 to 1000 tosses of the globe. In each case, the proportion of tosses recorded as water remains constant at 0.6. The prior is the same in each case, chosen to be heavily weighted towards small values. In the upper-left, n = 10 is a small sample, and the likelihood curve is quite wide. As a result, the posterior ends up being a compromise between the prior and the likelihood, lying about half way between their peaks. In the upper-right, n = 50, and the additional evidence narrows the likelihood curve and pulls the posterior much closer to 0.6, the maximum likelihood estimate. In the bottom row, at n = 100 and n = 1000, there is so much evidence now that the prior is essentially overwhelmed, having negligible influence on the posterior.
Figure 2.4. How to bury your prior in data. (Legend: prior, posterior, likelihood.)
(Slide figure: the n = 10 panel from Figure 2.4, prior vs. likelihood vs. posterior.)
2. Weak prior
• Most applied Bayesian analysis in the natural sciences uses weak priors.
• A weak prior is one that has little effect on the posterior.
• A common approach is a uniform prior with nearly constant probability for all models (sketch below).
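A minimal sketch of the idea, assuming a normal-shaped prior whose standard deviation we widen; the specific priors are illustrative, not from the text.

p.grid <- seq( from=0 , to=1 , length.out=1000 )
likelihood <- dbinom( 6 , size=10 , prob=p.grid )
for ( s in c(0.1, 0.5, 10) ) {                        # increasingly weak priors
    prior <- dnorm( p.grid , mean=0.5 , sd=s )
    posterior <- likelihood * prior
    plot( p.grid , posterior/sum(posterior) , type="l" , main=paste("prior sd =", s) )
    lines( p.grid , likelihood/sum(likelihood) , lty=2 )   # standardized likelihood for comparison
}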
Figure. How to make a prior uninformative. The dotted, blue, and black curves are again the prior, likelihood, and posterior, respectively. Each plot above shows these three curves, for identical data (n = 10, 6 water and 4 land), but with different priors. As the variance of the prior increases from top-left to bottom-right, the prior becomes flatter and flatter, and as a consequence the posterior becomes more and more similar to the likelihood, until the posterior and the likelihood are identical in their proportions.
Figure 2.5. How to make a prior weak. (Legend: prior, posterior, likelihood.)
2. Weak prior
• Complications:
  • If the goal is to represent prior ignorance, a flat prior is only one interpretation.
  • Rarely are you completely ignorant.
  • Can often make better inferences with other priors.
Bayesian maximum likelihood
• Use maximum likelihood to estimate the naive posterior:
  1. Find the model that maximizes the likelihood; this will be the peak of the naive posterior.
  2. Use the approximate shape of the posterior near the peak to quantify uncertainty: standard errors.
1. Find the maximum
• Three general methods:
  • Analytical solution (sometimes possible)
  • Exhaustive search (usually impractical)
  • Heuristic search (very common)
• Log-likelihood important (see the optim sketch below)
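As the book notes later, mle2 calls optim under the hood. Here is a minimal heuristic-search sketch using optim directly on the globe-tossing data; the function name neg.log.lik and the optimizer settings are assumptions for illustration, not code from the book.

neg.log.lik <- function(p) -dbinom( 6 , size=10 , prob=p , log=TRUE )
fit <- optim( par=0.5 , fn=neg.log.lik , method="L-BFGS-B" ,
              lower=0.001 , upper=0.999 , hessian=TRUE )
fit$par                 # maximum likelihood estimate, about 0.6
sqrt( 1/fit$hessian )   # standard error from the curvature, about 0.155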
Figure. Minimizing the negative logarithm of the likelihood is equivalent to maximizing the likelihood.
...maximizes the likelihood, then it will also minimize the negative log-likelihood. The important insight, as shown in the figure above, is that where the likelihood curve has a peak, the negative log-likelihood curve has a nadir. We can go back and forth between likelihood and negative log-likelihood whenever we want, provided we keep in mind whether we want to maximize (likelihood) or minimize (negative log-likelihood). To get back to likelihood, just multiply the negative log-likelihood by -1 and then exponentiate it: exp(-(-log L)) = L.
So why use log-likelihood rather than raw likelihood? Isn't this just the kind of game mathematicians play on scientists, to make matters more confusing than they should be? Sadly, no. I admit mathematicians do sometimes make matters more confusing than they should or need to be, although rarely is this intentional. The key practical reason for using negative log-likelihood is that computer software handles small decimal values poorly. In even reasonably large sets of data, even the maximum likelihood will be a very small number, close to zero. This is because the likelihood is the joint likelihood of all the data. This means the likelihood is the product of many individual probabilities, each less than one. As such a product gets close to zero, most computer software, including R, will round it to exactly zero. On the negative log-likelihood scale, however, this value becomes a large but well-behaved positive number. As a result, the computer won't feel inclined to round it. You can verify this for yourself by using the three lines of R code below:
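The sketch below is an assumed reconstruction (the original three lines are not reproduced here); it makes the same point with the globe-tossing likelihood.

lik <- dbinom( rep(6,1000) , size=10 , prob=0.6 )                # 1000 copies of the same observation
prod( lik )                                                      # rounds to exactly 0 (underflow)
-sum( dbinom( rep(6,1000) , size=10 , prob=0.6 , log=TRUE ) )    # a large, well-behaved positive number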

• Easier to use
• More accurate

Heuristic search
• Use mle2 in the bbmle package:
R code 2.16
install.packages( "bbmle" , dependencies="Depends" )
Like all optional packages for R, make it available for your current session with the library command:
R code 2.17
library(bbmle)
Bolker's book, Ecological Models and Data in R,[33] is highly recommended. It mainly focuses on models that go beyond standard linear and generalized linear models. So it overlaps quite little with this book in the details and content of examples, although it has a very similar philosophy and also shows the reader a lot of R code.
This library provides a convenient function, mle2, to specify and fit maximum likelihood models, and we'll use it a lot in the book. mle2 actually uses optim, so it's not doing anything different, under the hood. But mle2 is usually easier to use and debug than optim. To replicate the result we got from calling optim directly, use the code:
R code 2.18
pw.mle2 <- mle2( nw ~ dbinom( size=10 , prob=pw ) ,
    data=list(nw=6) , start=list(pw=0.5) )
Warning messages:
1: In dbinom(x, size, prob, log) : NaNs produced
2: In dbinom(x, size, prob, log) : NaNs produced
I'll explain the components of the code in a moment. First, let's talk about those warnings. What are NaNs, and why were they produced, and why does R warn us about them? The reason for warnings in this case is that R tried to calculate likelihoods for a couple of models with values of p_W greater than 1 or less than 0. Since p_W < 0, for example, is not a probability, dbinom complains and returns an error named NaN, which means not a number. Prove this to yourself by executing:
R code 2.19
dbinom( 6 , size=10 , prob=-1 )
[1] NaN
Warning message:
In dbinom(x, size, prob, log) : NaNs produced
Any value for prob that is less than zero or greater than one will produce the same error. R can nearly always keep searching and find the maximum likelihood estimate, when this happens. The best way to be
(Slide callouts on the mle2 call: the data; "distributed as"; the likelihood function (stochastic node); tell R where the data are; starting values for parameters.)

Heuristic search
a bunch of different models and exploring the space of models, until we find a good candidate for the maximum likelihood.
At this point, some readers will want to skip ahead. If you've had a good education in Fisherian statistics, then you know how maximum likelihood works, at least in the abstract. However, the sections to follow begin to include a substantial amount of R code for actually doing likelihood searches. Importantly, I use these sections to introduce a family of general maximum likelihood commands that the book will use again and again to fit many different sorts of models. These commands are not the easiest way to fit these models, but they do provide a unified and transparent approach that ultimately is much more powerful and illuminating than the menagerie of specialty commands. So I recommend at least skimming through and stopping at the code boxes to execute and meditate on the practical difficulties of actually making your computer do this stuff. The power of an approach is often inversely proportional to how hard it is to learn. So when some approach is easier to learn, you are usually giving up power by using it. After several chapters of fitting models with generalized maximum likelihood commands, you will be grateful I forced you to do it this way, because you will understand the models at a deeper level, and you will be able to fit many models for which there are no built-in convenience commands.
2.4.1. Analytical solution. In this case, our maximum likelihood problem is simple enough to be done analytically. For the sake of demystifying the process, here's how you can derive mathematical expressions for some maximum likelihood estimates. The general approach is just to maximize the likelihood, so this is an exercise in basic calculus. If calculus scares you, then skip ahead to the next subsection, where we do this with R code. It is worth knowing the outline of this argument, however, as the method of least-squares estimation commonly used to fit linear regression models (Chapter 4) usually relies upon this kind of analysis for its justification.
We already know the likelihood function, which we aim to maximize:

L(p_W | n_W, n) = [ n! / ( n_W! (n - n_W)! ) ] p_W^{n_W} (1 - p_W)^{n - n_W},

where p_W is the model, the proportion of the Earth covered in water, n_W is the observed count of water points, and n is the total number of points sampled. Next, we are going to take the natural log of both sides. It is almost always easier to work with probability, if we work in (natural) logarithms.
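Carrying the calculus through gives d log L / dp_W = n_W/p_W - (n - n_W)/(1 - p_W); setting this to zero yields p̂_W = n_W/n, which is 6/10 = 0.6 here. A quick numerical check of that result (an assumed sketch, not code from the book):

p.grid <- seq( from=0 , to=1 , by=0.001 )
lik <- dbinom( 6 , size=10 , prob=p.grid )
p.grid[ which.max(lik) ]    # 0.6, matching n_W / n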
2. Standard errors
• Maximum likelihood estimate (MLE) not enough; need an estimate of the shape of the posterior.
of the precision of that estimate, and the log-likelihood at the estimate. Let's focus on the coefficients table for now:
Coefficients:
   Estimate Std. Error z value    Pr(z)
pw  0.60000    0.15492   3.873 0.0001075 ***
Each row in this table is a dimension in model space, often called a coefficient or a parameter. There is only one such dimension in this case, the probability p_W. The first value in the row is the estimate, the value of this variable that gives us the model with the highest likelihood. The second value is the standard error of the estimate. I explain what this value means and how to compute it yourself, using the likelihood, in the next section. The last two values, a so-called z value and Pr(z), have to do with a null hypothesis test. They should be ignored, as I'll argue at length in the next chapter.
In fact, if all you want is summary information about the estimates and their precision, better to use an alternative command that comes with the library that accompanies this book:
R code 2.23
precis( pw.mle2 )
   Estimate Std. Error      2.5%     97.5%
pw      0.6  0.1549193 0.2963637 0.9036363
The precis command, after French précis (pronounced something like PRAY-see), provides a short summary of the coefficients only. I use it throughout the book, as well as in my daily work. It's certainly not all the information you'll need, but at least it doesn't distract you with those significance tests. The estimate and standard error columns are the same as those provided by summary. But instead of those pointless p-values (see an extended argument against them in Chapter 3), you get instead a 95% quadratic estimate confidence interval. The next major section of this chapter goes into what these confidence intervals are and how to compute them.
2.5. Confidence: More Than the Maximum
Most readers have probably watched track and field events at the Summer Olympics or similar competitions. In a race like the 100-meter dash, we can almost always identify a single winner, the competitor with the shortest finishing time. Is this winner the fastest runner? In this single race, yes. But what about on average? To make this more interesting, suppose the same runners will compete again in the same race in one month's time. If you are asked to predict the outcome of this future
Figure. The likelihood function (unstandardized naive posterior) and confidence in models. In each of the four plots, the three curves represent likelihood functions at three different sample sizes, n = 10, n = 20 and n = 100. The top row shows these functions for cases in which the maximum likelihood estimate is p̂_W = 0.6. The bottom shows these functions when the maximum likelihood estimate is instead p̂_W = 0.1. The left plot in each row shows the raw likelihood scale, while the right plot shows the same functions on the log-likelihood scale, making it much easier to appreciate the different curvatures as a function of sample size.
Figure 2.8. Shape of likelihood and confidence.
2. Standard errors
• Standard error: if the posterior were normal in shape, the standard deviation of that normal density.
• Also known as the quadratic approximation
(Slide figures: a parabola and exp(parabola), alongside the negative log-likelihood and likelihood for p_W, with two approximations marked. 1: Linear approximation (line). 2: Quadratic approximation (parabola).)
Quadratic approximation
• Standard error is just the square root of the variance estimate.
later in the book. At that time, I'll also explain why this thing is called hessian.
2.5.4.6. Getting the curvature from mle2. You've already seen that mle2 will compute the standard error for you, in its summary table. You can extract these standard errors directly from the table with:
R code 2.38
summary(pw.mle2)@coef[,2]
[1] 0.1549193
The symbol pw.mle2 is the symbol you stored the maximum likelihood estimate in before. The code above just generates the summary (summary), selects the coefficients table from it (@coef), and then extracts only the second column of that table ([,2]).
What if you want the raw second derivative? Well, you could compute it from the standard error, knowing that ∂² log L / ∂p_W² = -1/σ². It turns out that mle2 doesn't store a hessian with the second derivatives, like optim does. Instead, it stores a matrix of the reciprocals of the second derivatives. This makes the diagonal of this matrix variances, σ². You can extract this matrix using:
R code 2.39
vcov( pw.mle2 )
Later, when we have multi-dimensional models, we'll work with this matrix, a so-called variance-covariance matrix, to summarize the posterior distribution. So it's good to know how to access it. Most of the model fitting commands in R support extracting this variance-covariance matrix with the command vcov. However, vcov does not work with optim directly. Don't worry: when we need this matrix, you'll see exactly how to use it. And since you just suffered through all that junk about second derivatives, in principle you could compute this matrix yourself.
2.5.4.7. Estimating the curvature numerically. In most cases, what optim and mle2 are doing is using a numerical technique to approximate the second derivatives. They aren't doing analytical calculus. Still, sometimes neither optim nor mle2 can manage to compute estimates of the second derivatives. That doesn't mean all is lost, however. Sometimes you can do the estimate yourself, with a similar numerical approach. Or maybe you just want to know how optim is doing it. Here's some code to estimate the second derivative of the log-likelihood numerically, for our proportion of water problem:
      pw
pw 0.024
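Here is one way to do it, a minimal central-difference sketch (an assumption, not necessarily the book's own code) that reproduces the 0.024:

nll <- function(p) -dbinom( 6 , size=10 , prob=p , log=TRUE )    # negative log-likelihood
h <- 1e-4
d2 <- ( nll(0.6+h) - 2*nll(0.6) + nll(0.6-h) ) / h^2             # curvature at the MLE, about 41.67
1/d2                                                             # variance, about 0.024
sqrt( 1/d2 )                                                     # standard error, about 0.155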
density for a normal is:

L(x | μ, σ²) = ( 1 / √(2πσ²) ) exp( -(x - μ)² / (2σ²) ),

where x is an observed value, μ is the mean, and σ² is the variance. We're treating this as the likelihood of the observed mean x, when the true mean is μ and the variance of the likelihood function is σ². (If you don't recognize this formula, or don't understand where it comes from, that's okay. You'll get a crash course in this important density in a later chapter.) The log-likelihood is then:

log L = -(x - μ)² / (2σ²) - (1/2) log(2π) - log(σ).

Now to compute the curvature. The derivative with respect to x is:

∂ log L / ∂x = -(x - μ) / σ².

And finally the second derivative, yielding the curvature, is:

∂² log L / ∂x² = -1/σ².

Solving for σ², the variance of the normal probability density, we get:

σ² = -1 / ( ∂² log L / ∂x² ).

In other words, the variance of the posterior is approximately (if the posterior is approximately normal) the reciprocal of the second derivative of the negative log-likelihood. So if we can estimate the second derivative of the negative log-likelihood, all we have to do is divide one by it to get an estimate of the variance of the posterior distribution.
Back to our proportion of water estimate, p̂_W = 0.6. We computed the curvature of the negative log-likelihood at the maximum likelihood estimate to be about 41.667. So it follows then that our estimate of the variance of the posterior distribution for p̂_W is:

1 / 41.667 ≈ 0.024.

Most functions in R use the standard deviation of a normal, σ, instead of the variance, σ². The standard deviation is just the square root of the variance, σ = √σ². The standard deviation here is therefore about 0.1549; that's the standard error from the coefficients table produced by summary(pw.mle2). So if we want to find the values of p_W that enclose 95% of the probabilities of models, then we just ask R:
sqrt( 0.024 )
[1] 0.1549193
Quadratic approximation
• Buyer beware:
  • Quadratic approximation gets better as the sample gets larger.
  • But when the estimate is near a boundary, QA is especially bad.
(Slide figures: negative log-likelihood curves for p_W at sample sizes 10, 20, and 100, with the quadratic approximation overlaid; one set with the MLE at 0.6, one with the MLE at 0.1 near the boundary.)
Quadratic approximation
• When safe to use?
  • When the sample is large relative to the complexity of the model
  • When the MLE is far from a boundary
• A great start for MCMC
• Usually all people report in journals, so good to know the Bayesian interpretation
Sampling from the posterior
• Incredibly useful to sample randomly from the posterior:
  • Visualize uncertainty
  • Compute confidence intervals
  • Simulate observations
• MCMC produces only samples
• Above all, easier to think with samples
Sampling from the posterior
• Recipe:
  1. Estimate posterior, defining probabilities of different models (parameter values)
  2. Sample with replacement from posterior
  3. Compute stuff from samples
Sampling from the posterior
(2) Sample repeatedly from the posterior distribution. You end up with a list of parameter values. If you sample enough, the proportion of each parameter value in this list will converge to the posterior probability.
(3) Finally, use the samples from the posterior to make calculations.
The main tasks we'll use these samples for are to calculate confidence intervals and to simulate sampling of new data.
In simple one-parameter cases, like the proportion of water problem, all this sampling may seem unnecessary. It's relatively easy to get exact probability statements for such simple formal distributions. But in later chapters, when there are many parameters in the models, it won't be so easy. In those cases, the posterior densities of different parameters will be correlated with one another. Then accurately reflecting the joint impact of these correlations on uncertainty is usually well beyond the mathematical training of a typical natural or social scientist. However, once you learn the sampling approach, it extends into more complex models quite easily. And again, once you learn this stuff, it prepares you automatically to analyze MCMC output.
2.5.2.2. Sampling proportions of water. We need an example to make more sense of this. In the proportion of water example, we can compute the naive posterior by using the likelihood function. You'll compute it at intervals of 0.001, using code you're familiar with by now:
R code 2.24
models <- seq( from=0 , to=1 , by=0.001 )
post <- dbinom( 6 , size=10 , prob=models )
The symbol post now contains likelihoods for all of the models in models. These likelihoods are proportional to the naive posterior probabilities. Go ahead and plot the curve these likelihoods imply:
R code 2.25
plot( models , post , type="l" )
The function isn't really a proper posterior distribution, because it hasn't been normalized so that it sums to one. But that isn't an obstacle for what we're going to do, which is to use the relative proportions of the likelihoods to sample models.[34] To draw 10-thousand random samples of models from this un-normalized naive posterior distribution, we can make use of the handy sample command, which is specialized for just this kind of task:
(Slide figure: the result of plot(models), the grid of model values rising from 0 to 1 against index.)
(Slide figure: the result of plot(post), the likelihood of each model value, peaking near 0.6.)
Figure. Samples from the naive posterior of p_W. On the left, the individual samples are ordered along the horizontal axis. Density of the points is greatest near p_W = 0.6, the maximum likelihood estimate. On the right, plotting the density (similar to the histogram) of these samples in blue shows an excellent approximation to the actual naive posterior, shown in black.
R code 2.26
samples.pw <- sample( models , size=10000 , prob=post ,
    replace=TRUE )
You just told R to draw 10-thousand values of p_W randomly, in proportion to their posterior probabilities. The command sample will automatically ensure that post is normalized, so that's why you didn't have to do it yourself. Values of p_W may appear more than once in the resulting list samples.pw, and the replace=TRUE parameter in the code ensures this.
The figure demonstrates the relationship between these samples and the posterior probabilities themselves. In the lefthand plot, I've plotted each sample in samples.pw, with sample number on the horizontal axis and the sampled value itself on the vertical axis. The points scatter all over, because there is considerable uncertainty as to the value of p_W. But they do cluster around p_W = 0.6, the maximum likelihood estimate. Each value of p_W appears in these samples with proportion approximately equal to its posterior probability. In the righthand plot, I show the density resulting from the point samples, in blue. It is jagged,
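Once samples.pw exists, a few one-liners already summarize the posterior; the density() call here is base R's smoother, used as a stand-in for whatever smoother produced the figure.

mean( samples.pw )              # posterior mean, near 0.6
plot( samples.pw )              # the lefthand panel of the figure
plot( density( samples.pw ) )   # a smoothed version of the righthand panel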
Sampling from the posterior
Figure 2.9. Samples from the naive posterior.
Confidence intervals
• Confidence intervals are a concise way to summarize the shape of the posterior.
• Two common sorts:
  1. Predefined boundaries
  2. Predefined probability mass
Figure. Two kinds of confidence interval. On the left, an interval with predefined boundaries, in this case seeking all of the probability mass between p_W = 0 and p_W = 0.5 (about 27%). On the right, an interval with predefined probability mass (95%) but variable boundaries.
...2728/10000 ≈ 0.27. If you want to explicitly define the lower bound of the interval too, you need code like this:
R code 2.28
length( samples.pw[ samples.pw >= 0 & samples.pw <= 0.5 ] )
You can verify that this gives the same answer, 2728. I plot the region defined by this interval in the figure. One thing you could say from such a calculation is that the odds that more of the Earth is covered with water than with land are (1 - 0.27)/0.27 ≈ 2.7. (Or rather, the odds that the best approximation assumes more water than land are about 2.7.)
2.5.3.2. Variable intervals of predefined mass. More common in science journals is the second type of interval, one with predefined probability mass. These intervals are useful for communicating the shape of the posterior, when a full picture of it is either cumbersome or unnecessary. The goal is to define the lower and upper boundaries of an interval of parameter values that includes some predefined probability of all models. The most common predefined probability is 0.95, corresponding to a 95% confidence interval. There are actually a few different ways to define such an interval, although the differences rarely matter in any
Figure 2.10. Defined boundaries (left) and defined mass (right). Easy to compute both from samples (see the sketch below).
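A sketch of both kinds of interval computed from the same samples; the 0.5 boundary and the 95% mass are the examples used above.

mean( samples.pw < 0.5 )                   # mass below a predefined boundary, about 0.27
quantile( samples.pw , c(0.025, 0.975) )   # boundaries enclosing a predefined 95% mass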
Predictive checks
• Posterior probability never enough
• Even the best model might make terrible predictions
• Also want to check model assumptions
• Predictive checks: can use samples from the posterior to simulate observations
how to get samples from the posterior density, naturally including all of the variances and covariances among parameters, but without worrying about any of this quadratic approximation stuff. That will be fantastic, and it's one reason why MCMC is so powerful. The cost, however, is that MCMC takes a lot more computer time than maximum likelihood and quadratic approximation does. And in many cases, the two approaches yield functionally identical inferences. There are many important models, however, for which we can't compute the necessary likelihoods or for which we can't trust the quadratic approximation. In those cases, MCMC saves your bacon. This business of applied statistical computing is all about tradeoffs. You want to know the costs and compromises of each method, so you'll know when to put down one method and pick up another.
Predictive checks. To evaluate the range of predictions implied by the naive posterior, we generate a sample of data. Each value of p_W implies its own distribution of predictions, and we want to average over these unique distributions, using the posterior probabilities. This sounds pretty daunting, when spoken of mathematically. But again, once we have "empirical" samples from the posterior, it's just like summarizing data. First, we use each sampled posterior value to simulate sampling. This collection will automatically be averaged appropriately over the posterior. Then we summarize the collection of samples.
There is a random binomial command in R expressly for the purpose of simulating samples from a binomial process. In fact, all of the built-in likelihood functions in R, like dbinom, have corresponding random functions that accept parameter values and produce simulated samples of data. Here's how to use rbinom, the random function corresponding to dbinom:
R code 2.41
models <- seq( from=0 , to=1 , by=0.01 )
post <- dbinom( 6 , size=10 , prob=models )
sim.models <- sample( models , size=10000 , replace=TRUE , prob=post )
sim.data <- rbinom( n=10000 , size=10 , prob=sim.models )
What should you do with these "fake" data now? That depends upon your question, but a typical use might be to visualize the pattern of predictions and generate prediction intervals. Prediction intervals are like confidence intervals, but they are ranges of data rather than ranges of models containing a specified proportion. Let's visualize the simulation data we just computed, before turning to computing prediction intervals
Figure. Simulated data, using both the entire posterior density (blue) and just the maximum likelihood estimate (orange). Model predictions based only on the maximum likelihood can substantially underestimate the uncertainty.
...in this case. In later chapters, you'll see that these concepts have huge value in interpreting the meaning of model fits.
To plot the simulated samples, you could use the standard hist command in R, but for discrete data of this sort, better to use the simplehist command that is part of the code that accompanies this book:
R code 2.42
simplehist( sim.data )
You can make a similar plot for simulated data produced from only the maximum likelihood estimate:
R code 2.43
sim.data.mle <- rbinom( n=10000 , size=10 , prob=0.6 )
simplehist( sim.data.mle )
In Figure 2.11, I show both of these plots, stacked together for ease of comparison. The blue bars show the frequencies of different counts of water, n_W, generated by averaging over the uncertainty in the entire naive posterior density, the profile likelihood. These are the values in sim.data. The orange bars are the frequencies of counts simulated from only the maximum likelihood estimate. These are the values in sim.data.mle. You can see right away that the blue distribution is much flatter than the orange. This is a result of honestly incorporating the uncertainty as to the true value of p_W into our predictions. Using only the maximum likelihood model, p_W = 0.6, produces a narrower distribution of predictions, bunched up around n_W = 6, but this greater confidence is just an illusion of ignoring the uncertainty in the entire posterior.
Figure 2.12. Simulated predictions: MLE only vs. the entire posterior.
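Since the simulated counts in sim.data are just data, a prediction interval is one quantile call away; this is a minimal sketch (assumed), not the book's own code.

quantile( sim.data , c(0.025, 0.975) )   # counts n_W covering 95% of simulated predictions
table( sim.data ) / 10000                # predicted distribution of counts, averaged over the posterior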
Predictive checks
• Something like a significance test, but not
• No universally best way to evaluate adequacy of model-based predictions
• No way to justify always using a threshold like 5%
• Good predictive checks always depend upon purpose and imagination
