
© Hyon-Jung Kim, 2015
Bayesian statistics
Bayesian: named after Thomas Bayes (1702-1761)
It provides a natural, intuitively plausible way to think about statistical problems: prior information is revised in the light of observed data.

Probability
Randomness
- Most statistical modeling is based on an assumption of random observations from
some probability distribution
- The objective of statistical modeling is to discover the systematic component and
filter out the random noise.
1. Estimate parameters of the distribution generating the data
2. Test hypotheses about the parameters
3. Predict future occurrences of data of the same type
4. Choose appropriate action given what we can learn about the phenomenon generating the data
- It may be difficult or impossible to distinguish between very complex or chaotic deterministic phenomena and truly random phenomena. (The more data we have, the more complex our models tend to grow.)
Definition: Probability P (A) is a measure of the chance that an event A will happen.
Axioms of Probability
1. P(A) ≥ 0 for any event A
2. P(S) = 1, where S is the universal set (sample space).
3. P(A^c) = 1 − P(A), where A^c is the complement of A.
4. If A and B have no outcomes in common, then P(A ∪ B) = P(A) + P(B).

Interpretations of probability
1. Classical: Probability is a ratio of favorable cases to total (equipossible) cases
The fundamental assumption is that the game is fair and all outcomes are equally likely (an idea rooted in games of chance).
2. Frequentist: Probability is the limiting value of the relative frequency of some event as the number of trials becomes infinite.
It can legitimately be applied only to repeatable problems and is regarded as an objective property of the real world.
3. Subjectivist: Probabilities are numerical values assigned to represent degrees of personal belief. Most events in life are not repeatable. Probabilities are essentially conditional, and there is no one correct probability.
Frequency probability inference:
- Data are drawn from a distribution of known form but with an unknown parameter
and often this distribution arises from explicit randomization.
- Inferences regard the data as random and the parameter as fixed (even though the
data are known and the parameter is unknown)
Subjective probability inference:
- Probability distributions are assumed for the unknown parameters and for the observations (i.e. both parameters and observations are random quantities).
- Inferences are based on the prior distribution and the observed data.
Comparison/Generality
- Frequentists are disturbed by the dependence of the posterior results on the subjective
prior distribution
- Bayesians say that the prior distribution is not the only subjective element in an
analysis. The assumptions about the sampling distributions are also subjective.


- Whose probability distribution should be used? When there are enough data, a good Bayesian analysis and a good frequentist analysis will tend to agree. If the results are sensitive to the prior information, a Bayesian analyst is obligated to report this sensitivity and to present a range of results obtained from a wide range of prior information.
- Bayesians can often handle problems the frequentist approach cannot. Bayesians
often apply frequentist techniques but with a Bayesian interpretation. Most untrained
people interpret results in the Bayesian way more easily. (Often the Bayesian answer
is what the decision maker really wants to hear.)
Conditional Probability: the conditional probability of B given A is
P(B|A) = P(A ∩ B)/P(A),
where P(A ∩ B) = P(AB) is the joint probability that both A and B occur.
Independence of events:
A and B are independent if P (A|B) = P (A) or P (B|A) = P (B).
Multiplication Rule:
P (AB) = P (A)P (B|A).
Then, by definition:
P (A|B) = P (AB)/P (B) = P (B|A)P (A)/P (B)
*** This is the Bayes theorem.
Applying the Law of Total Probability:
P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
So
P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c))
This result is referred to as the expanded form of the Bayes theorem.


Example: Diagnostic test for a rare disease


A young person was diagnosed as having a certain disease on the basis of a medical test (i.e. his test result was positive). Suppose that the probability of having the disease in general is about 0.0001, and suppose further that the probability of a positive test result when the person really has the disease is 0.9. Also, the probability of a positive test result when the person does not have the disease is 0.01.
What is the probability that the person has the disease given that he has a positive
test response?
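A quick numerical check of this calculation (a minimal R sketch using only the numbers stated above):

p_disease <- 0.0001          # prior probability of the disease
p_pos_dis <- 0.9             # P(positive | disease)
p_pos_no  <- 0.01            # P(positive | no disease)
p_pos <- p_pos_dis * p_disease + p_pos_no * (1 - p_disease)  # law of total probability
p_pos_dis * p_disease / p_pos                                # P(disease | positive), about 0.0089

Despite the positive test, the posterior probability of disease is below 1%, because the disease is so rare.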

Example: Icy roads


Inspector Smith is waiting for Holmes and Watson who are both late for an appointment. Smith is worried that if the roads are icy, one or both of them may have crashed
his car.
P(roads are icy) = 0.7,   P(roads are not icy) = 0.3

P(one crashes | icy) = 0.8,   P(one crashes | not icy) = 0.1

i) What is the probability that Holmes crashes his car?

Suddenly Smith learns that Watson has crashed his car.


ii) What is the probability that Holmes crashes his car given the information that
Watson has crashed?
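A sketch of the calculation in R, assuming (as is usual for this example) that Holmes's and Watson's crashes are conditionally independent given the state of the roads:

p_icy          <- 0.7
p_crash_icy    <- 0.8
p_crash_noticy <- 0.1

# i) marginal probability that Holmes crashes
p_holmes <- p_crash_icy * p_icy + p_crash_noticy * (1 - p_icy)     # 0.59

# ii) learning that Watson crashed changes the belief about icy roads
p_icy_w <- p_crash_icy * p_icy / p_holmes                          # about 0.949
p_crash_icy * p_icy_w + p_crash_noticy * (1 - p_icy_w)             # about 0.76

Watson's crash makes icy roads much more likely, which in turn raises the probability that Holmes has crashed from 0.59 to about 0.76.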


Bayesian Inference

Random variables: represent outcomes of an uncertain phenomenon


- Formally, a random variable is a function from the sample space to a set of possible values.
- In Bayesian thinking, the sample space can be thought of as a set of possible worlds in which the random variable may have different values.
To a frequentist, the height of a person is not a random variable, but it is to a Bayesian
if you do not know his/her height.
Probability distribution
A probability distribution for a random variable quantifies how likely the different values are and summarizes all the information about the random variable.
- Summaries of the distribution: Measures of central tendency, variability, and the
shape of distribution.
- Parameterized families of distributions: much of statistics is based on constructing
models of phenomena using parameterized families of distributions.
Observations X = (X1, . . . , Xn) are sampled randomly from the distribution f(X|θ), where θ denotes the parameter of a model for the phenomenon generating the observations, e.g. Normal, Poisson, Gamma distributions, etc.
- Conditional/marginal distributions
Bayesian Inference for parameters of probability models
- Often we model a set of observations X = (X1 , X2 , ..., Xn ) as independent trials from
a probability distribution with unknown parameters. The joint distribution of X given a parameter θ, viewed as a function of θ, is called the likelihood function. To draw inferences about θ, a Bayesian statistician specifies a prior distribution g(θ) for θ. The Xi's are usually independent given θ, but are not independent marginally.


Bayes theorem in parametric distribution


It is essentially a formula for learning from new data. It tells us how to convert our
prior beliefs for a proposition into posterior beliefs after learning the information (in
addition to the background prior information). It is fundamental to the Bayesian
approach to statistics.
Suppose we have an initial or prior belief about a hypothesis H and suppose that we
observe some data D. Then we can calculate our revised or posterior belief about the
truth of H in the light of the data evidence D using Bayes theorem.
P(H|D) = P(D|H)P(H) / P(D)

In parallel, for a parametric model,

[θ|X] = [X|θ][θ] / [X]

g(θ|X) = f(X|θ)g(θ) / f(X) = f(X|θ)g(θ) / ∫ f(X|θ)g(θ) dθ   for continuous θ

i.e. Posterior distribution ∝ likelihood × prior


- The marginal likelihood f(X) is also called the predictive distribution for the observations. (The predictive distribution incorporates both uncertainty about X given θ and uncertainty about θ. We use f(X) to predict the observations we expect before observing them.)
- A fundamental identity of Bayesian inference: f(X|θ)g(θ) = f(X)g(θ|X)
- Frequentists condition on parameters and base inferences on data distribution. Bayesians
condition on knowns and treat unknowns probabilistically.
Learning
- Free (prior) information can never hurt. Decide what information to collect, and weigh the cost of sampling against the information gained from the sample.
- The general principle of elaboration is to use probability theory to construct probabilities that are difficult to measure directly, and therefore liable to large measurement errors, out of other probabilities that are easier to measure and hence hopefully more accurate.

Likelihood

The problem of statistical inference is to use observed data to learn about unknown
features of the process that generated those data.
In order to make inferences, it is essential to describe the link between X and θ, and this is done through a statistical model. The purpose of the statistical model is to describe this relationship by deriving the probability P(X|θ) with which we would have obtained the observed data X for any value θ of the unknown parameter vector.
Definition: The likelihood function L(θ : X) is defined as any function of θ such that L(θ : X) = c · P(X|θ) for some constant c.
Likelihood may not be enough for inference. The Bayesian approach is based also on
some prior information.
Prior distribution ( g(θ) or P(θ) ):
formulates your prior beliefs about the parameters.
Note that frequency probability is not able to represent such beliefs, since parameters are regarded as unknown but not random. The prior distribution is the major source of disagreement between the two approaches, Bayesian and frequentist.
Posterior distribution ( g(θ|X) or P(θ|X) ):
represents the probability distribution of the unknown parameter after we take the prior information into account and learn from the data.
Note again that the posterior distribution has no meaning in the frequentist theory.
Bayesian Methods for Inference
i) Model a set of observations with a probability distribution with unknown parameters.
ii) Specify prior distributions for the unknown parameters.
iii) Use the Bayes theorem to combine these two parts into the posterior distribution.
iv) Use the posterior distribution to draw inferences about the unknown parameters of
interest.

Example: Prior to Posterior


Suppose that there are three states of nature A1 , A2 , A3 and two possible data D1 , D2 :
What happens to our belief about A1 , A2 , A3 if we observe D1 ? (if we observe D2 ?)
P(D|A)    D1     D2     Prior
A1        0.0    1.0    0.3
A2        0.7    0.3    0.5
A3        0.2    0.8    0.2
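A minimal R sketch of the update (posterior ∝ likelihood × prior, renormalized):

prior  <- c(A1 = 0.3, A2 = 0.5, A3 = 0.2)
lik_D1 <- c(0.0, 0.7, 0.2)                      # P(D1 | A)
lik_D2 <- c(1.0, 0.3, 0.8)                      # P(D2 | A)
lik_D1 * prior / sum(lik_D1 * prior)            # approx. (0.000, 0.897, 0.103)
lik_D2 * prior / sum(lik_D2 * prior)            # approx. (0.492, 0.246, 0.262)

Observing D1 rules out A1 completely (since P(D1|A1) = 0) and shifts most of the belief to A2; observing D2 moves belief towards A1 and A3.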

Example:
A black male mouse is mated with a female black mouse whose mother had a brown
coat.
B and b are alleles of the gene for coat color. The gene for black fur is given the letter
B and the gene for brown fur is given the letter b where B is the dominant allele to b.
The male and female have a litter with 5 pups that are all black. We want to determine the male's genotype. The prior information suggests P(BB) = 1/3 and P(Bb) = 2/3.
What is the posterior probability that the male's genotype is BB?
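One way to carry out the update (a sketch; note that the female must be Bb, since her mother was brown, i.e. bb):

# P(black pup) is 1 if the father is BB, and 3/4 if the father is Bb (a Bb x Bb mating)
prior <- c(BB = 1/3, Bb = 2/3)
lik   <- c(BB = 1, Bb = (3/4)^5)   # probability of 5 black pups, given the father's genotype
post  <- lik * prior / sum(lik * prior)
post["BB"]                          # about 0.68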


Probability Distributions:
Yi ~ Poisson(λ) distribution
f(yi : λ) = λ^{yi} e^{−λ} / yi!,   yi = 0, 1, 2, ...

Yi ~ Exponential(1/β) distribution
f(yi : β) = (1/β) e^{−yi/β},   yi ≥ 0

Yi ~ Gamma(α, β) distribution
f(yi : α, β) = 1/(Γ(α) β^α) · yi^{α−1} exp(−yi/β),   yi > 0, α > 0, β > 0

Yi ~ Inverse Gamma(α, β) distribution
f(yi : α, β) = 1/(Γ(α) β^α) · yi^{−(α+1)} exp(−1/(yi β)),   yi > 0, α > 0, β > 0

Yi ~ Normal(μ, σ²) distribution
f(yi : μ, σ²) = 1/√(2πσ²) · exp(−(yi − μ)²/(2σ²)),   −∞ < yi < ∞, σ² > 0

Yi ~ Binomial(n, p) distribution
f(yi : n, p) = C(n, yi) p^{yi} (1 − p)^{n−yi},   yi = 0, 1, ..., n

Yi ~ Beta(α, β) distribution
f(yi : α, β) = Γ(α + β)/(Γ(α)Γ(β)) · yi^{α−1} (1 − yi)^{β−1},   0 < yi < 1, α > 0, β > 0


Posterior Inference

Point Estimation: to give a single best representative estimate of θ

1. Posterior mean: E(θ|X), the expected value of θ under the posterior.
2. Posterior median: a central value for θ; the value t such that P(θ ≤ t | X) = 0.5.
3. Posterior mode: the most probable single value of θ; the value of θ that maximizes P(θ|X).
- Comparison with frequentist point estimation: for example, unbiasedness.
Interval Estimation
We wish to specify a set of values for θ that we believe the true parameter is likely to lie in. This set is often thought of and constructed as an interval, but it can be a more general set (such as the union of two disjoint intervals). That is, we look for a set S such that
P(θ ∈ S | X) = α
where α specifies how certain we wish to be that θ lies in the set. There will typically be many such sets S. It is sensible to ask for the shortest interval (or the set with the smallest volume) that still contains θ with probability α.
The highest posterior density (HPD) set with probability α is such an interval.
To find it, consider drawing a horizontal line across the density function at a height c. For any given c, the set of all values θ for which the posterior density p(θ|X) ≥ c forms an HPD set. We simply adjust c up or down to find the HPD set with the required probability α (a numerical sketch follows the remarks below).
- Such a set is also called a Bayesian credible interval.
- It may coincide numerically with a frequentist confidence interval, but the interpretation is entirely different.
- Comparison with frequentist confidence intervals
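A minimal R sketch for a unimodal posterior (here an illustrative Beta(5, 2) density): rather than adjusting the height c directly, one can equivalently search for the shortest interval with the required posterior probability.

# Shortest interval with posterior probability alpha, given the posterior quantile function
hpd_interval <- function(alpha, qpost) {
  width <- function(p_lo) qpost(p_lo + alpha) - qpost(p_lo)
  p_lo  <- optimize(width, interval = c(0, 1 - alpha))$minimum
  c(lower = qpost(p_lo), upper = qpost(p_lo + alpha))
}
hpd_interval(0.95, function(p) qbeta(p, 5, 2))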
Bayesian Hypothesis testing (only briefly now but later in detail)

In general, frequentist methods are always based on the idea of repeated sampling, and
their properties are all long-run average properties obtained from repeated sampling.
The Bayesian approach is not restricted to just these standard kinds of inference that
the frequentist theory considers. We can use the posterior distribution to derive whatever kinds of statement seem appropriate to answer the questions that the investigator
may have.
Prior sensitivity
If the posterior is insensitive to the prior, where the prior is varied over a range that
is reasonable, believable and comprehensive, then we can be fairly confident of our
results. This usually happens when there is a lot of data or data of good quality.

Choice of Prior

It should be emphasized that if you have some real prior information you should use
it and not one of those uninformative or automatic priors.
The most important principle of prior selection is that your prior should represent the
best knowledge that you have about the parameters of the problem before looking at
the data.
Conjugacy
When the posterior distribution is in the same family of distributions as the prior
distribution, we have conjugate pairs of distributions. (We also say that the family
of distributions is closed under sampling).
Note that there are several other cases that we will take a look at later.

Example: Poisson data

Example: Binary data
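For instance (standard conjugate updates): with y1, ..., yn ~ Poisson(λ) and prior λ ~ Gamma(α, β) in the scale parameterization above, the posterior is λ | y ~ Gamma(α + Σ yi, β/(1 + nβ)); and with y successes in n binary trials and prior p ~ Beta(α, β), the posterior is p | y ~ Beta(α + y, β + n − y).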



Informative priors: subjective, opinion-based priors; they should be chosen with care in practice.
Noninformative priors (or reference priors)
There are several ways to formulate priors that can be used in the case of no substantial
information beforehand.
- Vague priors
Sometimes one may have real information that can lead to a prior but the prior will
still be vague or spread out.
- Jeffreys priors: H. Jeffreys (1961) proposed a general way of choosing priors:
p(θ) ∝ |I(θ : X)|^{1/2}
where I(θ : X) is the Fisher information for p(x|θ). Note that θ may be a vector of parameters. Recall

I(θ : X) = E[ (∂ log f(X|θ)/∂θ)² ] = −E[ ∂² log f(X|θ)/∂θ² ]
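As an illustration (a standard result): for a single observation from the Poisson(λ) model, ∂ log f(y|λ)/∂λ = y/λ − 1, so I(λ) = E[(y/λ − 1)²] = Var(y)/λ² = 1/λ, and the Jeffreys prior is p(λ) ∝ λ^{−1/2}. This prior does not integrate to a finite value, which ties in with the note on improper priors below.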

- Improper priors: priors that do not integrate to one and hence do not have a legitimate probability density.


NOTE: Improper priors can be used only if they induce proper posterior distributions.
Mixture of Conjugates (will be discussed later)

Likelihood principle
Bayesian inference is based on the likelihood function, combined with some prior information about the parameter.
The likelihood principle can be stated as follows: in making inferences or decisions about the parameters of a problem after observing data, all relevant information is contained in the likelihood function for the observed data.
Furthermore, two likelihood functions contain the same information about the parameter if they are proportional to each other as functions of the parameter.
Example: Binomial data

EXAMPLE : Transmission Error Data


Data: 6 one-hour observation periods with 1, 0, 1, 2, 1, 0 transmission errors. The data are modeled as a sample from a Poisson distribution with mean rate λ. The data on the previous system established the mean error rate as 1.6 errors per hour. The new system has a design goal of cutting this rate in half, to 0.8 errors per hour. An expert has a prior belief that the median of λ for the new system should be close to 0.8.
Consider a gamma distribution with α = 2.1 and β = 2 as the prior distribution. Does the new system seem able to achieve the design goal?
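A sketch of the analysis in R, assuming the Gamma(shape = 2.1, rate = 2) reading of the prior (under which the prior median, qgamma(0.5, 2.1, rate = 2), is about 0.89, close to the expert's 0.8):

y  <- c(1, 0, 1, 2, 1, 0)
a0 <- 2.1; b0 <- 2              # prior shape and rate
a1 <- a0 + sum(y)               # posterior shape: 7.1
b1 <- b0 + length(y)            # posterior rate: 8
a1 / b1                         # posterior mean of lambda, about 0.89
pgamma(0.8, a1, rate = b1)      # P(lambda <= 0.8 | data), roughly 0.44

The posterior gives only moderate support to rates below 0.8, so the data alone do not yet establish that the design goal has been met.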


EXAMPLE : Cancer trials


A new treatment protocol is proposed for a particular form of cancer. The measure of
success will be the proportion of patients that survive longer than six months after diagnosis. With the present treatment, this success rate is 40%. Letting θ be the success rate of the new treatment, a doctor assesses her prior beliefs about θ as follows. She judges that her expectation of θ is E(θ) = 0.45, and her standard deviation is 0.07.
Assume that her beliefs can be represented by a beta distribution.

A clinical trial of the new treatment protocol is carried out. Out of 70 patients in the
trial, 34 survive beyond six months from diagnosis. Should the new treatment protocol
replace the old one?
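A sketch in R of one way to carry out this analysis: match a Beta(α, β) prior to the stated mean and standard deviation, then update with the trial data (34 successes out of 70).

m <- 0.45; s <- 0.07
n0 <- m * (1 - m) / s^2 - 1          # alpha + beta, about 49.5
a0 <- m * n0                         # about 22.3
b0 <- (1 - m) * n0                   # about 27.2
a1 <- a0 + 34; b1 <- b0 + (70 - 34)  # posterior Beta parameters
a1 / (a1 + b1)                       # posterior mean, about 0.47
1 - pbeta(0.40, a1, b1)              # P(theta > 0.40 | data), about 0.94

So the posterior probability that the new protocol beats the current 40% success rate is high, although much of the posterior mass is not far above 0.40.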


Normal Samples
These are the most widely used models in statistics: simple analytical formulas, a good first cut, easy integration, and a good approximation to many models.
Assume that x1, ..., xn are independent samples from a normal distribution with mean μ and variance τ: N(μ, τ).
The likelihood function is

p(x1, ..., xn | μ, τ) = ∏_{i=1}^n (2πτ)^{−1/2} exp(−(xi − μ)²/(2τ)),   −∞ < xi < ∞, τ > 0.

Note that in the normal examples, the completing-the-square technique enables us to track down analytic formulas for the posterior distributions.

Case 1. τ is known. We may have some prior information on μ.

Choose a conjugate prior for μ: μ ~ N(μ0, σ0²)
Posterior: Normal, with
mean = (μ0/σ0² + n x̄/τ) / (1/σ0² + n/τ),   variance = 1 / (1/σ0² + n/τ)

Case 2. τ is known. μ is unknown, with no prior information.

Take a flat prior for μ, for example p(μ) ∝ 1.
Posterior: μ | X ~ N(x̄, τ/n)
Case 3. μ is known.
Prior for τ: IG(a, b)
Posterior for τ: IG(a + n/2, b + Σ(xi − μ)²/2)

Example : Cavendish data



The British physicist Henry Cavendish (1731-1810) made 23 observations of the Earth's density:
5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10,
5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85
Min.     Mean     Max.     std.
5.100    5.485    5.850    0.192

Suppose that we model these as a sample from a N(μ, 0.25) distribution, where μ is the true density of the Earth. Consider that someone has the prior belief μ ~ N(5, 4).
A specific interest is to evaluate the hypothesis that μ > 5.5.
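A sketch of the Case 1 update in R for these data (taking the sampling variance 0.25 as known and the N(5, 4) prior at face value):

x <- c(5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10,
       5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85)
tau <- 0.25; m0 <- 5; v0 <- 4
v1 <- 1 / (1 / v0 + length(x) / tau)              # posterior variance
m1 <- v1 * (m0 / v0 + length(x) * mean(x) / tau)  # posterior mean, about 5.48
1 - pnorm(5.5, m1, sqrt(v1))                      # P(mu > 5.5 | data), about 0.44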


Predictive Inference

It is not always the parameters that are of primary interest. Often, the real problem
is to predict what will be observed in the future.
Predictive inference consists of making inference statements about future observations. There are obvious difficulties in trying to fit predictive inference into the frequentist framework.
Bayesian inference embraces predictive inference naturally. Parameters and future
observations are both random variables and all we need to do is to find the relevant
posterior distribution.
We wish to predict a future observation, say y, given that we have observed X. Then Bayesian inference about y is based on its posterior predictive distribution P(y|X):

P(y|X) = ∫ P(y, θ|X) dθ = ∫ P(y|θ, X) P(θ|X) dθ

Examples:
- Prediction of binary data
- Poisson data
- Normal model

- Continued example from Cancer trial data


Predict how many out of the next m = 20 patients will survive for 6 months.
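The predictive distribution here is beta-binomial. A minimal R sketch, reusing the Beta prior matched to the doctor's beliefs earlier (so the posterior is approximately Beta(56.28, 63.23)):

a1 <- 22.28 + 34; b1 <- 27.23 + 36
m  <- 20; k <- 0:m
pred <- choose(m, k) * beta(a1 + k, b1 + m - k) / beta(a1, b1)  # P(k survivors | data)
sum(pred)                 # sanity check: the probabilities sum to 1
sum(k * pred)             # predictive mean, about 9.4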


8   More choices of prior

8.1   Mixtures of conjugacy

We can approximate a prior distribution by a mixture of conjugate prior distributions to any required level of accuracy.
A mixture distribution for a random variable is a weighted sum (average) of several probability density functions (p.d.f.s):

p(x) = Σ_{j=1}^n aj pj(x)

where each pj(x) is a probability density function, aj > 0 for all j, and Σ_{j=1}^n aj = 1.

A prior for θ that is a mixture distribution of several conjugate priors has the form:

p(θ) = Σ_j aj pj(θ)

where each pj(θ) is a different member of the conjugate family.


The corresponding posterior distribution p(θ|X) is also a mixture of conjugates.

Example: Clinical trial data


Consider a mixture prior of Beta(22.28, 27.23) with Beta (40, 60).
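A sketch of the mixture update in R. The mixture weights are not stated above, so equal weights (0.5, 0.5) are assumed for illustration; each component is updated as usual, and the weights are re-weighted by how well each component predicted the data (its beta-binomial marginal likelihood):

a <- c(22.28, 40); b <- c(27.23, 60)     # the two beta components
w <- c(0.5, 0.5)                         # assumed prior weights
y <- 34; n <- 70                         # clinical trial data
marg   <- choose(n, y) * beta(a + y, b + n - y) / beta(a, b)
w_post <- w * marg / sum(w * marg)       # posterior mixture weights
a_post <- a + y; b_post <- b + n - y     # updated component parameters
sum(w_post * a_post / (a_post + b_post)) # posterior mean of theta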


8.2   Conditional conjugacy

(Note: this section is provided as side information.)


Definition of conditional conjugacy: the prior distribution p(θ, φ) is said to be conditionally conjugate to the likelihood L(θ, φ : X) given φ if it is conjugate when we consider both as functions of θ for fixed φ.
Conditional conjugacy given φ facilitates posterior analysis conditional on φ, and means that we can use a combination of mathematical and computational methods to obtain inferences. We can apply mathematical methods first to obtain results for θ conditional on φ, thereby expressing everything in terms of φ alone. Then we must use numerical methods to do the inferences, but by expressing things in terms of just φ we have reduced the dimension of the numerical computation.

Example: Normal sample with unknown mean and unknown variance


8.3   Numerical methods in Bayesian analysis

In cases where we are obliged to use non-conjugate or otherwise non-simple priors, we often end up with a posterior distribution that is not a known, standard distribution. Then there are no formulas to apply for drawing posterior inferences, and one needs to rely on numerical methods. There are various numerical methods that can be used; those most commonly used will be presented in detail later in the course with R/WinBUGS.

- Numerical Integration:
Integrals can be computed numerically. Simple methods of numerical integration are generally available, at least in one dimension, but they quickly become impractical as the number of parameters grows (the curse of dimensionality).
The major development in the 1990s has been a completely different computational approach known as Markov chain Monte Carlo (MCMC). We will cover this topic in detail later in the course.
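As a toy illustration of numerical integration (a grid approximation with an assumed, non-conjugate log-normal prior, not one of the examples in these notes):

y      <- c(1, 0, 1, 2, 1, 0)                            # the transmission error data
lambda <- seq(0.01, 5, by = 0.01)                        # grid of parameter values
prior  <- dlnorm(lambda, meanlog = 0, sdlog = 1)         # illustrative non-conjugate prior
lik    <- sapply(lambda, function(l) prod(dpois(y, l)))  # likelihood on the grid
post   <- prior * lik / sum(prior * lik * 0.01)          # normalize using the grid step
sum(lambda * post * 0.01)                                # posterior mean, computed numerically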

- Modes:
Computing a mode in general requires maximization, a process that can be computationally efficient even in quite high dimensions.
Often, to find a mode, we need to obtain the marginal posterior density, which means integrating out any other parameters, so integration is not avoided completely.

- Approximation: there is a general theorem to the effect that, as the sample size gets large, the posterior distribution tends to (multivariate) normality. Therefore, whenever there is a reasonably substantial quantity of data, we may be able to compute inferences by approximating the posterior distribution with a (multivariate) normal distribution.


9   Normal sample with unknown mean and variance

More probability distributions


- Yi ~ Inverse chi-square(p, q) distribution

f(yi : p, q) = (q/2)^{p/2} / Γ(p/2) · yi^{−(p/2 + 1)} exp(−q/(2yi)),   yi > 0, p > 0, q > 0

E[Yi] = q/(p − 2),   Var[Yi] = 2q² / ((p − 2)²(p − 4))   if p > 4

- Yi ~ tp(m, w): t-distribution with mean m, scale parameter w, and p degrees of freedom

f(yi : m, w, p) = Γ((p + 1)/2) / (Γ(p/2) √(πw)) · p^{p/2} · (p + w^{−1}(yi − m)²)^{−(p+1)/2},   −∞ < yi < ∞

- (μ, τ) ~ Normal inverse chi-square distribution NIC(p, q, m, v)

f(μ, τ : p, q, m, v) = (q/2)^{p/2} / (√(2πv) Γ(p/2)) · τ^{−(p+3)/2} exp(−{v^{−1}(μ − m)² + q}/(2τ)),   τ > 0

E[μ] = m,   Var[μ] = vq/(p − 2),   E[τ] = q/(p − 2),   Var[τ] = 2q²/((p − 2)²(p − 4))

Further properties of NIC distributions

If (μ, τ) ~ NIC(p, q, m, v), then
- μ | τ ~ N(m, vτ)
- τ | μ ~ IC(p + 1, q + v^{−1}(μ − m)²)
- μ ~ tp(m, qv/p)
- τ ~ IC(p, q)
Case 4. Both μ and τ are unknown.
Prior: NIC(p, q, m, v), the conjugate family of joint priors for μ and τ
Posterior: (μ, τ) | X ~ NIC(p1, q1, m1, v1)
where p1 = p + n,   q1 = q + S + (v + n^{−1})^{−1}(x̄ − m)²,   m1 = (v^{−1}m + n x̄)/(v^{−1} + n),   and v1 = (v^{−1} + n)^{−1}.
Note that S = Σ_{i=1}^n (xi − x̄)².


p((μ, τ)|X) ∝ p(X|(μ, τ)) p(μ, τ)

∝ τ^{−n/2} exp(−[n(μ − x̄)² + S]/(2τ)) · τ^{−(p+3)/2} exp(−{v^{−1}(μ − m)² + q}/(2τ))

∝ τ^{−(n+p+3)/2} exp(−{(v^{−1} + n)[μ − (v^{−1}m + n x̄)/(v^{−1} + n)]² + q1}/(2τ))
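A small R helper implementing the Case 4 update (a sketch; the function name and the example prior values are illustrative, not from the notes):

nic_update <- function(x, p, q, m, v) {
  n <- length(x); xbar <- mean(x); S <- sum((x - xbar)^2)
  v1 <- 1 / (1 / v + n)
  m1 <- (m / v + n * xbar) * v1
  list(p = p + n,
       q = q + S + (xbar - m)^2 / (v + 1 / n),
       m = m1, v = v1)
}
# e.g. nic_update(x, p = 1, q = 0.1, m = 5, v = 1) with the Cavendish data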
Implications of the NIC prior distribution
Note that although the choice of an NIC prior is very convenient for the analysis of a normal sample, it may not generally represent prior information adequately in practice.
The reason is the relationship between μ and τ that the NIC distribution imposes: when τ is larger, the uncertainty about μ is also larger, suggesting that μ may be far from its prior mean.
Marginal inference about μ:
μ | X ~ t_{p1}(m1, q1 v1/p1)
E[μ|X] = m1 = (v^{−1}m + n x̄)/(v^{−1} + n),   Var[μ|X] = q1 v1/(p1 − 2)

Marginal inference about τ:

τ | X ~ IC(p1, q1)
E[τ|X] = q1/(p1 − 2),   Var[τ|X] = 2q1²/((p1 − 2)²(p1 − 4))

Example: Cavendish data


Prediction in Normal model

Suppose that we wish to predict the mean of k future observations, i.e. Ȳ = (1/k) Σ_{i=1}^k Yi.

We know that Ȳ | μ, τ ~ N(μ, τ/k).

Deriving the predictive distribution from the posterior is hard work.
Note that μ | τ, X ~ N(m1, v1 τ), so
Ȳ | τ, X ~ N(m1, (v1 + 1/k)τ).
Since τ | X ~ IC(p1, q1) and

(Ȳ, τ) | X ~ NIC(p1, q1, m1, v1 + 1/k),

Ȳ | X ~ t_{p1}(m1, q1(v1 + 1/k)/p1).


Case 5. Both μ and τ are unknown, with no prior information on either.
Noninformative prior: p(μ, τ) ∝ 1/τ

- Marginal posterior for μ: μ | X ~ t_{n−1}(x̄, S/(n(n − 1)))

- Marginal posterior for τ: τ | X ~ IC(n − 1, S)

10   Two Normal samples

We often want to compare two populations on the basis of two samples: Y11, ..., Y1n1 and Y21, ..., Y2n2.
Assume Y1i ~ N(μ1, τ1), i = 1, ..., n1, and Y2j ~ N(μ2, τ2), j = 1, ..., n2, independent of the Y1i's.
Quantity of interest: δ = μ1 − μ2
i) Case 1. τ1 and τ2 are assumed to be known
ii) Case 2. τ1 and τ2 are unknown but assumed equal, τ1 = τ2
iii) Case 3. τ1 and τ2 are unknown and τ1 ≠ τ2


Frequentist approach
i)
z = (Ȳ1 − Ȳ2 − (μ1 − μ2)) / √(τ1/n1 + τ2/n2)

ii)
t(pooled) = (Ȳ1 − Ȳ2 − (μ1 − μ2)) / (S_pooled √(1/n1 + 1/n2)),
where S²_pooled = (S1 + S2)/(n1 + n2 − 2)

iii)
t* = (Ȳ1 − Ȳ2 − (μ1 − μ2)) / √(S1/n1 + S2/n2)
Bayesian approach
i) τ1 and τ2 are assumed to be known
We can take independent reference priors for μ1 and μ2: p(μ1, μ2) ∝ 1
Posterior: μ1 − μ2 | Y ~ N(Ȳ1 − Ȳ2, τ1/n1 + τ2/n2)
ii) τ1 = τ2 = τ, unknown

likelihood: p(Y | μ1, μ2, τ) ∝ τ^{−n1/2} exp(−[n1(μ1 − Ȳ1)² + S1]/(2τ)) · τ^{−n2/2} exp(−[n2(μ2 − Ȳ2)² + S2]/(2τ))

a. Take independent priors, uniform in μ1 and μ2, with p(μ1, μ2, τ) ∝ 1/τ.

p(μ1, μ2, τ | Y) ∝ τ^{−(n1+n2+2)/2} exp(−(S1 + S2)/(2τ)) exp(−n1(μ1 − Ȳ1)²/(2τ)) exp(−n2(μ2 − Ȳ2)²/(2τ))

μ1 − μ2 | Y ~ t_{n1+n2−2}( Ȳ1 − Ȳ2 , (S1 + S2)/(n1 + n2 − 2) · (1/n1 + 1/n2) )

b. Take normal priors for μ1 and μ2 conditionally on τ, and an IG(a, b) prior for τ:

p(μ1, μ2, τ) = p(μ1|τ) p(μ2|τ) p(τ)

Writing the conditional priors as μi | τ ~ N(μi0, τ/ni0), we then obtain
μi | τ, Y ~ N( (ni0 μi0 + ni Ȳi)/(ni0 + ni) , τ/(ni0 + ni) )

c. Consider the linear model setting with an NIC joint prior for μ1, μ2, τ (taken up later, after we discuss the linear model in the Bayesian approach).

iii) τ1 ≠ τ2, unknown
Take the conjugate family of NIC distributions independently for (μ1, τ1) and (μ2, τ2).
Posterior: (μi, τi) | Y ~ NIC(pi*, qi*, mi*, vi*)
As before, μi | Y ~ t_{pi*}(mi*, qi* vi*/pi*)

11   Linear Models

Normal linear models with conjugate prior distributions:

Y = Xβ + ε
where X is an n × r matrix of known constants, β is an r × 1 vector of regression coefficients, and ε ~ N(0, τI).

Likelihood: Y | β, τ ~ N(Xβ, τI)
Prior: β, τ ~ multivariate NIC(p, q, m, V), i.e.

p(β, τ) ∝ τ^{−(p+r+2)/2} exp( −{q + (β − m)^T V^{−1} (β − m)} / (2τ) )

E(β) = m,   Var(β) = V q/(p − 2)

- Properties of the multivariate NIC(p, q, m, V):

β | τ ~ N(m, Vτ),   β ~ t_p(m, q p^{−1} V)
τ | β ~ IC(p + r, q + (β − m)^T V^{−1}(β − m)),   τ ~ IC(p, q)

Posterior: β, τ | Y ~ NIC(p*, q*, m*, V*)
β | Y ~ t_{p*}(m*, (q*/p*) V*), where
p* = p + n
q* = q + S + (β̂ − m)^T (V + (X^T X)^{−1})^{−1} (β̂ − m),   S = (Y − Xβ̂)^T (Y − Xβ̂)
m* = V*(V^{−1} m + (X^T X) β̂)
V* = (V^{−1} + X^T X)^{−1}
and β̂ = (X^T X)^{−1} X^T Y is the least squares estimate.

For predictive inference, consider predicting a single observation y0 when the vector of covariates is x0. We find that
y0 | Y ~ t_{p*}( x0^T m*, (q*/p*) (1 + x0^T V* x0) )
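A compact R sketch of these posterior formulas (the helper name is mine; the inputs are the data y, the design matrix X and the NIC prior parameters):

nic_lm_update <- function(y, X, p, q, m, V) {
  XtX   <- crossprod(X)                    # X'X
  bhat  <- solve(XtX, crossprod(X, y))     # least squares estimate
  S     <- sum((y - X %*% bhat)^2)
  Vstar <- solve(solve(V) + XtX)
  mstar <- Vstar %*% (solve(V) %*% m + XtX %*% bhat)
  qstar <- q + S + t(bhat - m) %*% solve(V + solve(XtX)) %*% (bhat - m)
  list(p = p + length(y), q = drop(qstar), m = drop(mstar), V = Vstar)
}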

11.1   Simple regression

Observations Yi are independent given (α, β, τ):

Yi | α, β, τ ~ N(α + β xi, τ)
This is a linear model in which r = 2 and the coefficient vector is (α, β)^T.
For the conjugate NIC(p, q, m, V) prior,
m = (E(α), E(β))^T,   Var((α, β)^T) = E(τ) V = q/(p − 2) V,   and Var(τ) = 2q²/((p − 2)²(p − 4)).

The posterior distribution can be obtained using the general formulae in a straightforward way. However, the Bayesian analysis is complex enough that no simple formulas can be found for the posterior means of α and β, as there are for the frequentist estimates.

11.2   Two normal samples in linear model formulation

Data: Yij, for i = 1, 2 and j = 1, 2, ..., ni, with r = 2. Set n = n1 + n2.

Assume that the variances of the two samples are equal, τ1 = τ2 = τ, as in ii) c. of the two-sample analysis above.
Observations Yij have means E(Yij | μ1, μ2) = μi and Var(Yij | μ1, μ2, τ) = τ.
Prior: conjugate NIC prior for μ1, μ2, τ
Posterior: μ1, μ2, τ | Y ~ NIC, as before in the general linear model analysis.
Theorem 11.1. Let (β, τ) ~ NIC(p, q, m, V) and let A be a matrix with full row rank. Set δ = Aβ. Then (δ, τ) ~ NIC(p, q, Am, A V A^T).
For δ = μ1 − μ2, set A = (1, −1) and partition the posterior matrix as
V* = [ v11*  v12* ;  v21*  v22* ].
Then

(δ, τ) | Y ~ NIC( p*, q*, m1* − m2*, v11* + v22* − 2 v12* ),

so that δ | Y ~ t_{p*}( m1* − m2*, (q*/p*)(v11* + v22* − 2 v12*) ).

If the prior covariance between μ1 and μ2 is zero, then

δ | Y ~ t_{p+n}( m1* − m2*, (q*/p*)(v11* + v22*) )

EXAMPLE: Cuckoo eggs found in the nests of two different host species were examined. The lengths of the eggs (in mm) are
Hedge-sparrow: 22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0

Wren: 19.8, 22.1, 21.5, 20.9, 22.0, 21.0, 22.3, 21.0, 20.3, 20.9, 22.0, 20.0, 20.8, 21.2, 21.0

We would like to compare the two groups and find out in which host species' nests the cuckoos lay larger eggs on average.
i) Allowing unequal population variances and assuming independent weak NIC(−1, 0, m, 0^{−1}) priors (i.e. q = 0 and prior precision v^{−1} = 0), obtain the posterior distribution for the mean difference δ = μ1 − μ2 given the data.
ii) Assuming the weak NIC(−2, 0, m, 0^{−1} I) prior and equal population variances for both species, obtain the posterior distribution for the mean difference δ = μ1 − μ2 given the data.
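For part ii), in the noninformative limit the posterior for δ coincides numerically with the pooled two-sample analysis of case ii) a. above; a quick R sketch (taking that limit at face value):

hs   <- c(22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0)
wren <- c(19.8, 22.1, 21.5, 20.9, 22.0, 21.0, 22.3, 21.0, 20.3, 20.9, 22.0, 20.0, 20.8, 21.2, 21.0)
n1 <- length(hs); n2 <- length(wren)
S1 <- sum((hs - mean(hs))^2); S2 <- sum((wren - mean(wren))^2)
df    <- n1 + n2 - 2
loc   <- mean(hs) - mean(wren)                      # posterior location of delta, about 2.0
scale <- sqrt((S1 + S2) / df * (1 / n1 + 1 / n2))   # posterior scale of delta
1 - pt((0 - loc) / scale, df)                       # P(delta > 0 | data)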


11.3   Other multiparameter models

Theorem by Lindley and Smith (1972), JRSS B: General Bayesian Linear Model
Likelihood: Y | θ1 ~ N(A1 θ1, C1), where A1 and C1 are known.
Prior: θ1 ~ N(A2 θ2, C2), where A2, θ2 and C2 are known.
Posterior: θ1 | Y ~ N(Bb, B)
where B = (A1^T C1^{−1} A1 + C2^{−1})^{−1} and b = A1^T C1^{−1} Y + C2^{−1} A2 θ2
Marginal distribution of Y: Y ~ N(A1 A2 θ2, C1 + A1 C2 A1^T)
Linear Model with known variance (revisited)
Likelihood: Y | β ~ N(Xβ, τI), with τ known
Prior: β ~ N(β0, V)
Multivariate Normal data
Likelihood: Y | μ ~ N(μ, Σ), where Σ is known.
Prior: μ ~ N(μ0, Σ0)
Multinomial data
Yj: the number of observations in the jth outcome category, j = 1, ..., k
Likelihood: Y | θ ~ Multinomial(n, θ), where n = Σj Yj
Prior: Dirichlet D(α1, ..., αk)

p(θ) = Γ(α1 + ... + αk) / (Γ(α1) · · · Γ(αk)) · θ1^{α1−1} · · · θk^{αk−1},   Σj θj = 1

Posterior: Dirichlet D(α1 + y1, ..., αk + yk)


Example: A survey of 1447 adults in the US was carried out to find out their preferences in the presidential election.
Y1 = 727 supported G. Bush, Y2 = 583 supported M. Dukakis, and Y3 = 173 had no opinion. Take a noninformative D(1, 1, 1) prior.
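A sketch in R of posterior inference for this survey; Dirichlet draws can be generated from independent gamma variables, so no extra package is needed:

alpha_post <- c(1, 1, 1) + c(727, 583, 173)       # Dirichlet posterior parameters
rdirichlet <- function(nsim, alpha) {
  g <- matrix(rgamma(nsim * length(alpha), shape = alpha), nrow = nsim, byrow = TRUE)
  g / rowSums(g)
}
theta <- rdirichlet(10000, alpha_post)
mean(theta[, 1] > theta[, 2])                     # posterior P(Bush support exceeds Dukakis support)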


12   Hierarchical Modeling

Exchangeability:
Random variables X1, ..., Xn are exchangeable if the joint distribution of any m distinct Xi's is the same as the joint distribution of any other selection of m distinct Xi's, and this holds for all m ≤ n.
Whenever we model observations as being a sample from a distribution with some unknown parameters, the frequentist model assumes that they are independent and identically distributed (i.i.d.). In the Bayesian approach they are i.i.d. conditional on the parameters, and the idea of exchangeability is needed to construct a probability model for the data.
The relevance of the concept of exchangeability is due to the fact that it allows one to state a very powerful result known as the representation theorem of De Finetti (1937). The theory is complicated and here we only consider the implication of his theorem without stating it:
When X1, ..., Xn are part of an infinite sequence of exchangeable random variables, we can consider them as being a random sample from a distribution with some unknown parameters.
Prior modeling: how is prior information structured in more complex models?
When we have many parameters, it can be a very large task to think about the joint
prior distribution of all these parameters (especially if our prior knowledge is such that
the parameters are not independent). In such situations, it can be very helpful to think
about modeling the parameters, in the same way as we construct models for data.
Hierarchical models
In the simplest hierarchical model, we first state the likelihood with the usual statistical model that expresses the distribution of the data X conditional on unknown parameters θ. We then have a prior model that expresses the (prior) distribution of the parameters θ
conditional on unknown hyperparameters φ. We build the hierarchy with a probability distribution (called a hyperprior distribution) for φ.
A more complex hierarchical model might add a further layer to the hierarchy by modeling the distribution of φ conditional on a further level of parameters ψ (hyper-hyperparameters). Then at the end of the hierarchy we have a distribution for ψ.
NOTE: we can build the hierarchy either way up, by setting up the data model p(X|θ) as the top level of the hierarchy or at the bottom.

How do we analyze hierarchical models? There are several possibilities, and one of the advantages of the hierarchical formulation is the flexibility it offers for inference. We wish to obtain the posterior distribution of θ and φ.
When we set up a hierarchical model, we model the data by writing it conditional on θ as p(X|θ), but the implication is that the distribution of X depends on the parameters θ and not on the parameters φ introduced at the next level of the hierarchy. So p(X|θ) is also p(X|θ, φ). It is simple to obtain p(θ, φ) = p(θ|φ)p(φ), and so
p(θ, φ | X) ∝ p(X|θ) p(θ|φ) p(φ)
EXAMPLE: Twenty-three similar houses took part in an experiment on the effect of ceiling insulation on heating requirements. Five different levels of ceiling insulation were installed, and the amount of heat used (in kilowatt-hours) to heat each house over a given month was measured.
Insulation (inches)    Heat required (kWh)
4                      14.4, 14.8, 15.2, 14.3, 14.6
                       14.5, 14.1, 14.6, 14.2
                       13.8, 14.1, 13.7, 13.6, 14.0
10                     13.0, 13.4, 13.2
12                     13.1, 12.8, 12.9, 13.2, 13.3, 12.7

