
Linear Regression Models: A Bayesian perspective

Ingredients of a linear model include an $n \times 1$ response
vector $y = (y_1, \ldots, y_n)^T$ and an $n \times p$ design matrix (e.g.
including regressors) $X = [x_1, \ldots, x_p]$, assumed to have
been observed without error. The linear model:
$$y = X\beta + \epsilon; \quad \epsilon \sim N(0, \sigma^2 I)$$
The linear model is the most fundamental of all serious
statistical models encompassing:
ANOVA: $y$ is continuous, the $x_i$'s are categorical
REGRESSION: $y$ is continuous, the $x_i$'s are continuous
ANCOVA: $y$ is continuous, some $x_i$'s are continuous, some
categorical.

Unknown parameters include the regression parameters $\beta$
and the variance $\sigma^2$. We assume $X$ is observed without
error and all inference is conditional on $X$.
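As a running example for the snippets that follow, here is a minimal Python sketch that simulates data from this model; the sample size, number of regressors, and the values of $\beta$ and $\sigma$ are illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimal sketch: simulate data from y = X beta + eps, eps ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 1.5
y = X @ beta_true + rng.normal(0.0, sigma_true, size=n)
```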

Linear Regression Models: A Bayesian perspective

The classical unbiased estimates of the regression
parameter $\beta$ and $\sigma^2$ are
$$\hat{\beta} = (X^T X)^{-1} X^T y; \quad \hat{\sigma}^2 = \frac{1}{n-p} (y - X\hat{\beta})^T (y - X\hat{\beta}).$$
The above estimate of $\beta$ is also a least-squares estimate.
The predicted value of $y$ is given by
$$\hat{y} = X\hat{\beta} = P_X y, \quad \text{where } P_X = X(X^T X)^{-1} X^T.$$
$P_X$ is called the projector of $X$. It projects any vector to the
space spanned by the columns of $X$.
The model residual is estimated as:
$$\hat{e} = (y - X\hat{\beta})^T (y - X\hat{\beta}) = y^T (I - P_X) y.$$
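Continuing the simulated example above, these classical quantities can be computed in a few lines (variable names such as beta_hat and P_X are just illustrative):

```python
# Classical (least-squares) estimates and the projector P_X,
# following the formulas above; X, y, n, p come from the previous snippet.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)  # unbiased estimate of sigma^2
P_X = X @ XtX_inv @ X.T               # projects onto the column space of X
y_hat = P_X @ y                       # equals X @ beta_hat
```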

Bayesian regression with flat reference priors

For Bayesian analysis, we will need to specify priors for the
unknown regression parameters $\beta$ and the variance $\sigma^2$.
Consider independent flat priors on $\beta$ and $\log \sigma^2$:
$$p(\beta) \propto 1; \quad p(\log(\sigma^2)) \propto 1, \quad \text{or equivalently} \quad p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}.$$

Neither of the above two distributions is a valid probability
distribution (they do not integrate to any finite number). So why is it
that we are even discussing them?
It turns out that even if the priors are improper (that's what
we call them), as long as the resulting posterior
distributions are valid, we can still conduct legitimate
statistical inference based on them.

Marginal and conditional distributions

With a flat prior on $\beta$ we obtain, after some algebra, the
conditional posterior distribution:
$$p(\beta \mid \sigma^2, y) = N\left(\beta \mid (X^T X)^{-1} X^T y, \; \sigma^2 (X^T X)^{-1}\right).$$
The conditional posterior distribution of $\beta$ would have been
the desired posterior distribution had $\sigma^2$ been known.
Since that is not the case, we need to obtain the marginal
posterior distribution by integrating out $\sigma^2$ as:
$$p(\beta \mid y) = \int p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)\, d\sigma^2$$
Can we solve this integration using composition sampling?
YES: if we can generate samples from $p(\sigma^2 \mid y)$!

Marginal and conditional distributions

So, we need to find the marginal posterior distribution of
$\sigma^2$. With the choice of the flat prior we obtain:
$$p(\sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{(n-p)/2 + 1} \exp\left(-\frac{(n-p)s^2}{2\sigma^2}\right) = IG\left(\sigma^2 \,\Big|\, \frac{n-p}{2}, \frac{(n-p)s^2}{2}\right),$$
where $s^2 = \hat{\sigma}^2 = \frac{1}{n-p}\, y^T (I - P_X) y$.
This is known as an inverted Gamma distribution (also
called a scaled inverse chi-square distribution)
$IG\left(\sigma^2 \mid (n-p)/2, (n-p)s^2/2\right)$.
In other words: $\left[(n-p)s^2/\sigma^2 \mid y\right] \sim \chi^2_{n-p}$ (with $n-p$
degrees of freedom). A striking similarity with the classical
result: the distribution of $\hat{\sigma}^2$ is also characterized by
$(n-p)s^2/\sigma^2$ following a chi-square distribution.
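As a quick numerical check (a sketch assuming scipy is available and reusing n, p, rng, and sigma2_hat from the earlier snippets), the inverted Gamma draws below and the draws obtained through the chi-square characterization have the same distribution:

```python
from scipy import stats

# Draw from p(sigma^2 | y) = IG((n-p)/2, (n-p)s^2/2) and, equivalently,
# via (n-p)s^2 / sigma^2 ~ chi^2_{n-p}.
s2 = sigma2_hat
a_post, b_post = (n - p) / 2.0, (n - p) * s2 / 2.0
sigma2_ig = stats.invgamma.rvs(a_post, scale=b_post, size=5000, random_state=rng)
sigma2_chisq = (n - p) * s2 / stats.chi2.rvs(n - p, size=5000, random_state=rng)
```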

Composition sampling for linear regression

Now we are ready to carry out composition sampling from
$p(\beta, \sigma^2 \mid y)$ as follows:
Draw $M$ samples from $p(\sigma^2 \mid y)$:
$$\sigma^{2(j)} \sim IG\left(\frac{n-p}{2}, \frac{(n-p)s^2}{2}\right), \quad j = 1, \ldots, M$$
For $j = 1, \ldots, M$, draw $\beta^{(j)}$ from $p(\beta \mid \sigma^{2(j)}, y)$:
$$\beta^{(j)} \sim N\left((X^T X)^{-1} X^T y, \; \sigma^{2(j)} (X^T X)^{-1}\right)$$
The resulting samples $\{\beta^{(j)}, \sigma^{2(j)}\}_{j=1}^{M}$ represent $M$
samples from $p(\beta, \sigma^2 \mid y)$.
$\{\beta^{(j)}\}_{j=1}^{M}$ are samples from the marginal posterior
distribution $p(\beta \mid y)$. This is a multivariate t density:
$$p(\beta \mid y) = \frac{\Gamma(n/2)}{\left((n-p)\pi\right)^{p/2}\, \Gamma\!\left((n-p)/2\right)\, \left|s^2 (X^T X)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T (X^T X) (\beta - \hat{\beta})}{(n-p)s^2}\right]^{-n/2}$$
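A minimal sketch of this composition sampler, reusing beta_hat, XtX_inv, s2, and rng from the earlier snippets and scipy's IG(shape, scale) parameterization:

```python
# Composition sampling: sigma^2 first, then beta given sigma^2.
M = 5000
sigma2_samps = stats.invgamma.rvs((n - p) / 2.0, scale=(n - p) * s2 / 2.0,
                                  size=M, random_state=rng)
beta_samps = np.empty((M, p))
for j in range(M):
    # beta | sigma^2, y ~ N((X^T X)^{-1} X^T y, sigma^2 (X^T X)^{-1})
    beta_samps[j] = rng.multivariate_normal(beta_hat, sigma2_samps[j] * XtX_inv)
```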

Composition sampling for linear regression

The marginal distribution of each individual regression
parameter $\beta_j$ is a non-central univariate $t_{n-p}$ distribution.
In fact,
$$\frac{\beta_j - \hat{\beta}_j}{s \sqrt{(X^T X)^{-1}_{jj}}} \sim t_{n-p}.$$
The 95% credible intervals for each $\beta_j$ are constructed
from the quantiles of the t-distribution. The credible
intervals exactly coincide with the 95% classical confidence
intervals, but the interpretation is direct: the probability of $\beta_j$
falling in that interval, given the observed data, is 0.95.
Note: an intercept-only linear model reduces to the simple
univariate $N(\bar{y} \mid \mu, \sigma^2/n)$ likelihood, for which the marginal
posterior of $\mu$ is:
$$\frac{\mu - \bar{y}}{s/\sqrt{n}} \sim t_{n-1}.$$
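For instance, a 95% interval for a single coefficient follows directly from the t quantiles (continuing the earlier snippets; the coefficient index below is arbitrary):

```python
# 95% credible interval for beta_j from its t_{n-p} marginal posterior;
# numerically identical to the classical 95% confidence interval.
j = 1
se_j = np.sqrt(s2 * XtX_inv[j, j])
t_crit = stats.t.ppf(0.975, df=n - p)
ci_j = (beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j)
```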

Bayesian predictions from the linear model

Suppose we have observed the new predictors $\tilde{X}$, and we
wish to predict the outcome $\tilde{y}$. We specify $p(y, \tilde{y} \mid \beta, \sigma^2)$ to be a
normal distribution:
$$\begin{pmatrix} y \\ \tilde{y} \end{pmatrix} \sim N\left( \begin{pmatrix} X\beta \\ \tilde{X}\beta \end{pmatrix}, \; \sigma^2 \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} \right)$$
Note $p(\tilde{y} \mid y, \beta, \sigma^2) = p(\tilde{y} \mid \beta, \sigma^2) = N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I)$.
The posterior predictive distribution:
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 = \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2.$$
By now we are comfortable evaluating such integrals:
First obtain: $(\beta^{(j)}, \sigma^{2(j)}) \sim p(\beta, \sigma^2 \mid y)$, $j = 1, \ldots, M$
Next draw: $\tilde{y}^{(j)} \sim N(\tilde{X}\beta^{(j)}, \sigma^{2(j)} I)$.
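A sketch of this posterior predictive sampling scheme, reusing the composition-sampling draws from the earlier snippet; X_tilde is a hypothetical matrix of new predictor values:

```python
# For each posterior draw, simulate y_tilde ~ N(X_tilde beta^(j), sigma^2(j) I).
X_tilde = np.column_stack([np.ones(5), rng.standard_normal((5, p - 1))])
y_tilde = np.empty((M, X_tilde.shape[0]))
for j in range(M):
    y_tilde[j] = rng.normal(X_tilde @ beta_samps[j], np.sqrt(sigma2_samps[j]))
```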

Gibbs sampler for the linear regression model

Consider the linear model with $p(\sigma^2) = IG(\sigma^2 \mid a, b)$ and
$p(\beta) \propto 1$.
The Gibbs sampler proceeds by computing the full
conditional distributions:
$$p(\beta \mid y, \sigma^2) = N\left(\beta \mid (X^T X)^{-1} X^T y, \; \sigma^2 (X^T X)^{-1}\right)$$
$$p(\sigma^2 \mid y, \beta) = IG\left(\sigma^2 \,\Big|\, a + n/2, \; b + \tfrac{1}{2}(y - X\beta)^T (y - X\beta)\right).$$
Thus, the Gibbs sampler will initialize $(\beta^{(0)}, \sigma^{2(0)})$ and
draw, for $j = 1, \ldots, M$:
Draw $\beta^{(j)} \sim N\left((X^T X)^{-1} X^T y, \; \sigma^{2(j-1)} (X^T X)^{-1}\right)$
Draw $\sigma^{2(j)} \sim IG\left(a + n/2, \; b + \tfrac{1}{2}(y - X\beta^{(j)})^T (y - X\beta^{(j)})\right)$
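A minimal Gibbs sampler sketch under these full conditionals; the hyperparameters a and b and the initialization are illustrative, and X, y, n, p, M, beta_hat, XtX_inv, rng carry over from the earlier snippets:

```python
# Gibbs sampler for a flat prior on beta and p(sigma^2) = IG(a, b).
a, b = 2.0, 1.0                                  # illustrative IG prior hyperparameters
beta_gibbs = np.empty((M, p))
sigma2_gibbs = np.empty(M)
sigma2_curr = 1.0                                # initialization sigma^2(0)
for j in range(M):
    # beta^(j) | sigma^2(j-1), y
    beta_curr = rng.multivariate_normal(beta_hat, sigma2_curr * XtX_inv)
    resid = y - X @ beta_curr
    # sigma^2(j) | beta^(j), y
    sigma2_curr = stats.invgamma.rvs(a + n / 2.0, scale=b + 0.5 * resid @ resid,
                                     random_state=rng)
    beta_gibbs[j], sigma2_gibbs[j] = beta_curr, sigma2_curr
```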

Metropolis algorithm for the linear model

Example: For the linear model, our parameters are $(\beta, \sigma^2)$. We write $\theta = (\beta, \log(\sigma^2))$ and, at the $j$-th
iteration, propose $\theta^* \sim N(\theta^{(j-1)}, \Sigma)$. The log transformation on $\sigma^2$ ensures that all components of $\theta$
have support on the entire real line and can have meaningful proposed values from the multivariate normal.
But we need to transform our prior to $p(\beta, \log(\sigma^2))$.

Let $z = \log(\sigma^2)$ and assume $p(\beta, z) = p(\beta)\,p(z)$. Let us derive $p(z)$. REMEMBER: we need to adjust
for the Jacobian. Then $p(z) = p(\sigma^2)\,|d\sigma^2/dz| = p(e^z)\,e^z$. The Jacobian here is $e^z = \sigma^2$.

Let $p(\beta) \propto 1$ and $p(\sigma^2) = IG(\sigma^2 \mid a, b)$. Then the log-posterior is:
$$-(a + n/2 + 1)z + z - e^{-z}\left\{b + \frac{1}{2}(Y - X\beta)^T (Y - X\beta)\right\}.$$

A symmetric proposal distribution, say $q(\theta^* \mid \theta^{(j-1)}, \Sigma) = N(\theta^{(j-1)}, \Sigma)$, cancels out in $r$. In practice
it is better to compute $\log(r)$: $\log(r) = \log p(\theta^* \mid y) - \log p(\theta^{(j-1)} \mid y)$. For the proposal,
$N(\theta^{(j-1)}, \Sigma)$, $\Sigma$ is a $d \times d$ variance-covariance matrix, and $d = \dim(\theta) = p + 1$.

If $\log r \geq 0$ then set $\theta^{(j)} = \theta^*$. If $\log r < 0$ then draw $U \sim U(0, 1)$. If $U \leq r$ (or $\log U \leq \log r$) then
$\theta^{(j)} = \theta^*$. Otherwise, $\theta^{(j)} = \theta^{(j-1)}$.

Repeat the above procedure for $j = 1, \ldots, M$ to obtain samples $\theta^{(1)}, \ldots, \theta^{(M)}$.
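A random-walk Metropolis sketch on $\theta = (\beta, \log \sigma^2)$ using the log-posterior above; the proposal covariance Sigma_prop, the number of iterations, and the starting value are illustrative choices, and X, y, n, p, a, b, M, beta_hat, s2, rng carry over from the earlier snippets:

```python
# Log-posterior of theta = (beta, z) with z = log(sigma^2), flat prior on beta,
# IG(a, b) prior on sigma^2, including the Jacobian term (+z).
def log_post(theta):
    beta_t, z = theta[:p], theta[p]
    resid = y - X @ beta_t
    return -(a + n / 2.0 + 1.0) * z + z - np.exp(-z) * (b + 0.5 * resid @ resid)

d = p + 1
Sigma_prop = 0.01 * np.eye(d)                     # illustrative proposal covariance
theta = np.concatenate([beta_hat, [np.log(s2)]])  # starting value
chain = np.empty((M, d))
for j in range(M):
    prop = rng.multivariate_normal(theta, Sigma_prop)
    log_r = log_post(prop) - log_post(theta)
    if np.log(rng.uniform()) <= log_r:            # accept with probability min(1, r)
        theta = prop
    chain[j] = theta
```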

