
Generalized Linear Models

Contents

1 Introducing GLMs
  1.1 Examples of GLMs
  1.2 Inference with GLMs
  1.3 glm in R
  1.4 glm in R: heart attack example

2 Theory of GLMs
  2.1 The exponential family of distributions
  2.2 Fitting generalized linear models
  2.3 The sampling distribution of β̂
  2.4 Comparing models by hypothesis testing
      2.4.1 Deviance
      2.4.2 Model comparison with unknown φ
  2.5 φ̂ and Pearson's statistic
  2.6 Residuals and model checking
      2.6.1 Pearson residuals
      2.6.2 Deviance residuals
      2.6.3 Residual plots

3 Linking computation and theory
  3.1 Model formulae and the specification of GLMs
      3.1.1 An example

4 Using the distributional results
  4.1 Confidence interval calculation
  4.2 Single parameter tests
  4.3 Hypothesis testing by model comparison
      4.3.1 Known scale parameter example
      4.3.2 Unknown scale parameter testing example

5 Model selection more generally
  5.1 Hypothesis testing based model selection
  5.2 Prediction error based model selection
  5.3 Remarks on model selection
  5.4 Interpreting model coefficients

1 Introducing GLMs

A linear model is a statistical model that can be written

    y_i = X_i β + ε_i,    ε_i ~ N(0, σ²) i.i.d.,

where y_i is a response variable, X is a model matrix with elements usually depending on some predictor variables, and the ε_i are zero-mean Gaussian random variables. β is a vector of unknown parameters, and the purpose of statistical inference with a linear model is to learn about β from the data.
An exactly equivalent way of writing the linear model is

    E(y_i) ≡ μ_i = X_i β,    y_i ~ N(μ_i, σ²) independent.

Generalized linear models extend linear models by allowing some non-linearity in the model structure and much more flexibility in the specification of the distribution of the response variable y_i. Specifically, a GLM is a statistical model that can be written as

    E(y_i) ≡ μ_i = h(X_i β),    y_i ~ Exponential family distribution, independent,

where h is any smooth monotonic function and the exponential family of distributions includes the Poisson, Gaussian (normal), binomial and gamma distributions. A feature of exponential family distributions is that their shape is largely determined by their mean, μ_i, and possibly one other scale parameter, usually denoted φ (e.g. for the normal distribution φ = σ²; for the Poisson, φ = 1). For such distributions it is always possible to find a variance function V of μ_i such that

    var(y_i) = V(μ_i) φ.

As we will see later, it is actually possible to relax the GLM distributional assumption and simply specify V, using the theory of quasi-likelihood.
For historical reasons it is usual to write GLMs in terms of the (smooth monotonic) link function, g, which is the inverse function of h, i.e.

    g(μ_i) = X_i β,    y_i ~ Exponential family distribution, independent.

Examples of commonly used link functions are the log, square root and logit (log odds ratio) functions (see later). The model is written this way because statisticians were (and are) used to thinking about models for transformed response data (i.e. transformed y_i). However, it is important to realize that modelling some data using a log link (for example) is very different from modelling log(y_i) using a linear model. Note that Xβ is known as the linear predictor (and is often given the symbol η).
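For concreteness, R (used later in these notes) bundles the link g, its inverse h and the variance function V into "family" objects. This is a quick illustrative sketch using only standard stats functions, not part of the original text:

```r
## a family object carries the link, inverse link and variance function
fam <- poisson(link = "log")
fam$linkfun(10)        ## g(mu)  = log(mu)
fam$linkinv(log(10))   ## h(eta) = exp(eta); recovers 10
fam$variance(10)       ## V(mu) = mu for the Poisson
```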

1.1 Examples of GLMs

Example 1: AIDS in Belgium.

[Figure: new AIDS cases per year in Belgium plotted against years since 1980.]

(Predictor variables are also referred to as explanatory variables or covariates.)

The above figure shows new AIDS cases each year in Belgium, at the start of the epidemic. In the early stages of an epidemic an exponential increase model is often appropriate, and a GLM can be used to fit such a model. If y_i is the number of new AIDS cases per year and t_i denotes the number of years since 1980, then a suitable model for the data might be

    E(y_i) ≡ μ_i = γ exp(δ t_i),    y_i ~ Poi(μ_i), independent.

Taking logs of both sides of the above equation and defining β₁ ≡ log(γ) and β₂ ≡ δ, we can re-write the model as

    log(μ_i) = β₁ + β₂ t_i,    y_i ~ Poi(μ_i), independent,

which is clearly a GLM with a log link and a model matrix whose ith row is X_i = [1, t_i].
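A model like this can be fitted with R's glm function (covered properly in section 1.3). The sketch below uses made-up illustrative yearly counts, not data from the text:

```r
## illustrative counts only, not the actual Belgian AIDS data
y <- c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
t <- 1:13
aids.mod <- glm(y ~ t, family = poisson(link = "log"))
coef(aids.mod)  ## estimates of beta1 = log(gamma) and beta2 = delta
```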

Example 2: Hen harriers and Grouse

[Figure: daily consumption rate of grouse by hen harriers plotted against grouse density.]

The above plot shows the daily consumption rate of grouse by hen harriers (a type of bird of prey) plotted against the density of grouse on various grouse moors. If c_i denotes consumption rate and d_i is grouse density, then ecological theory suggests a saturating model for the data

    E(c_i) ≡ μ_i = α d_i³ / (δ + d_i³),    c_i ~ Gamma, independent,

where α and δ are parameters to be estimated. Using the inverse link function we get

    1/μ_i = 1/α + (δ/α) (1/d_i³),    c_i ~ Gamma, independent.

So defining new parameters β₁ = 1/α and β₂ = δ/α we get the GLM

    1/μ_i = β₁ + β₂ (1/d_i³),    c_i ~ Gamma, independent,

i.e. the GLM with an inverse link, a gamma distribution for the response, and a model matrix whose ith row is X_i = [1, d_i⁻³].
Example 3: Heart Attacks and Creatine Kinase

The following data are from a study examining the efficacy of blood creatine kinase levels as a diagnostic when patients present with symptoms that may indicate a heart attack.

    CK level          20  60  100  140  180  220  260  300  340  380  420  460
    Heart attack       2  13   30   30   21   19   18   13   19   15    7    8
    Not heart attack  88  26    8    5    0    1    1    1    1    0    0    0

Here is a plot of the proportion of patients who subsequently turned out to have had a heart attack, against their blood CK levels on admission to hospital.
[Figure: observed proportion of heart attacks plotted against blood CK level.]

A convenient model that captures the saturating nature of the relationship between the observed proportions, p_i, and the CK levels, x_i, is the logistic model

    E(p_i) = e^(β₁+β₂x_i) / {1 + e^(β₁+β₂x_i)}.

If y_i is the number of heart attack victims observed out of N_i patients with CK level x_i, then

    μ_i ≡ E(y_i) = N_i E(p_i) = N_i e^(β₁+β₂x_i) / {1 + e^(β₁+β₂x_i)},

and treating the patients as independent we have y_i ~ bin(μ_i/N_i, N_i). This model doesn't look linear, but if we apply the logit link function to both sides it becomes

    log{μ_i / (N_i - μ_i)} = β₁ + β₂ x_i,

which is clearly a GLM.
Example 4: Linear models!
Any linear model is just a special case of a GLM. The link function is the identity link and the
response distribution is Gaussian.

1.2 Inference with GLMs

Inference with GLMs is based on the theory of maximum likelihood estimation. That is, given parameters β, we can write down f(y; β), the probability or probability density function of the response y. Plugging the observed data y_obs into f, and treating it as a function of β, we get the likelihood function for β,

    L(β) = f(y_obs; β).

(It is usual not to distinguish notationally between the observed data and the arguments of the p(d)f, so this will not be done in the rest of these notes.) The idea is that values of β that make the observed data appear relatively probable are more likely to be correct than those that make them appear relatively improbable. Taking this notion to its logical conclusion, the most likely values for the parameters are those that cause the likelihood to be as large as possible: these are the maximum likelihood estimates, β̂. For GLMs the likelihood is actually maximized w.r.t. β by iteratively re-weighted least squares (IRLS), so that successively improved estimates of β are found by fitting working linear models to suitably transformed response data.
The estimates β̂ do not depend on the scale parameter, φ, but when estimates of φ are required (e.g. the variance σ² of the Gaussian) then this is usually done separately, and not by MLE.
As we will see, large sample results turn out to imply that

    β̂ ~ N(β, (XᵀWX)⁻¹ φ),    (1)

where W is a diagonal matrix such that W_ii = V(μ_i)⁻¹ g'(μ_i)⁻². This result can be used to obtain approximate confidence intervals for the elements of β.
Model comparison is done in one of two ways. Let l(β) = log{L(β)}. A hypothesis test that the smaller of two nested models is correct is conducted using the generalized likelihood ratio test result. If β̂₀ are the MLEs of a reduced version of a model with MLEs β̂₁, then if the reduced model is really correct we have that

    2{l(β̂₁) - l(β̂₀)} ~ χ²_{dim(β₁)-dim(β₀)}.

If the quantity on the l.h.s. is too large for consistency with the distribution on the r.h.s., then we would doubt the hypothesis. If φ is unknown then this result is not directly usable, and an F-ratio based generalization is required.
The second way of comparing models is via Akaike's Information Criterion (AIC, which Akaike himself called An Information Criterion). Rather than sticking with the simpler of two models until there is strong evidence that this is untenable, as with hypothesis testing, one instead chooses the model that is estimated to do the best job of predicting new replicate data, to which it was not fitted. Using this approach, whichever model minimizes

    AIC = -2l(β̂) + 2 dim(β)

is selected.
Model checking for GLMs is performed using residuals, in the same way as for linear models. The difference is that the distribution of GLM residuals will depend on the response distribution used, which makes raw residuals difficult to interpret. For this reason residuals are usually standardized, so that they behave somewhat like the residuals from a linear model. For example, a simple standardization is to divide each residual by a quantity proportional to its model-predicted standard deviation, so that all standardized residuals should have the same variance if the model is correct, i.e.

    ε̂_i = (y_i - μ̂_i) / √V(μ̂_i);

these are called Pearson residuals.
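In R (introduced in the next section) such residuals can be extracted directly; as a sketch with simulated data, the hand calculation above matches what residuals(model, type="pearson") returns for a Poisson fit:

```r
## Pearson residuals computed by hand for a Poisson fit (simulated data sketch)
set.seed(2)
x <- runif(40)
y <- rpois(40, exp(1 + x))
m <- glm(y ~ x, family = poisson)
mu <- fitted(m)
pearson <- (y - mu)/sqrt(mu)   ## (y_i - mu_i)/sqrt(V(mu_i)), with V(mu) = mu
all.equal(pearson, residuals(m, type = "pearson"), check.attributes = FALSE)
```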

1.3 glm in R

R has a function called glm for fitting GLMs to data. glm functions much like lm, except that the rhs of
the model formula now defines the way in which the link function of the expected response depends on
the predictor variables. In addition you have to tell glm what response variable distribution to use, and
what link function: this is done using the family argument, as we will see.
To see glm in action, consider the hen harrier data again. The data are stored in a data frame called
harrier. First plot them.

with(harrier,plot(Grouse.Density,Consumption.Rate,ylim=c(0,.4),xlim=c(0,130)))

[Figure: the resulting plot of Consumption.Rate against Grouse.Density.]

(R is available free from http://cran.r-project.org.)

Now fit the model discussed previously:


hm <- glm(Consumption.Rate ~ I(1/Grouse.Density^3),Gamma(link="inverse"),data=harrier)
Notice how the identity function I() is used on the r.h.s. of the model formula in order to use grouse density to the power minus 3 as a predictor: this is necessary because of the special meaning of various ordinary arithmetic operators within model formulae (e.g. + means "and" rather than "the sum of" in a model formula). The second argument to glm specifies the distribution (here Gamma) and the link function to use (here "inverse"). The final argument indicates where to find the columns of data referred to in the formula.
glm returns a fitted model object (which has been stored in hm). Typing its name will cause a small
summary of the object to be displayed.
> hm

Call:  glm(formula = Consumption.Rate ~ I(1/Grouse.Density^3),
    family = Gamma(link = "inverse"), data = harrier)

Coefficients:
          (Intercept)  I(1/Grouse.Density^3)
            4.676e+00              5.386e+05

Degrees of Freedom: 32 Total (i.e. Null);  31 Residual
Null Deviance:      16.17
Residual Deviance: 10.47        AIC: -92.38
As with any model, we should check residuals before proceeding.
par(mfrow=c(2,2)) ## get all plots on one page
plot(hm) ## plot residuals in various ways

[Figure: the four standard glm diagnostic plots for hm: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]

These residual plots are interpreted in much the same way as the equivalent plots for a linear model, with two differences: (i) if the response distribution is not normal/Gaussian then we don't expect the normal QQ plot to show an exact straight line relationship, and (ii) if our response is binary then checking is more difficult, with many plots being effectively useless.
For the harrier model there are some clear patterns in the residuals, with the data for some densities
being entirely above or below the fitted line. For this simple example, it can help to plot model predictions
and data on the same plot. To make predictions at a series of grouse densities, we can use the predict
function.
pd <- data.frame(Grouse.Density=0:130) ## data at which to predict
fv <- predict(hm,pd,type="response")   ## get model predictions
Note the type argument to predict, indicating that we want the predictions to be made on the original
response scale, not the link transformed scale (the default). Now we can produce a plot
par(mfrow=c(1,1))
with(harrier,plot(Grouse.Density,Consumption.Rate,ylim=c(0,.4),xlim=c(0,130)))
lines(0:130,fv)
Actually it would be good to add approximate 95% CIs to the plot. Again predict is used, but this time
standard errors for each prediction are also produced (on the link transformed scale). The following adds
CIs to the plot.

lp <- predict(hm,pd,se=TRUE)
lines(0:130,1/(lp$fit+2*lp$se.fit),lty=2)
lines(0:130,1/(lp$fit-2*lp$se.fit),lty=2)

[Figure: the data with the fitted curve and approximate 95% confidence limits (dashed) overlaid.]

Looking at this plot it is clear why we have patterns in the residuals, and only a very complex dependence of consumption rate on density would cure it. Since the different densities actually relate to different grouse moors, it is probable that there are some other moor-to-moor differences that should really be included in this model, perhaps relating to the alternative hen harrier food available at each moor. Clearly some more work is required here.
Is the cubic dependence on density really the best model for these data? There are various ways of
addressing this question, but one of the simplest is to try alternative powers, and compare the AICs of
the alternative model fits. For example.
> hm1 <- glm(Consumption.Rate~I(1/Grouse.Density^2),Gamma(link="inverse"),data=harrier)
> hm2 <- glm(Consumption.Rate~I(1/Grouse.Density),Gamma(link="inverse"),data=harrier)
> AIC(hm,hm1,hm2)
    df       AIC
hm   3 -92.38289
hm1  3 -94.17016
hm2  3 -92.63887

So AIC actually supports the model with a quadratic dependence on density. Such a model saturates more slowly than the original model, but also has problematic residual plots. Finally, for a larger summary of a model, use the summary function:
> summary(hm1)

Call:
glm(formula = Consumption.Rate ~ I(1/Grouse.Density^2), family = Gamma(link = "inverse"),
    data = harrier)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.05990  -0.43304  -0.01186   0.20319   1.12450

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.627      1.315   1.998 0.054558 .
I(1/Grouse.Density^2) 17674.876   4359.132   4.055 0.000314 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.3044807)

    Null deviance: 16.1657  on 32  degrees of freedom
Residual deviance:  9.9461  on 31  degrees of freedom
AIC: -94.17

Number of Fisher Scoring iterations: 5

The output here is much the same as that for a linear model, except that residual sums of squares are
replaced by deviances, of which more later. Note also that an estimate of the scale parameter of the
gamma distribution is provided: for some other distributions this parameter is just the known constant
1.

1.4 glm in R: heart attack example

Before getting more rigorous about the theory of GLMs it is worth going over one more practical example
in R. Again consider the heart attack data and model from section 1.1. First read in the data and plot
them.
ck <- c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380, 420, 460)
ha <- c(2, 13, 30, 30, 21, 19, 18, 13, 19, 15, 7, 8)
ok <- c(88, 26, 8, 5, 0, 1, 1, 1, 1, 0, 0, 0)
heart <- data.frame(ck=ck,ha=ha,ok=ok)
p <- heart$ha/(heart$ha+heart$ok)
plot(heart$ck,p,xlab="Creatinine kinase level",
     ylab="Proportion Heart Attack")

[Figure: observed proportion of heart attacks plotted against creatine kinase level.]

Recall that our basic model for these data is that, if y_i is the number of heart attack victims out of N_i patients at CK level x_i, then

    y_i ~ bin(μ_i/N_i, N_i),

where E(y_i) ≡ μ_i, g(μ_i) = β₁ + β₂ x_i and g is the logit link,

    g(μ_i) = log{μ_i / (N_i - μ_i)}.
When using binomial models, we need to somehow supply the model fitting function with information
about Ni as well as yi . R offers two ways of doing this with glm.
1. The response variable can be the observed proportion of successful binomial trials, in which case
an array giving the number of trials must be supplied as the weights argument to glm. For binary
data, no weights vector need be supplied, as the default weights of 1 suffice.
2. The response variable can be supplied as a two column array, in which the first column gives the
number of binomial successes, and the second column is the number of binomial failures.
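As an illustrative sketch of the first approach, the totals can be supplied through the weights argument; this should give the same fit as the cbind form used below (heart is rebuilt here so the snippet stands alone):

```r
## method 1 sketch: proportion response with binomial totals as prior weights
heart <- data.frame(ck = c(20,60,100,140,180,220,260,300,340,380,420,460),
                    ha = c(2,13,30,30,21,19,18,13,19,15,7,8),
                    ok = c(88,26,8,5,0,1,1,1,1,0,0,0))
mod.w <- glm(ha/(ha+ok) ~ ck, family = binomial, weights = ha+ok, data = heart)
mod.c <- glm(cbind(ha,ok) ~ ck, family = binomial, data = heart)
all.equal(coef(mod.w), coef(mod.c))   ## the two forms give the same estimates
```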
For the current example the second method will be used. Supplying two arrays on the l.h.s. of the model formula involves using cbind. Here is a glm call which will fit the heart attack model:
> mod.0 <- glm(cbind(ha,ok)~ck,family=binomial(link=logit),
+ data=heart)
or we could have used
mod.0 <- glm(cbind(ha,ok)~ck,family=binomial,data=heart)
since the logit link is the default for the binomial. Here is the default information printed about the
model:
> mod.0

Call:  glm(formula = cbind(ha, ok) ~ ck, family = binomial, data = heart)

Coefficients:
(Intercept)           ck
   -2.75834      0.03124

Degrees of Freedom: 11 Total (i.e. Null);  10 Residual
Null Deviance:      271.7
Residual Deviance: 36.93        AIC: 62.33
The deviance of a fitted model is defined as

    D(β̂) = 2{l_max - l(β̂)},

where l_max is the largest value that the likelihood could take for the data being fitted (which is the maximized likelihood for a model with one parameter per datum). The deviance is defined in this way so that it behaves a little bit like the residual sum of squares for a linear model. We'll cover deviance in more depth later. For the moment note that for distributions for which φ = 1, the deviance often approximately follows a χ²_{n-dim(β)} distribution if the model is correct (although the approximation is not great).
In the output, the Null deviance is the deviance for a model with just a constant term, while the Residual deviance is the deviance of the fitted model. These can be combined to give the proportion deviance explained, a generalization of r², as follows:

> (271.7-36.93)/271.7
[1] 0.864078

AIC is the Akaike Information Criterion for the model (it could also have been extracted using AIC(mod.0)).
Notice that the deviance is quite high for the χ²₁₀ random variable that it should approximate if the model is fitting well. In fact

> 1-pchisq(36.93,10)
[1] 5.819325e-05

shows that there is a very small probability of a χ²₁₀ random variable being as large as 36.93. The residual plots also suggest a poor fit.
> op <- par(mfrow=c(2,2))
> plot(mod.0)

[Figure: the four standard glm diagnostic plots for mod.0: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]

Again, the plots have much the same interpretation as the model checking plots for an ordinary linear model, except that it is now standardized residuals that are plotted (actually deviance residuals, of which more later), the predicted values are on the scale of the linear predictor rather than the response, and some departure from a straight line relationship in the Normal QQ plot is often to be expected. The plots are not easy to interpret when there are so few data, but there appears to be a trend in the mean of the residuals plotted against fitted value, which would cause concern. Furthermore, the first point has very high influence. Note that the interpretation of the residuals would be much more difficult for binary data (see later).
Notice how the problems do not stand out so clearly from a plot of the fitted values overlaid on the raw estimated probabilities:

> plot(heart$ck,p,xlab="Creatinine kinase level",
+      ylab="Proportion Heart Attack")
> lines(heart$ck,fitted(mod.0))

[Figure: observed proportions and the fitted curve from mod.0 against CK level.]

Note also that the fitted values provided by glm for binomial models are the estimated p_i's, rather than the estimated μ_i's.
The trend in the mean of the residuals suggests trying a cubic linear predictor, rather than the initial
straight line.
> mod.2 <- glm(cbind(ha,ok)~ck+I(ck^2)+I(ck^3),family=binomial,
+ data=heart)
> mod.2
Call:  glm(formula = cbind(ha, ok) ~ ck + I(ck^2) + I(ck^3),
    family = binomial, data = heart)

Coefficients:
(Intercept)           ck      I(ck^2)      I(ck^3)
 -5.786e+00    1.102e-01   -4.648e-04    6.448e-07

Degrees of Freedom: 11 Total (i.e. Null);  8 Residual
Null Deviance:      271.7
Residual Deviance: 4.252        AIC: 33.66

> par(mfrow=c(2,2))
> plot(mod.2)
> par(mfrow=c(2,2))
> plot(mod.2)
[Figure: the four standard glm diagnostic plots for mod.2: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]

Clearly 4.252 is not too large for consistency with a χ²₈ distribution (it is less than the expected value, in fact) and the AIC has improved substantially. The residual plots now show less clear patterns than for the previous model, although if we had more data then such a departure from constant variance would be a cause for concern. Furthermore the fit is clearly closer to the data now:

par(mfrow=c(1,1))
plot(heart$ck,p,xlab="Creatinine kinase level",
     ylab="Proportion Heart Attack")
lines(heart$ck,fitted(mod.2))

[Figure: observed proportions and the fitted curve from mod.2 against CK level.]

We can also get R to test the null hypothesis that mod.0 is correct against the alternative that mod.2
is required. Somewhat confusingly the anova function is used to do this, although it is a generalized
likelihood ratio test that is being performed, and not an analysis of variance.
> anova(mod.0,mod.2,test="Chisq")
Analysis of Deviance Table

Model 1: cbind(ha, ok) ~ ck
Model 2: cbind(ha, ok) ~ ck + I(ck^2) + I(ck^3)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        10     36.929
2         8      4.252  2   32.676 8.025e-08

A p-value this low indicates very strong evidence against the null hypothesis: we really do need model
2. Note that this comparison of models has a much firmer theoretical basis than the examination of the
individual deviances had.
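The p-value in the table is just a χ² tail probability applied to the change in deviance (36.929 - 4.252, which the table rounds to 32.676), so it can be reproduced by hand:

```r
## reproduce the table's p-value: chi-squared tail probability on 2 df
1 - pchisq(32.676, df = 2)
```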

2 Theory of GLMs

This section will cover the theory of GLMs in more depth, filling in the details of the framework outlined
in section 1.2. The section starts by reviewing key results for exponential family distributions, then covers
model estimation, before covering variance estimation, model comparison etc.

2.1 The exponential family of distributions

The response variable in a GLM can have any distribution from the exponential family. A distribution belongs to the exponential family of distributions if its probability density function, or probability mass function, can be written as

    f_θ(y) = exp[{yθ - b(θ)}/a(φ) + c(y, φ)],

where b, a and c are arbitrary functions, φ an arbitrary scale parameter, and θ is known as the canonical parameter of the distribution (in the GLM context, θ will completely depend on the model parameters β, but it is not necessary to make this explicit yet).

For example, it is easy to see that the normal distribution is a member of the exponential family, since

    f(y) = {1/(σ√(2π))} exp{-(y - μ)²/(2σ²)}
         = exp{(-y² + 2μy - μ²)/(2σ²) - log(σ√(2π))}
         = exp{(yμ - μ²/2)/σ² - y²/(2σ²) - log(σ√(2π))},

which is of exponential family form, with θ = μ, b(θ) = θ²/2 ≡ μ²/2, a(φ) = φ = σ² and c(φ, y) = -y²/(2φ) - log(σ√(2π)).
It is possible to obtain general expressions for the mean and variance of exponential family distributions, in terms of a, b and φ. The log likelihood of θ, given a particular y, is simply log[f_θ(y)] considered as a function of θ. That is,

    l(θ) = {yθ - b(θ)}/a(φ) + c(y, φ),

and so

    ∂l/∂θ = {y - b'(θ)}/a(φ).

Treating l as a random variable, by replacing the particular observation y by the random variable Y, enables the expected value of ∂l/∂θ to be evaluated:

    E(∂l/∂θ) = {E(Y) - b'(θ)}/a(φ).

Using the general result that E(∂l/∂θ) = 0 (at the true value of θ) and re-arranging implies that

    E(Y) = b'(θ).    (2)

i.e. the mean of any exponential family random variable is given by the first derivative of b w.r.t. θ, where the form of b depends on the particular distribution. This equation is the key to linking the model parameters, β, of a GLM to the canonical parameters of the exponential family. In a GLM, the parameters β determine the mean of the response variable, and, via (2), they thereby determine the canonical parameter for each response observation.
Differentiating the likelihood once more yields

    ∂²l/∂θ² = -b''(θ)/a(φ),

and plugging this into the general result E(∂²l/∂θ²) = -E[(∂l/∂θ)²] (the derivatives are evaluated at the true θ value) gives

    -b''(θ)/a(φ) = -E[{Y - b'(θ)}²]/a(φ)²,

which re-arranges to the second useful general result:

    var(Y) = b''(θ) a(φ).

a could in principle be any function of φ, and when working with GLMs there is no difficulty in handling any form of a, if φ is known. However, when φ is unknown matters become awkward, unless we can write a(φ) = φ/ω, where ω is a known constant. This restricted form in fact covers all the cases of practical interest here. a(φ) = φ/ω allows the possibility of, for example, unequal variances in models based on the normal distribution, but in most cases ω is simply 1. Hence we now have

    var(Y) = b''(θ) φ/ω.    (3)

In subsequent sections it will often be convenient to consider var(Y) as a function of μ ≡ E(Y), and, since μ and θ are linked via (2), we can always define a variance function V(μ) = b''(θ)/ω, such that

    var(Y) = V(μ) φ.
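As a worked example of these results (not in the original text), take the Poisson distribution:

```latex
f(y) = \frac{\mu^y e^{-\mu}}{y!} = \exp\{y\log\mu - \mu - \log y!\},
```

so θ = log μ, b(θ) = e^θ, a(φ) = φ = 1 (so ω = 1) and c(φ, y) = -log y!. Then (2) gives E(Y) = b'(θ) = e^θ = μ, and (3) gives var(Y) = b''(θ)φ/ω = e^θ = μ, so that V(μ) = μ, as expected for the Poisson.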

2.2 Fitting generalized linear models

Recall that a GLM models an n-vector of independent response variables, Y, where μ ≡ E(Y), via

    g(μ_i) = X_i β  and  Y_i ~ f_{θ_i}(y_i),

where f_{θ_i}(y_i) indicates an exponential family distribution with canonical parameter θ_i, which is determined by μ_i (via equation (2)) and hence ultimately by β. Given a vector y, an observation of Y, maximum likelihood estimation of β is possible. Since the Y_i are mutually independent, the likelihood of β is

    L(β) = ∏_{i=1}^n f_{θ_i}(y_i),

and hence the log likelihood of β is

    l(β) = Σ_{i=1}^n log[f_{θ_i}(y_i)] = Σ_{i=1}^n {y_i θ_i - b_i(θ_i)}/a_i(φ) + c_i(φ, y_i),

where the dependence of the right hand side on β is through the dependence of the θ_i on β. Notice that the functions a, b and c may vary with i; this allows different binomial denominators, n_i, for each observation of a binomial response, or different (but known to within a constant) variances for normal responses, for example. φ, on the other hand, is assumed to be the same for all i. As discussed in the previous section, for practical work it suffices to consider only cases where we can write a_i(φ) = φ/ω_i, where ω_i is a known constant (usually 1), in which case

    l(β) = Σ_{i=1}^n ω_i {y_i θ_i - b_i(θ_i)}/φ + c_i(φ, y_i).

Maximization proceeds by partially differentiating l w.r.t. each element of β, setting the resulting expressions to zero and solving for β. Differentiating,

    ∂l/∂β_j = (1/φ) Σ_{i=1}^n ω_i {y_i ∂θ_i/∂β_j - b_i'(θ_i) ∂θ_i/∂β_j},

and by the chain rule

    ∂θ_i/∂β_j = (∂θ_i/∂μ_i)(∂μ_i/∂β_j),

so that, differentiating (2), we get

    ∂μ_i/∂θ_i = b_i''(θ_i),  i.e.  ∂θ_i/∂μ_i = 1/b_i''(θ_i),

which then implies that

    ∂l/∂β_j = (1/φ) Σ_{i=1}^n [y_i - b_i'(θ_i)]/{b_i''(θ_i)/ω_i} ∂μ_i/∂β_j.

Substituting from (2) and (3) into this last equation implies that the equations to solve for β are

    Σ_{i=1}^n (y_i - μ_i)/V(μ_i) ∂μ_i/∂β_j = 0  ∀ j.    (4)

However, these equations are exactly the equations that would have to be solved in order to find β by non-linear weighted least squares, if the weights V(μ_i) were known in advance and were independent of β. In this case the least squares objective would be

    S = Σ_{i=1}^n (y_i - μ_i)²/V(μ_i),    (5)

where μ_i depends non-linearly on β, but the weights V(μ_i) are treated as fixed. To find the least squares estimates involves solving ∂S/∂β_j = 0 ∀ j, but this system of equations is easily seen to be (4) when the V(μ_i) terms are treated as fixed.
This correspondence suggests a fitting method. Iterate the following two steps to convergence:

1. Given the current μ̂_i estimates, evaluate the V(μ̂_i) values.
2. Find a value of β which reduces

       Σ_i (y_i - μ_i)²/V(μ̂_i)

   (the dependence on β is through μ, but not through V(μ̂)). Let this improved parameter vector be denoted β̂, and use it to update μ̂.

At convergence β̂ must satisfy (4).
To implement this method we need to be able to find the required improved parameter vectors at step 2. To do this, just replace μ_i by its first order Taylor expansion around μ̂_i, so that

    y_i - μ_i ≈ y_i - μ̂_i - Σ_j (∂μ_i/∂β_j)(β_j - β̂_j),

with exact equality at β = β̂ (derivatives evaluated at the current β̂). Now, writing the linear predictor as η_i = X_i β, we have

    ∂μ_i/∂β_j = (dμ_i/dη_i)(∂η_i/∂β_j) = X_ij/g'(μ_i).

Hence

    Σ_i (y_i - μ_i)²/V(μ̂_i) ≈ Σ_i {g'(μ̂_i)(y_i - μ̂_i) + η̂_i - X_i β}²/{g'(μ̂_i)² V(μ̂_i)}    (6)
                             = Σ_i w_i (z_i - X_i β)²,    (7)

where z_i = g'(μ̂_i)(y_i - μ̂_i) + η̂_i and w_i = {g'(μ̂_i)² V(μ̂_i)}⁻¹. But (7) is just a weighted linear least squares problem, which is easily minimized w.r.t. β using standard least squares methods, making it easy to find an improved β̂.

Hence we arrive at the following GLM fitting algorithm. Iterate the following to convergence:

1. Given the current η̂ and μ̂ estimates, calculate pseudodata z and weights w, as defined above.
2. Minimize

       Σ_i w_i (z_i - X_i β)²

   w.r.t. β to obtain an improved estimate β̂.
3. Evaluate a new linear predictor estimate η̂ = Xβ̂ and new fitted values μ̂_i = g⁻¹(η̂_i).

If, as is usual, the method converges to a fixed β̂, then this must satisfy (4), and β̂ is hence the MLE of β. The iteration can be started by setting μ̂ = y (with modification to avoid e.g. log(0)). The method is known as Iteratively Re-weighted Least Squares (IRLS).
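The algorithm above is easy to sketch directly. The following minimal illustration (a sketch, not production code; the function name irls is made up) fits a Poisson log-link GLM, for which V(μ) = μ and g'(μ) = 1/μ, and compares the result with glm on simulated data:

```r
## minimal IRLS sketch for a Poisson GLM with log link: V(mu)=mu, g'(mu)=1/mu
irls <- function(X, y, niter = 20) {
  mu  <- y + 0.1               ## start at mu = y, nudged to avoid log(0)
  eta <- log(mu)
  for (it in 1:niter) {
    z <- (y - mu)/mu + eta     ## pseudodata: g'(mu)(y - mu) + eta
    w <- mu                    ## weights: 1/{g'(mu)^2 V(mu)} = mu
    beta <- qr.coef(qr(sqrt(w) * X), sqrt(w) * z)  ## weighted LS step
    eta <- drop(X %*% beta)
    mu  <- exp(eta)            ## mu = g^{-1}(eta)
  }
  beta
}

set.seed(1)
x <- runif(50)
X <- cbind(1, x)
y <- rpois(50, exp(0.5 + 1.2*x))
irls(X, y)
coef(glm(y ~ x, family = poisson))  ## should agree to several decimal places
```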

2.3 The sampling distribution of $\hat\beta$

There is a general maximum likelihood estimation result that if $\hat\beta$ is an MLE (and some technical conditions are met) then
$$\hat\beta \sim N(\beta, \mathcal{I}^{-1}),$$
where $\mathcal{I}$ is the information matrix, with elements $\mathcal{I}_{ij} = E(\partial l/\partial\beta_i\,\partial l/\partial\beta_j)$. The result is exact in the large sample limit, or for the normal response, identity link case. To use this result we need to find $\mathcal{I}$ for a GLM.

First define the vector $\mathbf{u}$ such that $u_j = \partial l/\partial\beta_j$. Then $\mathcal{I} = E(\mathbf{u}\mathbf{u}^T)$. From results already established, we have that
$$u_j = \frac{\partial l}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^n \frac{X_{ij}(y_i - \mu_i)}{V(\mu_i)g'(\mu_i)}.$$
If we define diagonal matrices $\mathbf{G}$ and $\mathbf{V}$, where $G_{ii} = g'(\mu_i)$ and $V_{ii} = V(\mu_i)$, then this last result becomes
$$\mathbf{u} = \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}(\mathbf{y} - \mu)/\phi.$$
Hence,
$$E(\mathbf{u}\mathbf{u}^T) = \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}E[(\mathbf{Y} - \mu)(\mathbf{Y} - \mu)^T]\mathbf{V}^{-1}\mathbf{G}^{-1}\mathbf{X}/\phi^2$$
$$= \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}\mathbf{V}\mathbf{V}^{-1}\mathbf{G}^{-1}\mathbf{X}/\phi$$
$$= \mathbf{X}^T\mathbf{W}\mathbf{X}/\phi,$$
since $E[(\mathbf{Y} - \mu)(\mathbf{Y} - \mu)^T] = \mathbf{V}\phi$ and $\mathbf{W} = \mathbf{V}^{-1}\mathbf{G}^{-2}$.

So we end up with
$$\hat\beta \sim N(\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\phi). \qquad (8)$$
For distributions with known scale parameter, , this result can be used directly to find confidence
intervals for the parameters, but if the scale parameter is unknown (e.g. for the normal distribution),
then it must be estimated, and intervals must be based on an appropriate t distribution. Scale parameter
estimation is covered later.
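Result (8) can be checked numerically. For a fitted glm object the converged IRLS weights are stored in the weights component of the fit, so $\phi(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}$ can be assembled by hand and compared with what vcov() returns. A sketch on simulated Poisson data, where $\phi = 1$ is known:

```r
# Verify that vcov(m) equals phi * (X'WX)^{-1} with W the converged IRLS weights.
set.seed(2)
x <- runif(40)
y <- rpois(40, exp(0.5 + x))
m <- glm(y ~ x, family = poisson)

X <- model.matrix(m)
w <- m$weights                       # converged IRLS working weights w_i
phi <- 1                             # scale parameter known for the Poisson
Vb <- phi * solve(t(X) %*% (w * X))
all.equal(Vb, vcov(m), check.attributes = FALSE, tolerance = 1e-6)
```

The two matrices agree to numerical precision, because vcov() is computed from exactly this quantity at convergence.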

2.4 Comparing models by hypothesis testing

Consider testing
$$H_0: g(\mu) = \mathbf{X}_0\beta_0$$
against
$$H_1: g(\mu) = \mathbf{X}_1\beta_1,$$
where $\mu$ is the expectation of a response vector, $\mathbf{Y}$, whose elements are independent random variables from the same member of the exponential family of distributions, and where $\mathbf{X}_0 \subset \mathbf{X}_1$. If we have an observation, $\mathbf{y}$, of the response vector, then a generalized likelihood ratio test can be performed. Let $l(\hat\beta_0)$ and $l(\hat\beta_1)$ be the maximized log-likelihoods of the two models. If $H_0$ is true then in the large sample limit,
$$2[l(\hat\beta_1) - l(\hat\beta_0)] \sim \chi^2_{p_1 - p_0}, \qquad (9)$$
where $p_i$ is the number of (identifiable) parameters ($\beta_i$) in model $i$. If the null hypothesis is false, then model 1 will tend to have a substantially higher likelihood than model 0, so that twice the difference in log-likelihoods would be too large for consistency with the relevant $\chi^2$ distribution.

The approximate result (9) is only directly useful if the log-likelihoods of the models concerned can be calculated. In the case of GLMs estimated by IRLS, this is only the case if the scale parameter, $\phi$, is known. Hence the result can be used directly with Poisson and binomial models, but not with the normal, gamma or inverse Gaussian distributions, where the scale parameter is not known. What to do in these latter cases will be discussed shortly.

(Of course, for the normal distribution and identity link we use the results of chapter 1.)


2.4.1 Deviance

When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar way to the residual sum of squares in ordinary linear modelling. This quantity is the deviance of the model, defined as
$$D = 2[l(\hat\beta_{\max}) - l(\hat\beta)]\phi \qquad (10)$$
$$= \sum_{i=1}^n 2\omega_i\left[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)\right], \qquad (11)$$
where $l(\hat\beta_{\max})$ indicates the maximized log-likelihood of the saturated model: the model with one parameter per data point. $l(\hat\beta_{\max})$ is the highest value that the log-likelihood could possibly have, given the data, and is evaluated by simply setting $\hat\mu = \mathbf{y}$ and evaluating the log-likelihood. $\tilde\theta$ and $\hat\theta$ denote the maximum likelihood estimates of the canonical parameters, for the saturated model and the model of interest, respectively. Notice how the deviance is defined to be independent of $\phi$.

Related to the deviance is the scaled deviance,
$$D^* = D/\phi,$$
which does depend on the scale parameter. For the binomial and Poisson distributions, where $\phi = 1$, the deviance and scaled deviance are the same, but this is not the case more generally.
By the generalized likelihood ratio test result (9), we might expect that, if the model is correct, then approximately
$$D^* \sim \chi^2_{n-p} \qquad (12)$$
in the large sample limit. Actually such an argument is bogus, since the limiting argument justifying (9) relies on the number of parameters in the model staying fixed while the sample size tends to infinity, but the saturated model has as many parameters as data. Asymptotic results are available for some exponential family distributions to justify (12) as a large sample approximation under many circumstances, and it is exact in the normal case. Note, however, that it breaks down entirely for the binomial with binary data.

Given the definition of deviance, it is easy to see that the log likelihood ratio statistic in (9) can be re-expressed as $D_0^* - D_1^*$. So under $H_0$,
$$D_0^* - D_1^* \sim \chi^2_{p_1 - p_0} \qquad (13)$$
(in the large sample limit), where $D_i^*$ is the scaled deviance of model $i$, which has $p_i$ identifiable parameters. But again, this is only useful if the scale parameter is known, so that $D^*$ can be calculated.
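As a sketch of how (13) is used in practice, on simulated Poisson data (so $\phi = 1$ is known): the drop in deviance between two nested fits is referred to a $\chi^2$ distribution with $p_1 - p_0$ degrees of freedom, which is what anova(..., test="Chisq") automates.

```r
# Deviance-based GLRT by hand: D0 - D1 compared with chi-squared on p1 - p0 df.
set.seed(3)
x <- runif(60); z <- runif(60)
y <- rpois(60, exp(1 + x))                  # z has no real effect here
m0 <- glm(y ~ x, family = poisson)          # null model
m1 <- glm(y ~ x + z, family = poisson)      # alternative, m0 nested in m1

stat <- deviance(m0) - deviance(m1)         # D0 - D1 (phi = 1, so also scaled)
df <- m0$df.residual - m1$df.residual       # p1 - p0
pchisq(stat, df, lower.tail = FALSE)        # p-value, as anova() would report
```

Since z is unrelated to the response here, the p-value will typically be unremarkable, while repeating the comparison for x against an intercept-only model gives a tiny p-value.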
2.4.2 Model comparison with unknown $\phi$

Under $H_0$ we have the approximate results
$$D_0^* - D_1^* \sim \chi^2_{p_1 - p_0} \quad\text{and}\quad D_1^* \sim \chi^2_{n - p_1},$$
and, if $D_0^* - D_1^*$ and $D_1^*$ are treated as asymptotically independent, this implies that
$$F = \frac{(D_0^* - D_1^*)/(p_1 - p_0)}{D_1^*/(n - p_1)} \sim F_{p_1 - p_0,\,n - p_1}$$
in the large sample limit (a result which is exactly true in the ordinary linear model special case, of course). The useful property of $F$ is that it can be calculated without knowing $\phi$, which cancels from the top and bottom of the ratio, yielding, under $H_0$, the approximate result
$$F = \frac{(D_0 - D_1)/(p_1 - p_0)}{D_1/(n - p_1)} \sim F_{p_1 - p_0,\,n - p_1}. \qquad (14)$$
The advantage of this result is that it can be used for hypothesis testing based model comparison when $\phi$ is unknown. The disadvantages are the dubious distributional assumption for $D_1^*$, and the independence approximation, on which it is based.

2.5 $\hat\phi$ and Pearson's statistic

As we have seen, the MLEs of the parameters, $\beta$, can be obtained without knowing the scale parameter, $\phi$, but, in those cases in which this parameter is unknown, it must usually be estimated. Approximate result (12) provides one obvious estimator. The expected value of a $\chi^2_{n-p}$ random variable is $n - p$, so equating the observed $D^* = D/\phi$ to its approximate expected value and re-arranging, we get
$$\hat\phi_D = D/(n - p). \qquad (15)$$
A second estimator is based on the Pearson statistic, which is defined as
$$X^2 = \sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}.$$
Clearly $X^2/\phi$ would be the sum of squares of a set of zero mean, unit variance, random variables, having $n - p$ degrees of freedom, suggesting that if the model is adequate then approximately $X^2/\phi \sim \chi^2_{n-p}$: this approximation turns out to be well founded. Setting the observed Pearson statistic to its expected value and re-arranging yields
$$\hat\phi = X^2/(n - p).$$
Note that it is straightforward to show that
$$X^2 = \sum_{i=1}^n w_i(z_i - \mathbf{X}_i\hat\beta)^2,$$
where $w_i$ and $z_i$ are the IRLS weights and pseudodata, evaluated at convergence.
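The two estimators of $\phi$ are easy to compare for a fitted model. A sketch on simulated gamma data (true $\phi = 0.5$ here); note that summary.glm uses the Pearson estimator:

```r
# Deviance-based estimator (15) vs the Pearson estimator of phi for a gamma GLM.
set.seed(4)
x <- runif(50)
mu <- exp(1 + x)
y <- rgamma(50, shape = 2, scale = mu / 2)   # gamma response with phi = 1/2
m <- glm(y ~ x, family = Gamma(link = "log"))

phi_dev <- deviance(m) / df.residual(m)                          # D/(n - p)
phi_pear <- sum(residuals(m, type = "pearson")^2) / df.residual(m) # X^2/(n - p)
c(phi_dev, phi_pear, summary(m)$dispersion)  # Pearson estimate matches summary()
```

Both estimates should be close to 0.5, and the Pearson one coincides exactly with the dispersion reported by summary().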

2.6 Residuals and model checking

We have now assembled the basic theory required for inference with GLMs, but before using the distributional results for inference, it is always necessary to check that the model meets its assumptions well
enough that the results are likely to be valid. Model checking is perhaps the most important part of
applied statistical modelling. In the case of ordinary linear models, this is based on examination of the
model residuals, which contain all the information in the data, not explained by the systematic part of
the model. Examination of residuals is also the chief means for model checking in the case of GLMs, but
in this case the standardization of residuals is both necessary and a little more difficult.
For GLMs the main reason for not simply examining the raw residuals, $\hat\epsilon_i = y_i - \hat\mu_i$, is the difficulty of checking the validity of the assumed mean-variance relationship from them. For example, if a Poisson model is employed, then the variance of the residuals should increase in direct proportion to the size of the fitted values ($\hat\mu_i$). However, if raw residuals are plotted against fitted values it takes an extraordinary ability to judge whether the residual variability is increasing in proportion to the mean, as opposed to, say, the square root or square of the mean. For this reason it is usual to standardize GLM residuals in such a way that, if the model assumptions are correct, the standardized residuals should have approximately equal variance, and behave, as far as possible, like residuals from an ordinary linear model.
2.6.1 Pearson residuals

The most obvious way to standardize the residuals is to divide them by a quantity proportional to their standard deviation according to the fitted model. This gives rise to the Pearson residuals
$$\hat\epsilon^p_i = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}},$$
which should have approximately zero mean and variance $\phi$, if the model is correct. These residuals should not display any trend in mean or variance when plotted against the fitted values, or any covariates (whether included in the model or not). The name Pearson residuals relates to the fact that the sum of squares of the Pearson residuals gives the Pearson statistic discussed in section 2.5.

(Recall that if $\{Z_i : i = 1 \ldots n\}$ are a set of i.i.d. $N(0,1)$ r.v.s then $\sum_i Z_i^2 \sim \chi^2_n$.)

Note that the Pearson residuals are the residuals of the working linear model from the converged IRLS method, divided by the square roots of the converged IRLS weights.
2.6.2 Deviance residuals

In practice the distribution of the Pearson residuals can be quite asymmetric around zero, so that their
behaviour is not as close to ordinary linear model residuals as might be hoped for. The deviance residuals
are often preferable in this respect. The deviance residuals are arrived at by noting that the deviance
plays much the same role for GLMs that the residual sum of squares plays for ordinary linear models:
indeed for an ordinary linear model the deviance is the residual sum of squares. In the ordinary linear
model case, the deviance is made up of the sum of the squared residuals. That is the residuals are the
square roots of the components of the deviance with the appropriate sign attached.
So, writing $d_i$ as the component of the deviance contributed by the $i$th datum (i.e. the $i$th term in the summation in (11)), we have
$$D = \sum_{i=1}^n d_i,$$
and, by analogy with the ordinary linear model, we can define the deviance residuals
$$\hat\epsilon^d_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i}.$$
As required, the sum of squares of these deviance residuals gives the deviance itself.

Now if the deviance were calculated for a model where all the parameters were known, then (12) would become $D/\phi \sim \chi^2_n$, and this might suggest that for a single datum $d_i/\phi \sim \chi^2_1$, implying that $\hat\epsilon^d_i \sim N(0, \phi)$. Of course (12) cannot reasonably be applied to a single datum, but nonetheless it suggests that we might expect the deviance residuals to behave something like $N(0, \phi)$ random variables, for a well fitting model, especially in cases for which (12) is expected to be a reasonable approximation.
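Both kinds of residuals are available from residuals() for a fitted glm object, and the defining property of the deviance residuals is easily verified. A sketch on simulated Poisson data:

```r
# Deviance residuals: their squares sum to the deviance of the model.
set.seed(5)
x <- runif(50)
y <- rpois(50, exp(1 + x))
m <- glm(y ~ x, family = poisson)

rd <- residuals(m, type = "deviance")   # deviance residuals
rp <- residuals(m, type = "pearson")    # Pearson residuals, for comparison
c(sum(rd^2), deviance(m))               # these two are equal
```

For a skewed response the deviance residuals will usually look more symmetric around zero than the Pearson residuals, which is why they are often preferred for plotting.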
2.6.3 Residual plots

Once you have standardized residuals you should plot them to try and find evidence that the model assumptions are not met. The main useful plots are:

- Standardized residuals against fitted values. A trend in the mean of the residuals violates the independence assumption and often implies that something is wrong with the model for the mean of the response: perhaps a missing dependence, or the wrong link function. A trend in the variability of the residuals is diagnostic of a problem with the assumed mean-variance relationship, i.e. with the assumed response distribution.
- Standardized residuals against all potential predictor variables (selected or omitted from the model). Trends in the mean of the residuals can be very useful for pinpointing missing dependencies of the mean response on the predictors.
- Normal QQ plots can be useful for highlighting problems with the distributional assumptions, in cases where the response distribution can be well approximated by a normal distribution (with appropriate non-constant variance). For example, Poisson residuals for a response with a fairly high mean fall into this category.
- Plots of standardized residuals against leverage are useful for highlighting single points that have a very high influence on the model fitting. Leverage is a measure of how influential a data point could be, based on the distance of its predictor variables from the predictors of other data.


All plots are useful for spotting potential outliers: points which do not fit well with the pattern of
the rest of the data, and deserve special attention, to check that they are not somehow erroneous, or
that they are not telling you something important about the system that the data relate to. Note that R
always labels the three largest outliers in a residual plot with their data frame row numbers. Of course
the fact that they are labeled does not in itself mean that they are problematic.
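The plots listed above can be produced along the following lines for any fitted glm object m. Here m is fitted to simulated data purely for illustration, and the plots are written to a temporary file so that the sketch runs non-interactively; plot(m) produces a similar default set.

```r
# The standard residual checks for a fitted GLM, using standardized residuals.
set.seed(6)
x <- runif(100)
y <- rpois(100, exp(1 + x))
m <- glm(y ~ x, family = poisson)

pdf(tempfile(fileext = ".pdf"))              # send plots somewhere harmless
par(mfrow = c(2, 2))
plot(fitted(m), rstandard(m), xlab = "fitted values",
     ylab = "standardized residuals")        # trends in mean or variance?
plot(x, rstandard(m), xlab = "predictor x",
     ylab = "standardized residuals")        # missing dependence on a predictor?
qqnorm(rstandard(m))                         # distributional check
plot(hatvalues(m), rstandard(m), xlab = "leverage",
     ylab = "standardized residuals")        # influential points?
dev.off()
```

rstandard() gives deviance residuals further standardized for leverage, so a healthy model shows a structureless band of roughly constant width in the first two plots.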

3 Linking computation and theory

To use the theoretical results effectively you need to be able to specify any GLM you want to fit in R, and to extract the quantities required by the theory from R output of various sorts.

3.1 Model formulae and the specification of GLMs

Specification of the response distribution and link function is via the family argument of glm and the
examples already covered provide sufficient illustration of this. Specification of the structure of the linear
predictor is more involved, and now that we have covered a number of examples, a more formal discussion
of model formulae is appropriate.
The main components of a formula are all present in the following example:

y ~ a*b + x:z + offset(v) - 1

Note the following:

- ~ separates the response variable, on the left, from the linear predictor, on the right. So in the example y is the response and a, b, x, z and v are the predictors.
- + indicates that the response depends on what is to the left of + and what is to the right of it. Hence within formulae + should be thought of as "and" rather than "the sum of".
- : indicates that the response depends on the interaction of the variables to the left and right of :. Interactions are obtained by forming the element-wise products of all model matrix columns corresponding to the two terms that are interacting, and appending the resulting columns to the model matrix (although, of course, some identifiability constraints may be required).
- * means that the response depends on whatever is to the left of * and whatever is to the right of it and their interaction, i.e. a*b is just a shorter way of writing a + b + a:b.
- offset(v) indicates that a column, specified by v, should be included in the model matrix, whose corresponding parameter has the known value 1.
- -1 means that the default intercept term should not be included in the model. Note that, for models involving factor variables, this often has no real impact on the model structure, but simply reduces the number of identifiability constraints by one, while changing the interpretation of some parameters.
Because of the way that some symbols change their usual meaning in model formulae, it is necessary to take special measures if the usual meaning is to be restored to arithmetic operations within a formula. This is accomplished by using the identity function I(), which simply evaluates its argument and returns it. For example, if we wanted to fit the model
$$y_i = \beta_0 + \beta_1(x_i + z_i) + \beta_2 v_i + \epsilon_i$$
then we could do so using the model formula

y ~ I(x+z) + v
(See section 6.3 of the MA20035 notes for a reminder of what interactions of factor variables are.)


Note that there is no need to protect arithmetic operations within arguments to other functions in this way. For example,
$$y_i = \beta_0 + \beta_1\log(x_i + z_i) + \beta_2 v_i + \epsilon_i$$
would be fitted correctly by

y ~ log(x+z) + v
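model.matrix() makes it easy to see exactly what columns a formula generates, including the effect of I(). A small sketch with toy vectors:

```r
# How formula symbols translate into model matrix columns.
x <- c(1, 2, 3)
z <- c(4, 5, 6)

colnames(model.matrix(~ x + z))      # intercept plus one column each for x and z
colnames(model.matrix(~ I(x + z)))   # intercept plus a single column holding x + z
colnames(model.matrix(~ x:z))        # intercept plus the element-wise product x*z
model.matrix(~ I(x + z))[, 2]        # the summed column: 5 7 9
```

So inside I() the + recovers its arithmetic meaning, while outside it requests two separate model terms.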
3.1.1 An example

Consider a study looking at the relationship between smoking, drinking and blood pressure. A group of adult male patients was selected randomly from a GP practice. Each patient had their blood pressure, $y_i$, measured, along with their smoker status (smoker or non-smoker), their alcohol consumption rate (none, within recommended limits, or heavy) and their age in years, $x_i$. The following initial model was proposed:
$$E(y_i) = \alpha + \beta_j + \gamma_k + \delta_{jk} + \lambda_j x_i \quad\text{if patient } i \text{ is in smoker class } j,\text{ drinker class } k,$$
with $y_i$ gamma distributed. So the model has main effects for smoking and drinking, an interaction of these factors, and a separate linear dependence on age for smokers and non-smokers (an interaction of age and smoking). Remember that variables are said to interact when the effect of one predictor itself depends on the value of another predictor.
The model is a GLM with an identity link and gamma distribution. Suppose that we have the following predictor variables for 10 patients:

age, x: 44 38 39 41 44 37 44 44 42 41
smoke:   F  F  F  F  F  T  T  T  T  T
drink:   1  2  3  1  2  3  1  2  3  1

Note that smoke and drink are factor variables here, while age is a continuous predictor. Here is the corresponding linear predictor (which gives $E(y_i)$ directly in this case), in identifiable form:
$$\begin{pmatrix}\mu_1\\\mu_2\\\mu_3\\\mu_4\\\mu_5\\\mu_6\\\mu_7\\\mu_8\\\mu_9\\\mu_{10}\end{pmatrix} =
\begin{pmatrix}
1&0&0&0&0&0&44&0\\
1&0&1&0&0&0&38&0\\
1&0&0&1&0&0&39&0\\
1&0&0&0&0&0&41&0\\
1&0&1&0&0&0&44&0\\
1&1&0&1&0&1&0&37\\
1&1&0&0&0&0&0&44\\
1&1&1&0&1&0&0&44\\
1&1&0&1&0&1&0&42\\
1&1&0&0&0&0&0&41
\end{pmatrix}
\begin{pmatrix}\alpha\\\beta_2\\\gamma_2\\\gamma_3\\\delta_{22}\\\delta_{23}\\\lambda_1\\\lambda_2\end{pmatrix}$$
In R the model distribution and link would be specified using Gamma(link="identity"), while the model
formula to specify response and linear predictor could be written as:
y~smoke+drink+smoke:drink + smoke:age
This form is the clearest translation of the model structure into R, but note that any of the following
would give the same model (although the identifiability constraints may change between them, which will
alter the meaning of some of the parameters).
y~smoke*drink+age:smoke
y~smoke*drink+age*smoke
y~smoke*drink+age:smoke-1
. . . and more!

Note that within R you can check the model matrix of a GLM using the model.matrix command. The
argument of model.matrix can be a fitted GLM object, a model formula, or even just the rhs of a
formula. For example
model.matrix(~smoke+drink+smoke:drink+age:smoke)
generates the model matrix given earlier in this section.

4 Using the distributional results

Once a model passes basic residual checking, we're in a position to treat it as a good enough model to do some formal statistics. This involves using the distributional results (8), (12) and (14).

4.1 Confidence interval calculation

Result (8) is useful for finding confidence intervals for model parameters, and for linear transformations of them. Let $\hat{\mathbf{V}}_\beta = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\hat\phi$ be the estimated covariance matrix of $\hat\beta$ ($\phi$ is known to be 1 in some cases). Let $\hat\sigma_{\beta_i}$ be the square root of the $i$th diagonal element of $\hat{\mathbf{V}}_\beta$, i.e. the estimated standard error of $\hat\beta_i$.

Confidence intervals for $\beta_i$. Using standard theory for normally distributed estimators:

1. A $100(1-\alpha)\%$ CI for $\beta_i$, when $\phi$ is known (e.g. Poisson or binomial cases), is
$$\hat\beta_i \pm z_{1-\alpha/2}\,\hat\sigma_{\beta_i},$$
where $z_{1-\alpha/2}$ is the $1-\alpha/2$ critical point of a standard normal distribution.
2. A $100(1-\alpha)\%$ CI for $\beta_i$, when $\phi$ is unknown (e.g. Gaussian or gamma cases), is
$$\hat\beta_i \pm t_{n-\dim(\beta)}(1-\alpha/2)\,\hat\sigma_{\beta_i},$$
where $t_k(1-\alpha/2)$ is the $1-\alpha/2$ critical point of a $t_k$ distribution.

Except in the normal response, identity link case, both results are only approximate, since they are based on (8), which is itself only approximate.

R reports the $\hat\beta_i$ and $\hat\sigma_{\beta_i}$ values in the Estimate and Std.Error columns of the Coefficients table of a glm fitted model summary. The function vcov is used to extract $\hat{\mathbf{V}}_\beta$ from a glm fitted model object.
Confidence intervals for the linear predictor and expected response. Since $\hat\beta$ has an approximately normal distribution, so does any linear transformation of it, such as the linear predictor for the $i$th observation, $\hat\eta_i = \mathbf{X}_i\hat\beta$. There is a standard result that if $\mathbf{Z}$ and $\mathbf{U}$ are random vectors with covariance matrices $\mathbf{V}_z$ and $\mathbf{V}_u$, and $\mathbf{Z} = \mathbf{B}\mathbf{U}$ where $\mathbf{B}$ is a matrix of fixed coefficients, then $\mathbf{V}_z = \mathbf{B}\mathbf{V}_u\mathbf{B}^T$. Applying this implies that
$$\hat\sigma^2_{\eta_i} = \mathrm{var}(\hat\eta_i) = \mathbf{X}_i\hat{\mathbf{V}}_\beta\mathbf{X}_i^T.$$
So we have that $\hat\eta_i \sim N(\eta_i, \sigma^2_{\eta_i})$, approximately. Hence a $100(1-\alpha)\%$ CI for $\eta_i$ is
$$\hat\eta_i \pm t_k(1-\alpha/2)\,\hat\sigma_{\eta_i},$$
where $t_k$ is $t_{n-\dim(\beta)}$ if $\phi$ is unknown, and a standard normal otherwise. Given a CI for $\eta_i$, an equivalent CI for $\mu_i = E(y_i)$ is easily obtained:
$$\left[g^{-1}\!\left(\hat\eta_i - t_k(1-\alpha/2)\hat\sigma_{\eta_i}\right),\; g^{-1}\!\left(\hat\eta_i + t_k(1-\alpha/2)\hat\sigma_{\eta_i}\right)\right].$$
Derivation of this latter interval is easy. If $(a, b)$ is a 95% CI for $\eta_i$, then it includes the true $\eta_i$ with probability 0.95. But in that case $[g^{-1}(a), g^{-1}(b)]$ must include the true $\mu_i = g^{-1}(\eta_i)$ with probability 0.95, making it a 95% CI for $\mu_i$.

The R function predict(mod,type="link",se=TRUE) will return the fitted values $\hat\eta_i$ and associated standard errors $\hat\sigma_{\eta_i}$, for model mod, in elements fit and se.fit of the object it returns. If predict is supplied with new values for the predictors then it will produce predictions and standard errors for the linear predictor corresponding to these, instead.
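Putting the pieces together, a 95% CI for each $\mu_i$ can be computed from the predict() output exactly as described above. A sketch on simulated Poisson data, so $\phi$ is known and the normal critical point applies:

```r
# CI for E(y_i): interval on the linear predictor scale, mapped through g^{-1}.
set.seed(8)
x <- runif(40)
y <- rpois(40, exp(1 + x))
m <- glm(y ~ x, family = poisson)

pr <- predict(m, type = "link", se.fit = TRUE)   # eta-hat and its standard errors
crit <- qnorm(0.975)                             # phi known, so z critical point
lower <- m$family$linkinv(pr$fit - crit * pr$se.fit)
upper <- m$family$linkinv(pr$fit + crit * pr$se.fit)
head(cbind(lower, fit = fitted(m), upper))       # 95% CIs for each E(y_i)
```

Because the log link is monotonic, mapping the endpoints through linkinv preserves the coverage of the interval, as argued above.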

4.2 Single parameter tests

Using the same notation as in the previous section, result (8) is also the basis for simple hypothesis tests about single parameters. Consider testing $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$. Under $H_0$ we have
$$\hat\beta_j \sim N(0, \sigma^2_{\beta_j}).$$
If $\phi$ is known then this becomes
$$\hat\beta_j/\hat\sigma_{\beta_j} \sim N(0, 1),$$
and we can calculate a p-value for the hypothesis test in the usual way, by evaluating
$$\Pr\left[|Z| \geq |\hat\beta_j/\hat\sigma_{\beta_j}|\right] \text{ where } Z \sim N(0, 1).$$
With estimated $\phi$ we use
$$\hat\beta_j/\hat\sigma_{\beta_j} \sim t_{n-\dim(\beta)}$$
under $H_0$, and we can calculate a p-value by evaluating
$$\Pr\left[|T| \geq |\hat\beta_j/\hat\sigma_{\beta_j}|\right] \text{ where } T \sim t_{n-\dim(\beta)}.$$
In R glm summary output, $\hat\beta_j/\hat\sigma_{\beta_j}$ is reported in the t value or z value column, while the corresponding p-values are in the Pr(>|t|) or Pr(>|z|) columns.

4.3 Hypothesis testing by model comparison

Hypothesis tests of the sort developed in section 2.4 are easily performed in R, as follows:

1. Fit the GLM embodying the null hypothesis, m0, say.
2. Fit the model embodying the alternative hypothesis, m1, say. m1 must be an extended version of m0, so that m0 is nested in m1.
3. Use anova(m0,m1,test="Chisq") to compare the models by direct use of a generalized likelihood ratio test (13), if $\phi$ is known. Alternatively use anova(m0,m1,test="F") to compare the models using an F-ratio test (14), when $\phi$ is not known.
4.3.1 Known scale parameter example

Case-control studies are an important type of study, in which a group of patients with some disease (the
cases) are compared to a randomly selected group of healthy subjects from the same population as the
cases (the controls). Variables that might be associated with the disease are also collected for all subjects.
If a variable is really associated with the disease then it ought to be predictive of whether a randomly
selected patient in the study is a case or a control. Such predictivity can be assessed using GLMs, of the
logistic regression type.
For example, consider a study looking at 143 cases of malignant melanoma (a serious skin cancer) in white male patients aged 25 to 55, classified according to skin type (A, B or C, for Celtic, middle European or Mediterranean type), compared to 356 white male controls aged 25-55 (selected without further reference to age or skin type). Patients were divided into 3 groups according to an age factor variable, as well as being classified into 3 groups by the skin factor variable (so there are 9 groups in total). The data are in a data frame md1:
  mel   age skin  n
1  15 25-35    A 54
2   8 25-35    B 52
3   7 25-35    C 44
4  26 35-45    A 75
5  18 35-45    B 52
6   6 35-45    C 42
7  30 45-55    A 67
8  25 45-55    B 66
9   8 45-55    C 47

Consider testing the null hypothesis that skin type is not associated with melanoma, against the
alternative that it is. If we neglect the possibility of an interaction then the null model
mel0 <- glm(mel/n~age,family=binomial,data=md1,weights=n)
is compared to the alternative
mel1 <- glm(mel/n~age+skin,family=binomial,data=md1,weights=n)
using
> anova(mel0,mel1,test="Chisq")
Analysis of Deviance Table

Model 1: mel/n ~ age
Model 2: mel/n ~ age + skin
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         6    20.5062
2         4     3.4389  2  17.0673    0.0002
test="Chisq" specifies that a generalized likelihood ratio test is to be performed using (13). Here the
p-value is very low: there is a very small probability of observing this large a difference in deviance
between the two models if mel0 really generated the data. This strongly suggests that mel0 is incorrect.
In other words, there is strong evidence in favour of mel1 and an effect of skin type on melanoma risk.
The next step would be to examine the model coefficients to ascertain the nature and size of the effect.
Note that these case-control studies can only be used to look at the relative risk of melanoma given
different risk factors. The study tells us nothing about the absolute risk of melanoma, because we have
chosen the ratio of cases to controls, rather than observing it in the population of interest.
4.3.2 Unknown scale parameter testing example

The dataset motori is derived from the dataset motorins from the R library faraway (available from cran.r-project.org to accompany "Extending the Linear Model with R" by Julian Faraway). It contains insurance company data from Sweden, on payouts (Payment) in relation to the number Insured, km travelled (a numeric variable with 5 discrete values), Make of car (a factor variable with 9 levels) and number of years' no-claims Bonus. An initial model for the data is
$$E(\text{Payment}_i) = \text{Insured}_i \times \text{risk}_i, \quad\text{where } \text{Payment}_i \sim \text{gamma},$$
so
$$\log\{E(\text{Payment}_i)\} = \log(\text{Insured}_i) + \log(\text{risk}_i).$$
$\log(\text{Insured}_i)$ is an example of a model offset: a predictor variable whose coefficient is fixed at 1. $\log(\text{risk}_i)$ can be modelled using a linear model structure to give
$$\log\{E(\text{Payment}_i)\} = \log(\text{Insured}_i) + \beta_j\,\text{km}_i + \alpha_j + \gamma\,\text{Bonus}_i, \text{ if } i \text{ is from Make } j.$$
Consider testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_9$ against the alternative that the $\beta_j$ are not all equal. First fit models embodying the two hypotheses:

gl <- glm(Payment~offset(log(Insured))+km*Make+Bonus,Gamma(link="log"),motori)
g0 <- glm(Payment~offset(log(Insured))+km+Make+Bonus,Gamma(link="log"),motori)


Since $\phi$ is not known for the gamma, we now perform an F-ratio test comparing the models:

> anova(g0,gl,test="F")
Analysis of Deviance Table

Model 1: Payment ~ offset(log(Insured)) + km + Make + Bonus
Model 2: Payment ~ offset(log(Insured)) + km * Make + Bonus
  Resid. Df Resid. Dev Df Deviance     F Pr(>F)
1       284    155.056
2       276    151.890  8    3.166 0.752 0.6455
It appears that the dependence of claim rate on km travelled can be assumed not to vary with car make.

5 Model selection more generally

The hypothesis tests considered above are examples of rather simple model selection problems. A question
was formulated in terms of which of two alternative versions of a model generated the observed response
data, and one of the models was selected (with preference being given to the simpler model). Often our
uncertainty does not amount to a straightforward choice between two alternatives, but instead we wish
to find the best model for a set of data, from some rather large set of possibilities.
Typically we can write down the most complex model we think is reasonable for a set of data, but
believe that a number of the model coefficients should really be zero. Model selection is about identifying
which coefficients those are. There are two basic strategies:
1. We may want to favour simplicity, and try to find the simplest model that we can get away with
from the available possibilities. This suggests developing approaches based on successive application
of the hypothesis testing methods already developed.
2. We may simply want to find the best model for prediction from among the candidates. That is
we wish to try to find the model that would be best at predicting new data from the system that
generated the original data. This can be done by selecting between models on the basis of Akaikes
Information Criterion, or the alternative Bayesian Information Criterion.

5.1 Hypothesis testing based model selection

The method of backward selection is one way of performing model selection using hypothesis testing
methods. It works like this:
1. Decide on a threshold p-value, , below which a term will always be retained in a model.
2. Fit the largest model under consideration.
3. Evaluate p-values for testing equality to zero of all model terms, except factor variables or their interactions that are involved in higher order interactions in the model (e.g. don't look at p-values for two factors if their interaction is present in the model).
4. Refit the model without the single term with the highest p-value above . If there are no such
terms to drop then stop. Otherwise return to step 3.
Notice that only one term at a time is dropped at step 4. This is very important. If two covariates are
highly correlated then dropping one can make a major difference to the p-value associated with the other.
This means that dropping more than one variable at a time is dangerous.
The p-values at step 3 are obtained either from the t-ratio or z-ratio results of section 4.2, or by
comparing models with and without the term concerned, using the GLRT or F-ratio results of section
4.3. For terms involving factor variables, the second option is usually the only viable one, since such
terms usually have multiple coefficients.
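The backward selection recipe can be automated with drop1() along the following lines. This is an illustrative sketch on simulated Poisson data (known scale, so test="Chisq" applies); the p-value column is taken as the last column of the drop1 table.

```r
# A minimal backward-selection loop, dropping one term at a time (steps 1-4 above).
set.seed(9)
d <- data.frame(x1 = runif(80), x2 = runif(80), x3 = runif(80))
d$y <- rpois(80, exp(1 + d$x1))               # only x1 really matters here
m <- glm(y ~ x1 + x2 + x3, family = poisson, data = d)

alpha <- 0.05                                 # retention threshold
repeat {
  tab <- drop1(m, test = "Chisq")
  p <- tab[[ncol(tab)]][-1]                   # p-values (last column; drop <none> row)
  if (all(p < alpha)) break                   # every remaining term is retained
  worst <- rownames(tab)[-1][which.max(p)]    # single term with largest p-value
  m <- update(m, as.formula(paste(". ~ . -", worst)))
}
formula(m)                                    # x1 should survive the selection
```

Note that the loop removes exactly one term per iteration and then recomputes all the p-values, as the discussion above requires.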

As an example of backwards selection consider the semiconductor electrical resistance data given in
Faraway (2005) as the wafer data frame. Four factors (x1 to x4) in the manufacturing process were
believed to influence semiconductor resistance, resist, and an experiment was conducted to try out all
combinations of two levels of each. The data are as follows:
> wafer
   x1 x2 x3 x4 resist
1   -  -  -  -  193.4
2   +  -  -  -  247.6
3   -  +  -  -  168.2
4   +  +  -  -  205.0
5   -  -  +  -  303.4
6   +  -  +  -  339.9
7   -  +  +  -  226.3
8   +  +  +  -  208.3
9   -  -  -  +  220.0
10  +  -  -  +  256.4
11  -  +  -  +  165.7
12  +  +  -  +  203.5
13  -  -  +  +  285.0
14  +  -  +  +  268.0
15  -  +  +  +  169.1
16  +  +  +  +  208.5
An initial model is fitted with all interactions of the factors up to 3rd order. i.e.
wm <- glm(resist~x1*x2*x3+x1*x3*x4+x1*x2*x4+x2*x3*x4,
Gamma(link="log"),data=wafer)
Applying backward selection, we now want to calculate the p-values associated with dropping each 3-way interaction from the full model (one at a time!). The drop1 function in R is a very convenient way of doing this. It goes through all the terms in the model that can be dropped on their own, refits without each of them in turn, and compares each resulting fit with the original model:
> drop1(wm,test="F")
Single term deletions

Model:
resist ~ x1 * x2 * x3 + x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 * x4
         Df Deviance     AIC F value  Pr(F)
<none>        0.008 129.726
x1:x2:x3  1   0.009 127.764  0.0380 0.8775
x1:x3:x4  1   0.011 128.035  0.3094 0.6769
x1:x2:x4  1   0.029 130.144  2.4173 0.3639
x2:x3:x4  1   0.011 128.012  0.2867 0.6871

The rows of the table are labelled with the names of the dropped terms. The reported p-value in the final
row is for testing the null hypothesis that the model without the dropped term is adequate (against the
alternative that the full model is needed). The AIC for each model under consideration is also reported.
The p-values suggest dropping the interaction x1:x2:x3, and
wm1 <- glm(resist~x1*x3*x4+x1*x2*x4+x2*x3*x4,
Gamma(link="log"),data=wafer)
achieves this: note that although x1*x2*x3 has been omitted, this only causes dropping of the 3 way
interaction, since the main effects and two way interactions implied by the omitted term are duplicated
in terms that are left in the model formula. The p-values are now re-calculated for the re-fitted model.

> drop1(wm1,test="F")
Single term deletions

Model:
resist ~ x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 * x4
         Df Deviance     AIC F value  Pr(F)
<none>        0.009 128.322
x1:x3:x4  1   0.011 126.918  0.5961 0.5208
x1:x4:x2  1   0.029 130.981  4.6577 0.1636
x3:x4:x2  1   0.011 126.874  0.5523 0.5348

Notice how they have changed from the previous set of p-values. Now the x3:x4:x2 interaction is the
one to drop. Repeating these steps we eventually end up with
> wm7 <- glm(resist~x1*x3+ x3*x4 +x2*x3,Gamma(link="log"),data=wafer)
> drop1(wm7,test="F")
Single term deletions

Model:
resist ~ x1 * x3 + x3 * x4 + x2 * x3
       Df Deviance     AIC F value   Pr(F)
<none>      0.036 139.199
x1:x3   1   0.060 142.540  5.3311 0.04977 *
x3:x4   1   0.069 144.510  7.2970 0.02702 *
x3:x2   1   0.067 144.061  6.8491 0.03079 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So if α = 0.05, this is the final model: all main effects are present, along with 3 two-way interactions.
Once a model has been selected, then the coefficient estimates would be examined and interpreted,
possibly with the aid of confidence intervals. Note however that inference performed with the final fitted
model tends to overstate what can really be concluded from the data, since it does not allow for the
uncertainty in model selection.
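For reference, the F value that drop1 reports for dropping a term is formed from the change in deviance between the two models, scaled by the estimated scale parameter. A minimal Python sketch of this calculation (the scale value 0.0045 below is an assumed illustration, roughly consistent with the drop1(wm7) table, not a quoted output):

```python
def f_stat(dev_reduced, dev_full, df_diff, scale_est):
    # F = ((D_reduced - D_full) / df_diff) / scale_estimate,
    # the statistic drop1(..., test="F") reports for each dropped term
    return ((dev_reduced - dev_full) / df_diff) / scale_est

# Deviances for dropping x1:x3 from wm7 (0.060 vs 0.036); scale_est is assumed
f = f_stat(0.060, 0.036, 1, 0.0045)
print(round(f, 2))  # about 5.33, close to the 5.3311 reported by drop1
```

The p-value then comes from referring this statistic to the appropriate F distribution, as covered in section 4.3.2.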
Forward selection is sometimes used as an alternative to backward selection. It starts from a very
simple model, and tries adding terms, using hypothesis tests to ascertain whether an addition is worthwhile. The difficulty with it is that the distributional results on which the hypothesis tests are based rely
on at least the alternative model being correct (if possibly over-complex). With forward selection this
assumption is always violated at the outset.

5.2 Prediction error based model selection

Suppose that a vector of response data y was really generated from a pdf f0(y), and that a GLM for y implies that it was generated from pdf fθ(y). A measure of the mismatch between model and reality is provided by the Kullback-Leibler distance:

K = ∫ {log[f0(y)] - log[fθ(y)]} f0(y) dy.
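To make the definition concrete, here is a small Python sketch (the function names and the normal-density example are mine, not from the text) that approximates K numerically for two normal densities, where a closed-form answer is available to check against:

```python
import math

def npdf(x, mu, sd):
    # Normal density, standing in for f0 and f_theta
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def kl_numeric(mu0, sd0, mu1, sd1, lo=-20.0, hi=20.0, n=40000):
    # Midpoint Riemann-sum approximation of K = int {log f0 - log f1} f0 dy
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        f0 = npdf(y, mu0, sd0)
        f1 = npdf(y, mu1, sd1)
        if f0 > 0.0 and f1 > 0.0:
            total += (math.log(f0) - math.log(f1)) * f0 * h
    return total

# K between N(0,1) and N(1,1) has the closed form (mu0 - mu1)^2 / 2 = 0.5
print(kl_numeric(0, 1, 1, 1))  # approximately 0.5
```

Note that K is zero only when the model density matches the true density, and grows as the model moves further from the truth.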
A model with a low K is obviously a good thing. In fact it is possible to estimate the expected K-L distance for any particular model fit. It can be shown that selecting between models in order to minimize this estimated expected K-L distance amounts to selecting models on the basis of their ability to minimize Akaike's Information Criterion:

AIC = -2l(θ̂) + 2dim(θ),

where l(θ̂) is the maximized log-likelihood of the model.
An alternative to AIC is the Bayesian Information Criterion, BIC, which penalizes model complexity more heavily. It is defined as

BIC = -2l(θ̂) + log(n) dim(θ).
AIC and BIC can be used in place of hypothesis testing in backward model selection: the model with
the lowest AIC/BIC score always being the one selected at each stage. Unlike hypothesis testing methods
AIC and BIC can be used to compare models that are not nested (although the comparisons are a bit
more reliable in the nested case).
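To illustrate the difference in penalties, the following Python sketch scores three nested candidate models by AIC and BIC; the log-likelihood values and parameter counts are made up for illustration, not taken from any fit in these notes:

```python
import math

def aic(loglik, k):
    # AIC = -2*l(theta-hat) + 2*dim(theta)
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # BIC = -2*l(theta-hat) + log(n)*dim(theta)
    return -2.0 * loglik + math.log(n) * k

# Hypothetical maximized log-likelihoods and parameter counts for nested models
models = {"small": (-62.0, 5), "medium": (-57.0, 9), "large": (-56.5, 13)}
n = 16  # sample size (the wafer data has 16 rows)

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n))
print(best_aic, best_bic)  # AIC keeps the medium model, BIC prefers the small one
```

Since log(16) ≈ 2.77 > 2, each extra parameter costs more under BIC here, which is why it settles on the smaller model.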
As an example of using AIC, let's redo the wafer model selection example using AIC for backwards selection.
> library(faraway)
> data(wafer)
> wm <- glm(resist~x1*x2*x3+x1*x3*x4+x1*x2*x4+x2*x3*x4,
+           Gamma(link="log"),data=wafer)
> drop1(wm)
Single term deletions

Model:
resist ~ x1 * x2 * x3 + x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 *
    x4
         Df Deviance     AIC
<none>         0.008 129.726
x1:x2:x3  1    0.009 127.764
x1:x3:x4  1    0.011 128.035
x1:x2:x4  1    0.029 130.144
x2:x3:x4  1    0.011 128.012
Since no test was specified drop1 simply evaluates the AIC for the full model (wm in this case) and versions
of it omitting all possible single terms. The model with the lowest AIC is then selected. In this instance
it is the model that omits x1:x2:x3, so that term would be dropped. The easiest way to refit a model
omitting some terms is to use the update function, as follows...
> wm1 <- update(wm,.~.-x1:x2:x3)
update refits a model identical to wm, except that it has no x1:x2:x3 interaction term, and returns the
resulting fitted model object, here stored in wm1. The . on the lhs of the update model formula indicates
that the response variable is as for wm, while the . on the rhs indicates that the linear predictor should
be exactly as for wm, except for the term explicitly omitted by -x1:x2:x3.
Now wm1 can be used like any other fitted glm object.
> drop1(wm1)
Single term deletions

Model:
resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4 + x2:x3:x4
         Df Deviance     AIC
<none>         0.009 128.322
x1:x3:x4  1    0.011 126.918
x1:x2:x4  1    0.029 130.981
x2:x3:x4  1    0.011 126.874
Notice one wrinkle: the AIC reported for wm1 is 128.322, but when we used drop1 before on wm, it
suggested that the AIC for wm1 would be 127.764. This happens because we need a scale parameter
estimate in order to evaluate the AIC, and drop1 always uses the same estimate for all the models it
compares, based on the largest model it is considering. Hence the two calls to drop1 give different AIC
estimates for the same model, because the different calls are using different scale parameter estimates.
(This has nothing to do with having used update, by the way.) Strictly speaking the AIC should be evaluated using the MLE of φ, in which case this problem would not occur, but it makes the computation much more expensive (and less reliable) if we do this. Of course the problem does not arise if φ is known.
Continuing . . .
> wm2 <- update(wm1,.~.-x2:x3:x4)
> drop1(wm2)
Single term deletions

Model:
resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4
         Df Deviance     AIC
<none>         0.011 130.224
x2:x3     1    0.043 136.823
x1:x3:x4  1    0.014 128.925
x1:x2:x4  1    0.031 133.701

> wm3 <- update(wm2,.~.-x1:x3:x4)
> drop1(wm3)
Single term deletions

Model:
resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
    x3:x4 + x2:x4 + x1:x2:x4
         Df Deviance     AIC
<none>         0.014 131.582
x1:x3     1    0.038 136.723
x2:x3     1    0.045 138.879
x3:x4     1    0.047 139.386
x1:x2:x4  1    0.034 135.503
. . . at which point we would select wm3 and proceed to examine its coefficients and interpret the model fit.
Notice that the AIC selected model is quite a bit more complex than the model selected by hypothesis
testing. This is typical. BIC, in contrast, selects simpler models than AIC, and for large sample sizes
can select simpler models than hypothesis testing based methods as well (although this depends on the α level used, of course).
Given the rather algorithmic nature of the selection process, it is possible to automate it entirely.
The step function will perform the whole backwards selection-by-AIC process for you, with one function
call. . .
> wm3a <- step(wm)
Start:  AIC= 129.73
resist ~ x1 * x2 * x3 + x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 *
    x4

           Df Deviance     AIC
- x1:x2:x3  1    0.009 127.764
- x2:x3:x4  1    0.011 128.012
- x1:x3:x4  1    0.011 128.035
<none>           0.008 129.726
- x1:x2:x4  1    0.029 130.144

Step:  AIC= 128.32
resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4 + x2:x3:x4

           Df Deviance     AIC
- x2:x3:x4  1    0.011 126.874
- x1:x3:x4  1    0.011 126.918
<none>           0.009 128.322
- x1:x2:x4  1    0.029 130.981

Step:  AIC= 130.22
resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4

Call:  glm(formula = resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 +
    x1:x4 + x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4, family = Gamma(link = "log"),
    data = wafer)

Coefficients:
(Intercept)          x1+          x2+          x3+          x4+      x1+:x2+
     5.2586       0.2843      -0.1273       0.4626       0.1502      -0.1228
    x1+:x3+      x2+:x3+      x1+:x4+      x3+:x4+      x2+:x4+  x1+:x3+:x4+
    -0.2071      -0.1783      -0.1852      -0.2339      -0.1863       0.1018
x1+:x2+:x4+
     0.2845

Degrees of Freedom: 15 Total (i.e. Null);  3 Residual
Null Deviance:      0.6978
Residual Deviance: 0.01108     AIC: 130.2
Notice that the finally selected model is a little different from the one that was selected using drop1. This is again down to how the scale parameter is handled: it is done differently in step and drop1, and for this model this has made a slight difference to the finally selected model.
Finally note that another possibility with AIC/BIC is all subsets model selection, in which every
possible sub-model of the most complex model is considered, and the one with the lowest AIC/BIC is
finally selected.
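The all-subsets idea can be sketched in a few lines of Python; everything here (the term names and the stand-in AIC function) is hypothetical, since a real implementation would refit the GLM and evaluate its actual AIC for every subset:

```python
from itertools import combinations

# All-subsets search over candidate model terms, scored by a stand-in AIC
terms = ["x1", "x2", "x3", "x4"]

def fake_aic(subset):
    # Placeholder: in practice this would refit the GLM and return its AIC.
    # Here we pretend x1 and x3 each improve the fit, and penalize complexity.
    penalty = 2 * len(subset)
    fit = -10 * len(set(subset) & {"x1", "x3"})
    return 100 + fit + penalty

best = None
for r in range(len(terms) + 1):
    for subset in combinations(terms, r):
        score = fake_aic(subset)
        if best is None or score < best[0]:
            best = (score, subset)
print(best)  # the subset ("x1", "x3") wins under this toy scoring
```

With p candidate terms there are 2^p subsets, so exhaustive search quickly becomes impractical as p grows; that is one reason stepwise strategies remain popular.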

5.3 Remarks on model selection

Finally, let's review the reasons for model selection.


1. We do model selection because we are often uncertain about the exact form that a model should take, even though it is often possible to write down a model that we expect to be complicated enough, so that for some parameter values it should be a reasonable approximation to the truth.
2. Selection is important for interpretational reasons: simpler models are easier to interpret than complex ones.
3. Model selection also tends to improve the precision of estimates and the accuracy of model predictions. If a model contains more terms than necessary then we will inevitably use up information in the data in estimating the associated coefficients, which in turn means that the important terms are less precisely estimated.

Whether model selection is performed using AIC or hypothesis testing depends on the purposes of the
analysis. If we want to develop a model for prediction purposes then it makes sense to use AIC, but if
our interest lies in trying to understand relationships between the predictors and the response, it may be
preferable to use hypothesis testing based methods to try and avoid including model terms unless there
is good evidence that they are needed.
Finally note that there is a difficult problem associated with model selection:
It is common practice to use model selection methods to choose one model from a large set of
potential models, but then to treat the selected model exactly as if it were the only model we
ever considered, when it comes to calculating confidence intervals etc. In doing this we neglect the
uncertainty associated with model selection, and will therefore tend to overstate how precisely we
know the coefficients of the selected model (and how precise its predictions are). This issue is an
active area of current statistical research.

5.4 Interpreting model coefficients

Once a model is selected and checked, then usually you will want to examine and interpret its estimated
coefficients. For many of the examples that we have met the interpretation of parameters is obvious, but
for complex models it can be less easy, especially with factor variables when identifiability constraints
are needed. The failsafe way to check the meaning of each coefficient in practical modelling is to examine
the model matrix. For example, the summary for model wm3 from the previous section is:
> summary(wm3)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.27133    0.05060 104.176 5.09e-08 ***
x1+          0.25858    0.06532   3.958  0.01670 *
x2+         -0.12711    0.06532  -1.946  0.12354
x3+          0.43720    0.05843   7.483  0.00171 **
x4+          0.12466    0.06532   1.908  0.12899
x1+:x2+     -0.12222    0.08263  -1.479  0.21321
x1+:x3+     -0.15622    0.05843  -2.674  0.05559 .
x2+:x3+     -0.17825    0.05843  -3.051  0.03800 *
x1+:x4+     -0.13430    0.08263  -1.625  0.17941
x3+:x4+     -0.18306    0.05843  -3.133  0.03508 *
x2+:x4+     -0.18610    0.08263  -2.252  0.08743 .
x1+:x2+:x4+  0.28450    0.11686   2.435  0.07162 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
...
This can look intimidating, although in fact the parameter names are pretty helpful here. They basically
tell you the circumstance under which the coefficient of a factor will be added to the model. For example
if x1 is in the + state for some response measurement, then we include the x1+ term in the model (which
just amounts to adding 0.25858 to the linear predictor in this case, since x1 is a factor). If x1 and x2 are
in the + state for some response measurement, then terms x1+, x2+ and x1+:x2+ are included, and so on.
To make things completely clear, however, look at the model matrix (and original data frame).

> wafer
   x1 x2 x3 x4 resist
1   -  -  -  -  193.4
2   +  -  -  -  247.6
3   -  +  -  -  168.2
4   +  +  -  -  205.0
5   -  -  +  -  303.4
6   +  -  +  -  339.9
7   -  +  +  -  226.3
8   +  +  +  -  208.3
9   -  -  -  +  220.0
10  +  -  -  +  256.4
11  -  +  -  +  165.7
12  +  +  -  +  203.5
13  -  -  +  +  285.0
14  +  -  +  +  268.0
15  -  +  +  +  169.1
16  +  +  +  +  208.5
> model.matrix(wm3)
   (Intercept) x1+ x2+ x3+ x4+ x1+:x2+ x1+:x3+ x2+:x3+ x1+:x4+ x3+:x4+ x2+:x4+ x1+:x2+:x4+
1            1   0   0   0   0       0       0       0       0       0       0           0
2            1   1   0   0   0       0       0       0       0       0       0           0
3            1   0   1   0   0       0       0       0       0       0       0           0
4            1   1   1   0   0       1       0       0       0       0       0           0
5            1   0   0   1   0       0       0       0       0       0       0           0
6            1   1   0   1   0       0       1       0       0       0       0           0
7            1   0   1   1   0       0       0       1       0       0       0           0
8            1   1   1   1   0       1       1       1       0       0       0           0
9            1   0   0   0   1       0       0       0       0       0       0           0
10           1   1   0   0   1       0       0       0       1       0       0           0
11           1   0   1   0   1       0       0       0       0       0       1           0
12           1   1   1   0   1       1       0       0       1       0       1           1
13           1   0   0   1   1       0       0       0       0       1       0           0
14           1   1   0   1   1       0       1       0       1       1       0           0
15           1   0   1   1   1       0       0       1       0       1       1           0
16           1   1   1   1   1       1       1       1       1       1       1           1
It is now clear that the intercept is the expected resistance for a wafer where none of the factors are
in the + state. The coefficients x1+ to x4+ give the expected increase in resistivity when just one of the
factors is in the + state (referring back to the summary, x1 and x3 seem to lead to a significant increase
in resistance, on their own). So what about the interactions? Look at x1+:x2+ as an example: it is an adjustment that is added on when x1 and x2 are both in the + state together, i.e. it is how much the expected resistivity differs from what you would expect if the effects of x1 and x2 both being + simply added to each other. Referring back to the summary, it seems that when two factors are in the + state, the
resistivity is lower than you would expect from just looking at their effects on their own (although not
all the interaction coefficients are significantly different from 0).
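Since wm3 uses a log link, each coefficient also has a multiplicative interpretation on the response scale: moving a factor to its + state multiplies the expected resistance by exp(β). A quick Python check of the x1+ effect, using its estimate from the summary above:

```python
import math

# With a log link, setting x1 to "+" multiplies the expected resistance
# by exp(beta), since the linear predictor is the log of the mean.
beta_x1 = 0.25858  # the x1+ estimate from summary(wm3)
multiplier = math.exp(beta_x1)
print(round(multiplier, 3))  # about 1.295, i.e. roughly a 29.5% increase
```

This response-scale reading is often easier to communicate than the additive effect on the linear predictor.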
Confidence interval calculation for parameters is covered in section 4.1.

Index

AIC, 5, 7, 10, 25, 27, 29
AIDS example, 3
BIC, 25, 28, 29
binomial distribution, 4, 14
binomial GLMs in R, 9, 24
    fitted values, 11
blood pressure example, 21
canonical parameter, 12
case-control study, 23
confidence intervals
    and scale parameter, 22
    for fitted values, 7, 22
    for linear transformations of β̂, 22
    for parameters, 22
    transforming scale, 22
correlated covariates, 25
deviance, 8, 9, 17, 19
    approximate distribution, 10, 17
    null, 10
    proportion explained, 10
    scaled, 17
distribution of β̂, 4, 16
exponential family, 12
    log likelihood, 13
    mean, 13
    variance, 13
F test, 17
formula
    I(), 20
gamma distribution, 3, 21
generalized likelihood ratio test, 5, 12, 16, 17, 24
GLM
    for binomial data, 9
    in R, 5
    likelihood, 14
    standard form, 2, 14
harrier example, 3, 6
heart attack example, 3, 8
hypothesis test
    χ², 16, 24
    F ratio, 17
    model comparison, 23
hypothesis testing, 5, 16
    known scale parameter, 24
    single parameter, 23
    unknown scale parameter, 25
information matrix, 16
interpreting model coefficients, 31
IRLS, 4, 15
    starting values, 15
iterative weights, 5, 15, 18
Kullback-Leibler distance, 27
leverage, 19
likelihood, 13
likelihood of GLM, 4
linear predictor, 2, 21
link
    identity, 21
link function, 2
    default, 9
    inverse, 3, 6
    logit, 4
log likelihood, 17
logistic regression model, 4, 23
maximum likelihood estimation, 4, 14
    large sample results, 16
melanoma example, 23
model checking, 5
model comparison, 5, 23
model matrix, 21, 31
model selection, 25
    all subsets, 30
    backward selection, 25
    conditioning on final model, 31
    forward selection, 27
    motivation, 30
    prediction error based, 27
p-value, 10, 12, 23, 24
Pearson statistic, 18
Poisson distribution, 3
pseudodata, 15, 18
QQ plots, 19
R, 5
R function
    anova, 12, 24
    cbind, 9
    drop1, 26
    glm, 5, 24, 26
    I(), 6
    model.matrix, 31
    plot, 5, 7
    predict, 7, 22
    step, 29
    summary, 6, 31
    update, 28
    with, 5
R model formula, 6, 20, 28
r², 10
residual plots, 7, 10, 19
residuals, 5, 18
    deviance, 19
    outliers, 20
    Pearson, 5, 18
    raw, 18
    standardized, 18
response scale, 7
saturated model, 17
scale parameter, 2, 8, 12, 17
    estimate, 18
    known, 16
    unknown, 17, 29
variance function, 2, 13
wafer example, 26, 28