C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca
Chapter 1

Maximum Likelihood Estimation & AIC and model selection - a primer
• in large samples is fully efficient, i.e. extracts all the information possible
from data.
• provides a standard way to obtain estimates of precision (se) for any data
collection scheme along with a way to estimate confidence intervals.
However, MLE is not a panacea and there are several “problems” with MLE
that may require other methods (such as Bayesian methods):
• The properties of MLE in small samples may be neither optimal nor exact.
The first step in using MLE is to define a probability model for your data. This is
crucial because if your probability model is wrong, then all further inference will
also be incorrect. Statisticians recognize that all probability models are wrong,
but hope that they are close enough to reality to be useful. Consequently, an important part of MLE is model assessment – how well does the model fit the data? This won’t be covered in this primer – contact me for details.
For example, consider the problem of estimating the density of objects (grey
dots) in a region as shown on the next page
© 2010 Carl James Schwarz
Of course in real life the locations of all the objects would be unknown
because if the locations (and number of objects) were known, then it would be a
simple matter to count all of the dots and divide by the study area to obtain the
density value! Consequently, three small areas were randomly selected from the area (how would this be done?) and the number of objects within the sampling quadrats was counted as shown on the next page:
This gives three counts, 4, 10, and 15. Note that the sampling protocol has
to be explicit on what to do with objects that intersect the boundary of the
sampling frame – this will not be discussed further in these notes.
A MLE approach would start with a model for the data. In many cases,
counts of objects follow a Poisson distribution. This would be true if objects
occurred “at random” on the landscape. It is not always true – for example,
objects could tend to fall in clusters (new plants growing from seeds from a
parent plant), or objects could be more spread out (competition for resources)
than random. In both cases, a more complex probability model could be fit. If
a simple Poisson model is used, an important part of the MLE process is model
assessment as noted earlier.
A Poisson distribution depends upon the parameter µ which (in this case) is
the mean number of dots per sampling unit or the true density in the population.
Let Y represent the random variable for the number of objects found in each
sample. If the value of µ is known, then the probability of seeing Y = y objects
in a sample follows the Poisson probability law:

P(Y = y | µ) = e^(−µ) µ^y / y!

For example, if µ = 1, then P(Y = 2 | µ = 1) = e^(−1) · 1^2 / 2! = .184.
Of course, the value of µ is NOT known and this is the parameter that we
would like to estimate from our sample of size 3.
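As a quick check of the arithmetic, the Poisson probability above can be computed in a few lines of code. (Python is not used in these notes; this sketch is purely illustrative.)

```python
import math

def poisson_prob(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y! for the Poisson distribution."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# The worked example from the text: mu = 1, y = 2
p = poisson_prob(2, 1.0)
print(round(p, 3))  # 0.184
```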
Note that the probability function sums to 1 over the possible values of Y. The likelihood function is a function of µ given the data and does NOT sum to 1 over the possible values of µ. [Note that while the possible values of Y were 0, 1, 2, 3, . . ., the possible values of µ are the non-negative real line, i.e. 0 ≤ µ.]
So if our sample values were 4, 10 and 15, the likelihood function is:

L(µ | 4, 10, 15) = [e^(−µ) µ^4 / 4!] × [e^(−µ) µ^10 / 10!] × [e^(−µ) µ^15 / 15!]
Fisher argued that a sensible estimate for µ would be the value that MAXI-
MIZES the LIKELIHOOD (that is why this procedure is called Maximum Like-
lihood). This intuitively says that the best guess for the unknown parameter
maximizes the joint probability of the data and is the most consistent with the
data.
The MLE could be found using graphical methods, i.e. plot L as a function
of µ and find the value of µ that maximizes the curve as shown below:
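A crude version of this graphical search is to evaluate the log-likelihood on a fine grid of µ values and keep the best one. A sketch (illustrative only, using the counts 4, 10 and 15):

```python
import math

counts = [4, 10, 15]

def log_likelihood(mu, data):
    # Sum of Poisson log-probabilities: -mu + y*log(mu) - log(y!)
    return sum(-mu + y * math.log(mu) - math.lgamma(y + 1) for y in data)

# Evaluate on a fine grid of candidate mu values and keep the best
grid = [0.01 * i for i in range(1, 3001)]  # 0.01 .. 30.00
best_mu = max(grid, key=lambda m: log_likelihood(m, counts))
print(best_mu)  # close to the sample mean 9.67
```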
Recall from calculus that the maximum of a function often occurs where the derivative (the slope of the curve) is zero. As the function increases toward the maximum, the derivative is positive; as the function decreases away from the maximum, the slope is negative. The slope switches from positive to negative at the maximum.
It is easier to work with the logarithm of the likelihood: the logarithm converts the product of terms into a sum, and the value of µ that maximizes log(L) also maximizes L, which simplifies the problem:

log(L) = log(L(µ | Y1, Y2, Y3)) = [−µ + Y1 log(µ) − log(Y1!)] + [−µ + Y2 log(µ) − log(Y2!)] + [−µ + Y3 log(µ) − log(Y3!)]
       = −3µ + (Y1 + Y2 + Y3) log(µ) − log(Y1!) − log(Y2!) − log(Y3!)
To find the maximum of log(L), find the first derivative with respect to µ, equate to 0, and solve for µ:

∂/∂µ log(L) = −3 + (Y1 + Y2 + Y3)/µ = 0

or (after rearranging the above) we have

µ̂ = (Y1 + Y2 + Y3)/3 = Ȳ
or the sample average. The circumflex over the parameter is the usual convention to indicate that the value is an estimator derived from data rather than a known parameter value, i.e. µ̂ is a best guess for µ.
Whew! This seems like a lot of work to get an “obvious” result, but the advantages of MLE will become clearer in future examples.
So in our example, µ̂ = (4 + 10 + 15)/3 = 9.67.
This doesn’t seem too helpful, as it depends upon the value of µ, so the
observed information is found by substituting in the value of the MLE:
OI = (Y1 + Y2 + Y3)/µ̂^2

which after some arithmetic gives

OI = 3/µ̂
In our case, we have se = √(9.67)/√3 = 1.80.
This looks a bit odd compared to the familiar se(Ȳ) = s/√n for samples taken from a simple random sample. However, one of the properties of the Poisson distribution is that the standard deviation of the data is equal to √µ, so in fact the se of the MLE is not that different.
Once the se is found, then confidence intervals can be found in the usual way, i.e. an approximate 95% confidence interval is µ̂ ± 2se. There are other ways to find confidence intervals using the likelihood function directly (called profile intervals) but these are not covered in this review.
Don’t forget that model assessment must be performed to ensure that our choice of a Poisson distribution is a sensible probability model!
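The se and confidence interval calculations above can be sketched in a few lines (illustrative code, not part of the original notes):

```python
import math

counts = [4, 10, 15]
mu_hat = sum(counts) / len(counts)   # MLE = sample mean = 9.67

# Observed information for the Poisson model: OI = 3 / mu_hat
oi = len(counts) / mu_hat
se = 1 / math.sqrt(oi)               # = sqrt(mu_hat / 3), approx 1.80

# Approximate 95% confidence interval: mu_hat +/- 2 se
ci = (mu_hat - 2 * se, mu_hat + 2 * se)
print(round(se, 2), [round(x, 2) for x in ci])
```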
In more complex problems, a closed-form solution for the MLE may not exist. However, finding the MLE, the information matrix, and the se can be done using numerical methods. For example, consider the PoissonEqual tab in the Excel workbook available at: http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls and a portion reproduced below. While Excel is NOT the best tool for general maximum likelihood work, it will suffice for our simple examples.
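For readers without Excel, the same maximization can be done by solving the score equation −3 + (Y1 + Y2 + Y3)/µ = 0 numerically. The bisection sketch below is one stdlib-only alternative to the Solver (illustrative, not part of the original notes):

```python
counts = [4, 10, 15]

def score(mu):
    # First derivative of the Poisson log-likelihood
    return -len(counts) + sum(counts) / mu

# Bisection: the score is positive below the MLE and negative above it
lo, hi = 1e-6, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid

mu_hat = (lo + hi) / 2
print(round(mu_hat, 4))  # 9.6667, the sample mean
```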
To find the se, find the negative of the second derivative for each probability point. Sum these to get the observed information, and find the square root of the inverse to find the se. It is possible to have Excel compute the information, but this is not demonstrated on the spreadsheets.
The first example considered cases where the sampling units were the same size.
Consider the sample seen on the next page with different sized sampling units:
How should an estimate of density be found now? Simply taking the arith-
metic average seems silly as the sampling units are different sizes. Perhaps we
should standardize each observation to the same size of sampling unit and then
take the average? Or perhaps we would take the total number of dots divided
by the total sampled area? There is no obvious way to decide which is the better
method.
Using the Poisson model and MLE, the problem is not much more complex than the earlier example. Now the data come in pairs (Yi, Ai) where Yi is the number of points in the sampling unit which has area Ai.
As before, let µ be the density per unit area (i.e. a sample unit of size 1).
According to the properties of the Poisson distribution, the count in a larger sampling unit also follows a Poisson distribution, but the Poisson parameter must be adjusted for the area measured:

P(Yi | µ, Ai) = e^(−Ai µ) (Ai µ)^Yi / Yi!

We now proceed as before. The likelihood function for each point is:

L(µ | Yi, Ai) = e^(−Ai µ) (Ai µ)^Yi / Yi!
We find the MLE by finding the first derivative and setting to zero:

∂/∂µ log(L) = −Σ Ai + (Y1 + . . .)/µ = 0

which gives us:

µ̂ = (Y1 + . . .)/(A1 + . . .)

or the total observed objects divided by the total area observed.
The se is then

se = √( µ̂ / (A1 + A2 + . . .) )

which reduces to the earlier form if all the samples are of equal size.
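These formulas are easy to apply directly. The sketch below uses hypothetical (count, area) pairs, since the actual counts for the unequal-area figure are not reproduced in the text:

```python
import math

# Hypothetical (illustrative) pairs of (count, area) -- NOT the data in the notes
pairs = [(4, 1.0), (10, 2.5), (15, 3.0)]

total_y = sum(y for y, a in pairs)
total_a = sum(a for y, a in pairs)

mu_hat = total_y / total_a          # total objects / total sampled area
se = math.sqrt(mu_hat / total_a)    # se = sqrt(mu_hat / sum(A_i))
print(round(mu_hat, 3), round(se, 3))
```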
For many couples, it is a joyful moment when they decide to try to have children. However, not every couple becomes pregnant on the first attempt to conceive a child and it may take many months before the woman becomes pregnant.
Suppose the following data were collected from nine couples on the number of months PRIOR to becoming pregnant:

2; 6; 5; 0; 0; 4; 0; 3; 10+
where the value of 2 indicates that the couple became pregnant on the 3rd
month (i.e. there were two months where the pregnancy did not occur PRIOR
to becoming pregnant on the 3rd month). The value 10+ indicates that it took
longer than 10 months to get pregnant but the exact time is unknown because
the experiment terminated.
If the exact value were known for all couples (i.e. including the last couple),
then the sample average time (in months) PRIOR to becoming pregnant would
be a simple estimator for the average number of months PRIOR to becoming
pregnant, and the intuitive estimator for the probability of becoming pregnant
on each month would be 1/(1 + Ȳ). The extra ‘1’ in the denominator accounts for the fact that you become pregnant in the NEXT month after being unsuccessful for Y months. For example, if the average time PRIOR to becoming pregnant was 4 months, then the probability of becoming pregnant in each month is 1/(4 + 1) = 0.20.
But what should be done if you have incomplete data (the last couple)? This is an example of censored data where there is information, but it is not obvious how to use it. Just using the value of 10 for the last couple will lead to an underestimate of the average time to become pregnant and an overestimate of the probability of becoming pregnant in a month. You could substitute in a value for the last couple before computing an average, but there is no obvious choice for a value to use – should you use 11, 12, 15, 27, etc.?
The number of months PRIOR to becoming pregnant can be modelled with a geometric distribution, where p is the probability of becoming pregnant in a month:

P(Y = y | p) = (1 − p)^y × p

i.e. there are y “failures to get pregnant” followed by a “success”. For censored data, we add together the probability of becoming pregnant over all the months greater than or equal to the censored value:

P(Y ≥ y | p) = Σ_{i=y to ∞} (1 − p)^i × p = (1 − p)^y
We can now construct the likelihood function as the product of the individual terms:

L = ∏_{non-censored} (1 − p)^Yi p × ∏_{censored} (1 − p)^Yi
We take the first derivative and set to zero to find the point where the likelihood is maximized:

∂/∂p log(L) = Σ_{non-censored} [ −Yi/(1 − p) + 1/p ] + Σ_{censored} [ −Yi/(1 − p) ] = 0
The se is found in a similar fashion, i.e. find the second derivative of each contribution to the likelihood, and add them together to give a measure of information. Finally, take the square root of the inverse of the information value to give the se. We obtain se(p̂) = .066.
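Setting the derivative above to zero gives a closed form for p̂, and the observed information gives the se. The sketch below (illustrative, not part of the original notes) reproduces se(p̂) = .066 from the pregnancy data:

```python
import math

# Months PRIOR to pregnancy from the text; the 10+ value is censored
uncensored = [2, 6, 5, 0, 0, 4, 0, 3]
censored = [10]

# Setting the score to zero gives a closed form:
#   n_uncensored / p = (sum of all Y) / (1 - p)
n = len(uncensored)
total = sum(uncensored) + sum(censored)
p_hat = n / (n + total)                  # 8 / 38, about 0.21

# Observed information: sum of negative second derivatives
oi = n / p_hat**2 + total / (1 - p_hat)**2
se = 1 / math.sqrt(oi)
print(round(p_hat, 2), round(se, 3))  # 0.21 0.066
```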
You may be curious to know that the commonly “accepted” value for the
probability of becoming pregnant in a month when attempting to become preg-
nant is about 25%. This value was obtained using methods very similar to what
was shown above.
For information on the current “state of the art” for these types of studies,
see:
Notice how this differs from censoring. In censored data, the actual value
isn’t observed, but there is information on the actual value. For example, be-
cause we can’t see the back-seat of a car very well, if we see two occupants in the
front seat, we know the total number of occupants must be at least 2, i.e. could
be 2 or 3 or 4 etc. In a zero-truncated distribution, the truncated values are
simply not possible – for example, you would never see a car with 0 occupants
on the freeway!
For example, suppose a set of birds was marked and released in year 1 and
recaptures took place in years 2 and 3. The data consists of the capture history
of each bird expressed as a 3 digit vector. For example, the history 101 indicates
a bird that was marked and released in year 1, not seen in year 2, but recaptured
again in year 3. If birds are only marked and released in year 1, there are 4
possible capture histories that could occur: 100, 101, 110, or 111. [Because birds
are only marked in year 1, the capture histories 011, 010, 001 cannot occur.]
Suppose 1000 birds were marked and released with the following summary table of recaptures:

History   nhistory
100       501
101       140
110       250
111       109
How can these capture histories provide information about the population
dynamics?
The basic problem that needs to be overcome is the less-than-100% recapture of marked birds. For example, birds with history 101 were not recaptured in year 2 but must have been alive because they were recaptured alive in year 3. However, a bird with history 100 may have died between years 1 and 2, or survived to year 2 and just wasn’t recaptured, with a similar ambiguity about the status of the bird in year 3.
To begin, we need to define a probability model for the observed data. There are two parameters of interest. Let φ be the yearly survival rate, i.e. the probability that a bird alive in year i will be alive in year i + 1; and p be the yearly capture rate, i.e. the probability that a bird who is alive in year i will be recaptured in year i. For simplicity we will assume that the survival and recapture rates are constant over time.
Consider first birds with capture history 111. We know for sure that this
bird survived between year 1 and year 2, was recaptured in year 2, survived
again between year 2 and year 3, and was recaptured in year 3. Consequently
the probability of this history is found as:
P(111) = φp × φp
Similarly for history 101 we know the bird survived both years, but was only
recaptured in year 3. This gives:
P(101) = φ(1 − p) × φp
Notice the (1 − p) term to account for not being seen in year 2.
The probability of history 110 is more complex. We know the bird survived
between year 1 and year 2 and was seen in year 2. We don’t know the fate of the
bird after year 2. It either died between year 2 and year 3, or survived between year 2 and year 3 but wasn’t seen in year 3. This gives a probability:

P(110) = φp × [(1 − φ) + φ(1 − p)]

The probability of history 100 is more complex yet! The probability of this history is:

P(100) = (1 − φ) + φ(1 − p)(1 − φ) + φ(1 − p)φ(1 − p)

Can you explain the meaning of each of the three terms in the above expression?
Each and every marked bird released in year 1 MUST have one of the above capture histories. A common probability model for this type of experiment is a multinomial model, which is the extension of the binomial model for success/failure (2 outcomes) to this case where there are 4 possible outcomes for each bird. The likelihood function is constructed by taking the product of the probability of each history over all of the birds:

L = ∏_{all birds} P(history)
The capture-recapture tab in the Excel workbook has the sample computations and the Solver can easily maximize the log-likelihood function. The MLEs are φ̂ = .76 and p̂ = .45.
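The same maximization can be sketched with a brute-force grid search over (φ, p). (The notes use the Excel Solver; this grid-search code is merely an illustrative stdlib alternative.)

```python
import math

# Counts of each capture history from the table above
counts = {'111': 109, '110': 250, '101': 140, '100': 501}

def log_lik(phi, p):
    # Probabilities of the four possible capture histories
    p111 = phi * p * phi * p
    p101 = phi * (1 - p) * phi * p
    p110 = phi * p * ((1 - phi) + phi * (1 - p))
    p100 = (1 - phi) + phi * (1 - p) * (1 - phi) + (phi * (1 - p)) ** 2
    return (counts['111'] * math.log(p111) + counts['110'] * math.log(p110)
            + counts['101'] * math.log(p101) + counts['100'] * math.log(p100))

# Coarse grid search over a plausible range; a real analysis would use
# a proper optimizer (e.g. the Excel Solver or Program MARK)
phi_hat, p_hat = max(((f / 1000, q / 1000)
                      for f in range(500, 951) for q in range(200, 701)),
                     key=lambda fp: log_lik(*fp))
print(round(phi_hat, 2), round(p_hat, 2))  # 0.76 0.45
```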
1.1.10 Summary
MLE can deal with a wide range of data anomalies that are difficult to deal
with in any other way such as truncation or censoring. Usually observations
are assumed to be independent of each other which makes the joint likelihood
easy to compute, but likelihood methods can also deal with non-independent
observations.
The examples above were all for discrete data. Likelihood methods for con-
tinuous data (e.g. normal, log-normal, exponential distributions) are similar
with the density function being used in place of the probability function when
constructing the likelihood.
As noted above, likelihood methods may not perform well with small sample
sizes nor with latent (hidden) variables. In these cases, Bayesian methods may
be an alternative method for parameter estimation.
However, all of these methods are less than satisfactory for a number of reasons.
For these reasons, there has been a shift in emphasis in recent years from
finding the “best” model to integrating all models in making predictions.
Under the AIC paradigm, the analyst first starts with a candidate set of
models that are reasonable from a biological viewpoint. Then each model is
fit, and a summary measure that trades off goodness-of-fit and the number of
parameters (the AIC value) is computed for each model. The AIC values are
used to compute a model weight for each model that summarizes how much
weight should be applied to the results of this model in making predictions.
Finally, the predictions from each model are weighted by the AIC weight and
the resulting estimate and standard error incorporates both model uncertainty
and imprecision from sampling in each of the models.
The basic statistical tool for measuring fit of a model to data is the likelihood function

L(Y; θ)

or the logarithm of the likelihood function:

log(L(Y; θ))

where θ is the set of parameters used to describe the data Y. The estimates of the parameters that maximize the likelihood or log-likelihood are known as Maximum Likelihood Estimators (MLEs) and have nice statistical properties.
As more parameters are added to the model (e.g. more variables in a regression problem, or a time-varying capture rate is considered rather than a capture rate that is constant over time), the value of the likelihood function must increase as you can always get a better fit to data with more parameters. However, adding parameters also implies that the total information in the data must be split over more parameters, which gives estimates with worse precision, i.e. the standard errors of estimates get larger as more parameters are fit. Is the improvement in fit substantial enough to justify the additional parameters with the resulting loss in precision?
The AIC (Akaike’s Information Criterion) trades off model fit against the number of parameters:

AIC = −2 log(L) + 2K

where K is the number of parameters in the model and log() is the logarithm to the base e (natural logarithm). [The multiplier 2 is included for historical reasons.]
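A small numeric illustration of the formula (the −2 log(L) values here are illustrative, chosen so the two fits are nearly identical):

```python
def aic(neg2_log_lik, k):
    """AIC = -2 log(L) + 2K."""
    return neg2_log_lik + 2 * k

# Two hypothetical fits: a 3-parameter model vs a 4-parameter model
# whose fit (-2 log L) is virtually the same
m1 = aic(660.10, 3)   # about 666.1
m2 = aic(660.06, 4)   # about 668.1
print(m1, m2)  # the extra parameter costs roughly 2 AIC units
```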
The ‘optimal’ tradeoff between fit and the number of parameters occurs with
the smallest value of AIC among models in our model set.
The above equation for AIC can be modified slightly to account for small
sample sizes (leading to AICc ) or for a general lack of fit of any model (leading
to QAIC or QAICc ). These details are not explored in this overview, but the
same general principles are applicable.
While the ‘optimal’ model (among those in the model set) is the one with the lowest AIC, there may be several models that differ from the lowest by only a small amount. How much support is there for selecting one model over another? Notice the use of the word support, rather than statistical significance. Anderson and Burnham (2002) and Anderson (2008) recommend several rules of thumb to select among models, based on differences in AIC. The difference in AIC between a specific model and the best fitting model is denoted as ∆AIC. By definition, ∆AIC = 0 for the best fitting model. When the difference in AIC between 2 models (∆AIC) is less than 2 units, then one is reasonably safe in saying that both models have approximately equal support in the data. If 2 < ∆AIC < 7, then there is considerable support for a real difference between the models, and if ∆AIC > 7, then there is strong evidence to support the conclusion of differences between the models.
The estimates from the individual models can then be combined together using the AIC weights in a weighted average. For example, suppose there are R models in the candidate set, and let θ̂i represent the estimate from the ith model. Then the model averaged estimate is found as:

θ̂_avg = Σ_{i=1 to R} wi θ̂i
Buckland et al. (1997) also showed how to estimate a standard error for this averaged estimate that includes both the standard error from each of the candidate models (i.e. sampling uncertainty for each model), and the variation in the estimate among the candidate models (i.e. model uncertainty):

se_avg = Σ_{i=1 to R} wi √( se²(θ̂i) + (θ̂i − θ̂_avg)² )

The first component under the √ sign refers to the standard error of each estimate for a particular model; the second component refers to variation in the estimates around the model averaged estimate.
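A small numeric sketch of model averaging (the weights, estimates, and standard errors below are made up for illustration):

```python
import math

# Hypothetical estimates, standard errors, and AIC weights for R = 2 models
weights = [0.6, 0.4]
estimates = [0.5, 0.7]
ses = [0.1, 0.1]

# Weighted average of the estimates
theta_avg = sum(w * t for w, t in zip(weights, estimates))

# Buckland et al. (1997): weighted sum of sqrt(se_i^2 + (theta_i - theta_avg)^2)
se_avg = sum(w * math.sqrt(s**2 + (t - theta_avg)**2)
             for w, t, s in zip(weights, estimates, ses))
print(round(theta_avg, 2), round(se_avg, 3))
```

Note that the unconditional se is larger than either model-specific se because the spread of the estimates across models (model uncertainty) is included.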
What about statistical significance - where is the p-value for this model?
The point Anderson and Burnham (2002) and Anderson (2008) make at this
stage is to suggest that this sort of question represents misplaced focus. Instead,
they suggest we should place greater emphasis on the effect size (the magnitude
of the difference in estimates between models), than on significance levels. This
is analogous to the arguments in favor of using confidence intervals in lieu of
p-values in ecology.
The study was conducted over 7 years with both males and females marked.
Here is a sample of the raw data in capture-history format. There were a total
of 294 birds that were marked.
For example, the history 1101110 indicates the female bird was captured in year 1, had a band applied to its leg, and was released. It was recaptured (alive) in year 2, not seen in year 3, recaptured (alive) in years 4, 5, and 6, and not seen after year 6. However, the fate in year 7 is unknown – the bird either died between years 6 and 7, or it was alive in year 7 and not recaptured – there is no way to know exactly what happened.
The probability model to describe this data has 4 sets of parameters: a set
of yearly survival rates for the male birds, a set of yearly survival rates for
the female birds, a set of yearly recapture rates for males, and a set of yearly
recapture rates for females. There are several biological hypotheses that can be
examined in a series of models, and interest lies in which model best describes
the data with estimates of survival and recapture. For example, perhaps males
and females have the same yearly survival rates but the rates differ across years.
Or males and females have different survival rates, but the survival rate for each
sex is constant over time. Similarly, there are multiple models for the capture
rates.
Model           Interpretation
phi(t), p(g)    Survival is the same for males and females, but the common survival rate varies across time. Recapture rates are equal across time, but vary between the groups.
phi(g*t), p(g)  Survival is different for every combination of group (sex) and time, i.e. no two values are equal. Recapture rates are equal across time, but vary between the groups.
phi(g), p(g*t)  Survival is different for males and females (a group effect) but constant over time. Recapture rates are different for every combination of group (sex) and time.
phi(.), p(g*t)  Survival is the same for both males and females (no group effect) and constant over time. Recapture rates are different for every combination of group (sex) and time.
Model               Interpretation
phi(g*flood), p(g)  Survival is different between males and females, and each sex’s survival rate is equal across the non-flood (or flood) years, but differs between flood and non-flood years. Recapture rates are equal across time, but vary between the groups.
Models considered for the Dipper study
Model
Phi(t) p(t)
Phi(g*t) p(g*t)
Phi(g) p(g)
Phi(.) p(.)
Phi(Flood) p(.)
Phi(Flood) p(Flood)
Phi(.) p(g)
Phi(g) p(.)
Phi(.) p(Flood)
Phi(t) p(.)
Phi(t) p(g)
Phi(.) p(t)
Phi(g) p(t)
Phi(g*t) p(.)
Phi(g*t) p(g)
Phi(.) p(g*t)
Phi(t) p(g*t)
Phi(g*t) p(t)
Phi(g) p(g*t)
We showed earlier how the probability of each capture history can be written in terms of the parameters φ_gt and p_gt – you should select a few models to write out some of the histories. You can also get a sense for the models by constructing
a diagram showing the relationship among the parameters (known among users
of the computer program MARK as the PIMs). This will be demonstrated in
class.
Note that biologically speaking, only the model Phi(g*t) p(g*t) could possibly be true! For example, a phi(g) model would say that the survival rates differ between males and females but are equal across all times. It is logically impossible that the survival rate in year 1 would be equal to the survival rate in year 2 to 40 decimal places! So why would we fit such logically impossible models? The AIC paradigm says that ALL models are wrong, but some models may closely approximate reality. All else being equal, simpler models are preferred over more complex models because the uncertainty in the estimates must be smaller. But even the most complex model is likely wrong, because it (implicitly) assumes that all birds of the same sex have the same survival rate in each year, which likely isn’t true due to innate differences in fitness among individuals. The real world is infinitely complex and can’t possibly be captured by simple models, but we hope that our models are close approximations.
All of the models were fit using Maximum Likelihood using Program MARK. A summary of the results is presented below:
Model                    -2log(L)  Num. Par  AICc    Delta AICc  AICc Weights  Relative Likelihood
{Phi(Flood) p(.) } 660.10 3 666.16 0.00 0.61 1.00
{Phi(Flood) p(Flood) } 660.06 4 668.16 2.00 0.23 0.37
{Phi(.) p(.) } 666.84 2 670.87 4.71 0.06 0.10
{Phi(.) p(g) } 666.19 3 672.25 6.09 0.03 0.05
{Phi(g) p(.) } 666.68 3 672.73 6.57 0.02 0.04
{Phi(.) p(Flood) } 666.82 3 672.88 6.72 0.02 0.03
{Phi(t) p(.) } 659.73 7 674.00 7.84 0.01 0.02
{Phi(g) p(g) } 666.15 4 674.25 8.09 0.01 0.02
{Phi(t) p(g) } 659.16 8 675.50 9.34 0.01 0.01
{Phi(.) p(t) } 664.48 7 678.75 12.59 0.00 0.00
{Phi(t) p(t) } 656.95 11 679.59 13.43 0.00 0.00
{Phi(g) p(t) } 664.30 8 680.65 14.49 0.00 0.00
{Phi(g*t) p(.) } 658.24 13 685.12 18.96 0.00 0.00
{Phi(g*t) p(g) } 657.90 14 686.92 20.76 0.00 0.00
{Phi(.) p(g*t) } 662.25 13 689.13 22.97 0.00 0.00
{Phi(t) p(g*t) } 654.53 17 690.03 23.87 0.00 0.00
{Phi(g*t) p(t) } 655.47 17 690.97 24.81 0.00 0.00
{Phi(g) p(g*t) } 662.25 14 691.27 25.11 0.00 0.00
{Phi(g*t) p(g*t) } 653.95 22 700.46 34.30 0.00 0.00
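The AICc Weights and Relative Likelihood columns can be reproduced from the Delta AICc column using the standard formula w_i = exp(−∆_i/2) / Σ_j exp(−∆_j/2) (this is Burnham and Anderson’s weight formula; it is not written out explicitly in this chapter):

```python
import math

# Delta AICc values from the table above (same order as the rows)
deltas = [0.00, 2.00, 4.71, 6.09, 6.57, 6.72, 7.84, 8.09, 9.34, 12.59,
          13.43, 14.49, 18.96, 20.76, 22.97, 23.87, 24.81, 25.11, 34.30]

# Relative likelihood of each model: exp(-delta/2); weights normalize these
rel = [math.exp(-d / 2) for d in deltas]
weights = [r / sum(rel) for r in rel]
print([round(w, 2) for w in weights[:3]])  # [0.61, 0.23, 0.06]
```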
First consider the number of parameters for each model. For model {phi(.)p(.)}, there are two parameters – the survival rate (φ) that is common to both sexes and equal across all years, and a similar parameter for the recapture rates. For model {phi(.)p(Flood)}, there are three parameters: the common survival rate, and the two recapture rates for flood vs. non-flood years, each of which is the same across both sexes and the respective years. For model {phi(t), p(.)}, there are 7 parameters: 6 survival rates (between years 1 and 2, 2 and 3, . . ., 6 and 7) that are common to both sexes, and a single recapture rate that is common to both sexes and all years.
As models become more complex (i.e. more parameters), they must fit better: the log-likelihood must increase. But the table (for historical reasons) reports −2log(L), which implies that −2log(L) decreases as models get more complex. But as models become more complex (with more parameters), the same amount of information is split among more and more estimates, leading to estimates with worse precision (i.e. larger standard errors). For example, compare the estimates from models with and without the effects of the flood years on the survival rates. Notice that the se of the survival rate estimates in the simpler model are smaller (i.e. more precise) than in the more complex model.
In general, models with a small AICc are “better” in the tradeoff between model fit and model complexity. The AICc (the regular AIC corrected for small sample sizes) incorporates both model complexity (number of parameters) and model fit (likelihood). Look at the AICc for the top two models. The fit is almost the same (−2log(L) is virtually identical), but the second model has one more parameter. Consequently, the AICc “penalizes” the second model for being unnecessarily complex: its −2log(L) is virtually the same, but the 2× the number of parameters term (after the small sample adjustment) results in almost a 2 unit increase in the AICc.
The Delta AICc column measures how much worse the other models in the model set are, relative to the “best” model in the model set. A rule of thumb is that differences of less than 3 or 4 units indicate two models of essentially equivalent fit. The reason for the “essentially equivalent” fit is that the model fits (and rankings) would change if the experiment were to be repeated (as any statistics computed on data must). Accordingly, the first 2 models are not really distinguishable.
The AICc Weights column translates the Delta AICc values into the relative support for each model, relative to the models in the model set. It indicates that most of the weight should be placed on the first two models, with some minor weighting placed on the next 3 or 4 models, but virtually no weight on the remaining models.
In the AIC paradigm there are NO p-values for choosing between models, and NO model selection trying to find the single best model. The AIC paradigm recognizes that ALL of the models presented here are likely wrong (even the most complex model assumes that all animals have the same survival rate within a sex-year combination, which is likely untrue because different animals have different “fitness”). Consequently, trying to select the single “best” model is a fool’s paradise! So “testing” if there is a flood effect isn’t done – after all, do you really expect absolutely NO effect (to 40 decimal places) of flooding on survival rates?
First you need to understand how you can get a survival rate for males in
year 1 from each of the models when sex and/or year does not appear as a model
effect. Consider the first model, where there is a flood effect on survival but no
sex effect. Then the survival rate in year 1 for males will be the estimated
survival rate for non-flood years (which is common between the sexes). In the
model with phi(.), the estimate comes from the estimated survival rate that
is common among all years and both sexes, etc.
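The mapping from a (sex, year) combination to a model parameter can be made concrete with a small sketch. The function names, parameter labels, and survival values below are all invented for illustration; they are not the estimates from the example above.

```python
# Hypothetical sketch: each model maps a (sex, year) combination onto one of
# that model's own parameters. All values here are invented for illustration.
FLOOD_YEARS = {2}  # suppose the flood occurred between years 2 and 3

def phi_dot(sex, year, params):
    # phi(.): a single survival rate shared by all years and both sexes
    return params["phi"]

def phi_flood(sex, year, params):
    # phi(flood): a flood effect but no sex effect, so only two rates total
    key = "phi_flood" if year in FLOOD_YEARS else "phi_nonflood"
    return params[key]

# Under phi(flood), the "male, year 1" estimate equals the "female, year 1"
# estimate, because sex does not appear in the model:
p = {"phi_nonflood": 0.65, "phi_flood": 0.40}
male_y1 = phi_flood("M", 1, p)
female_y1 = phi_flood("F", 1, p)
```

This is why simpler models can still supply an estimate for every sex-year combination: the combination is simply routed to whichever shared parameter the model structure dictates.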
Notice that the estimated survival rates for males between year 1 and year
2 vary considerably among models, as do the standard errors. The AIC
paradigm constructs a weighted average of the estimates from each model, which
is reported at the bottom of the page. Because the model weights for the first
few models total close to 100%, the final estimate must be close to the estimates
from these models.
The model averaged se is not a simple weighted average (see the previous
notes) but follows the same philosophy: models with higher AICc weights
contribute more to the model averaged values. The Unconditional SE line adds
an additional source of uncertainty, that from the models themselves. Notice
that the estimates vary considerably about the model averaged value of .60.
The effect of the different models (after weighting by the AICc weight) is also
incorporated. In this case, the models with vastly different estimates have little
weight, so the unconditional se is very similar to (but slightly larger than) the
simple model averaged se.
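The model averaging just described can be sketched directly from the standard Burnham and Anderson formulas: the model-averaged estimate is the weight-weighted mean, and the unconditional SE adds the between-model spread to each model's own sampling variance. The estimates, standard errors, and weights below are invented for illustration, not the values from the tables above.

```python
import math

def model_average(estimates, ses, weights):
    """Model-averaged estimate and unconditional SE:
       theta_bar = sum_i w_i * theta_i
       se_uncond = sum_i w_i * sqrt(se_i^2 + (theta_i - theta_bar)^2)
    """
    theta_bar = sum(w * t for w, t in zip(weights, estimates))
    se_uncond = sum(
        w * math.sqrt(se ** 2 + (t - theta_bar) ** 2)
        for w, se, t in zip(weights, ses, estimates)
    )
    return theta_bar, se_uncond

# Three models: the two well-supported ones agree, the third is an outlier
# with almost no weight (all numbers invented for illustration).
est = [0.61, 0.59, 0.30]
se = [0.05, 0.05, 0.10]
w = [0.55, 0.40, 0.05]
theta_bar, se_uncond = model_average(est, se, w)
```

Because the outlying model carries only 5% of the weight, the unconditional SE ends up only slightly larger than the simple weighted average of the individual standard errors, mirroring the behaviour described above.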
Now consider the model averaged survival rates for males between year 2
and 3 (a flood year).
Some models give the same estimated survival between year 1 and year 2
(non-flood, previous table) and between year 2 and year 3 (flood). Why? The
model averaging continues in the same way. Notice that the unconditional se is
now much larger than the conditional se because there is more variation among
the top models in the estimated survival rates.
The model averaged values for males and females can be computed for all
time periods:
Note the use of “apparent survival” because a bird that dies or permanently
leaves the study area cannot be distinguished. Also notice the odd results for
the apparent survival rate between years 6 and 7. The se and confidence intervals
are very large because this parameter cannot be estimated in some models (see
above). When using model averaging, you must be careful to average only
estimates that are comparable and identifiable in all models.
The final model averaged values for males and females can be plotted (along
with a 95% confidence interval):
Along with the model fitting above, it is important to conduct thorough model
assessment to ensure that even the best fitting model is reasonably sensible.
This is not covered in this brief review – refer to the vast literature on capture-
recapture models for assistance.
When using the AIC paradigm it is important to specify the model set
BEFORE the analysis begins to avoid data dredging, and the model set should
be comprehensive to include all models of biological interest. Nevertheless, it is
important not to simply throw in all possible models and mechanically use this
procedure – each model in the model set should have a biological justification.
The program MARK has an extensive implementation of the AIC paradigm
for use in capture-recapture studies. The AIC paradigm is the accepted standard
for publishing work that involves capture-recapture experiments. If you submit
a manuscript involving the use of capture-recapture methods and do not use
AIC, it will likely be returned unread and unreviewed.
This paper illustrates the use of AIC methods in modern wildlife manage-
ment research. In this article the authors use capture-recapture methods to
study the survival of juvenile pygmy rabbits in east-central Idaho, US. In their
study, they attached radio-tags to newly born rabbits and then followed rabbits
every 3 or 4 days to see if the rabbit was alive or dead.
The known-fate model that is fit takes into account that each animal is
detected with 100% probability (because of the radio tracking) and that, if the
animal dies, the time of death is known (to within 3 days). Consequently, the
only unknown parameters are the weekly survival rates. This differs from
many capture-recapture experiments where detectability is less than 100% and
you must estimate both survival and detection.
The survival rate for the first week is .94 = 47/50. This means that there
were 47 animals alive at the start of week 2. However, only 45 could be located (2
radios could have died). The animals with radio tags that could not be located
at the start of week 2 are censored; their fate is unknown. The KM method
computes the survival rate over week 2 as 42/45. The cumulative survival rate
over the first 2 weeks is computed as (47/50) × (42/45). Because 3 of the 45
located animals died by the end of week 2, there are 42 animals alive; all radios
were located at the start of week 3, so the number at risk in week 3 is 42. The
cumulative survival to the end of week 3 is computed as (47/50) × (42/45) ×
(40/42). The table and the KM estimates can be extended to the end of the
study in a similar fashion.
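The KM product over the weeks can be sketched directly from the counts in the text: in each week, divide the number surviving by the number located and at risk, and multiply the weekly ratios together. The counts below (50, 47, 45, 42, 40) are the ones given above.

```python
def km_cumulative(at_risk, survived):
    """Kaplan-Meier cumulative survival: the product over weeks of
    (number surviving the week) / (number located and at risk that week).
    Censored animals simply drop out of the at-risk count."""
    s = 1.0
    for n, a in zip(at_risk, survived):
        s *= a / n
    return s

# Week 1: 50 tagged, 47 survive. Week 2: only 45 located (2 censored),
# 42 survive. Week 3: 42 at risk, 40 survive.
s3 = km_cumulative([50, 45, 42], [47, 42, 40])  # (47/50) * (42/45) * (40/42)
```

Note that censoring affects only the denominator of the week in which the animal disappears; the product itself is unchanged by animals that are never at risk.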
There were several potential predictor variables for the weekly survival rates
as outlined in the METHODS section. Based on biologically reasonable grounds,
a set of 14 a priori models was constructed (the model set). Examine the model
set in Table 1 – be sure you understand the differences among the models and
the biological interpretation of the models. In particular, what does a
YEAR × AREA model mean? What does the Constant survival model mean?
What does the model with the effect of BORN mean?
ML estimates were found for each model, and AIC was used to rank the
relative fit and complexity for the models in the model set (Table 1). Be sure
you understand the number of parameters in each model.
What is meant by the sentence in the report “A set of 9 models was included
in the top model set . . . indicated relatively high model uncertainty.” (first
paragraph under Figure 1)?
Understand how Table 2 was computed and how to interpret the table.
An understanding of how the table was computed will have to be conceptual
because the authors have omitted many details from the paper such as the
fact that survival was modelled on the logit scale, the covariate for BORN was
standardized automatically by MARK etc. As such, the actual numbers in Table
2 are pretty much useless for actual hand computations (!).
Look at Figure 3. The bars suggest that the effect of year is about the same
in both areas except translated upwards. What “model” does this suggest? Did
this model rank high in Table 1? [Don’t forget to take into account the size of
the se shown in the plot.]
Notice that there are NO p-values in the entire paper, and every estimate
has an associated measure of precision (a se).
1.3 References
Anderson, D.R. (2008). Model Based Inference in the Life Sciences: A Primer
on Evidence. Springer, New York.
Lebreton, J.-D., Burnham, K.P., Clobert, J. & Anderson, D.R. (1992). Modeling
survival and testing biological hypotheses using marked animals: a unified
approach with case studies. Ecological Monographs, 62, 67-118.