C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca
Chapter 1

Maximum Likelihood Estimation & AIC and model selection - a primer
• in large samples is fully efficient, i.e. extracts all the information possible
from data.
• provides a standard way to obtain estimates of precision (se) for any data
collection scheme along with a way to estimate confidence intervals.
However, MLE is not a panacea and there are several “problems” with MLE
that may require other methods (such as Bayesian methods):
• The properties of MLE in small samples may be neither optimal nor exact.
The first step in using MLE is to define a probability model for your data. This is
crucial because if your probability model is wrong, then all further inference will
also be incorrect. Statisticians recognize that all probability models are wrong,
but hope that they are close enough to reality to be useful. Consequently, an important part of MLE is model assessment – how well does the model fit the data? This won’t be covered in this primer – contact me for details.
For example, consider the problem of estimating the density of objects (grey
dots) in a region as shown on the next page
© 2010 Carl James Schwarz
Of course in real life the locations of all the objects would be unknown
because if the locations (and number of objects) were known, then it would be a
simple matter to count all of the dots and divide by the study area to obtain the
density value! Consequently, three small areas were randomly selected from the area (how would this be done?) and the number of objects within the sampling quadrats was counted as shown on the next page:
This gives three counts, 4, 10, and 15. Note that the sampling protocol has
to be explicit on what to do with objects that intersect the boundary of the
sampling frame – this will not be discussed further in these notes.
A MLE approach would start with a model for the data. In many cases,
counts of objects follow a Poisson distribution. This would be true if objects
occurred “at random” on the landscape. It is not always true – for example,
objects could tend to fall in clusters (new plants growing from seeds from a
parent plant), or objects could be more spread out (competition for resources)
than random. In both cases, a more complex probability model could be fit. If
a simple Poisson model is used, an important part of the MLE process is model
assessment as noted earlier.
A Poisson distribution depends upon the parameter µ which (in this case) is
the mean number of dots per sampling unit or the true density in the population.
Let Y represent the random variable for the number of objects found in each
sample. If the value of µ is known, then the probability of seeing Y = y objects
in a sample follows the Poisson probability law:

P(Y = y | µ) = e^(−µ) µ^y / y!

For example, if µ = 1, then P(Y = 2 | µ = 1) = e^(−1) · 1^2 / 2! = .184.
Of course, the value of µ is NOT known and this is the parameter that we
would like to estimate from our sample of size 3.
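As a quick check of the arithmetic, the Poisson probability above can be computed in a few lines of code. (Python is not used in these notes; this sketch is purely illustrative.)

```python
import math

def poisson_prob(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y! for the Poisson distribution."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# The worked example from the text: mu = 1, y = 2
p = poisson_prob(2, 1.0)
print(round(p, 3))  # 0.184
```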
Note that the probability function sums to 1 over the possible values of Y. The likelihood function is a function of µ given the data and does NOT sum to 1 over the possible values of µ. [Note that while the possible values of Y were 0, 1, 2, 3, . . ., the possible values of µ are the non-negative real line, i.e. 0 ≤ µ.]
So if our sample values were 4, 10 and 15, the likelihood function is:

L(µ | 4, 10, 15) = [e^(−µ) µ^4 / 4!] × [e^(−µ) µ^10 / 10!] × [e^(−µ) µ^15 / 15!]
Fisher argued that a sensible estimate for µ would be the value that MAXI-
MIZES the LIKELIHOOD (that is why this procedure is called Maximum Like-
lihood). This intuitively says that the best guess for the unknown parameter
maximizes the joint probability of the data and is the most consistent with the
data.
The MLE could be found using graphical methods, i.e. plot L as a function
of µ and find the value of µ that maximizes the curve as shown below:
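A crude version of this graphical search is to evaluate the log-likelihood on a fine grid of µ values and keep the best one. A sketch (illustrative only, using the counts 4, 10 and 15):

```python
import math

counts = [4, 10, 15]

def log_likelihood(mu, data):
    # Sum of Poisson log-probabilities: -mu + y*log(mu) - log(y!)
    return sum(-mu + y * math.log(mu) - math.lgamma(y + 1) for y in data)

# Evaluate on a fine grid of candidate mu values and keep the best
grid = [0.01 * i for i in range(1, 3001)]  # 0.01 .. 30.00
best_mu = max(grid, key=lambda m: log_likelihood(m, counts))
print(best_mu)  # close to the sample mean 9.67
```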
Recall from calculus that the maximum of a function often occurs where the derivative (the slope of the curve) is zero. As the function increases toward the maximum, the derivative is positive; as the function decreases away from the maximum, the slope is negative. The slope switches from positive to negative at the maximum.
It is easier to work with the logarithm of the likelihood: the logarithm converts the product of terms into a sum, and the value of µ that maximizes log(L) also maximizes L, which simplifies the problem:

log(L) = log(L(µ | Y1, Y2, Y3)) = [−µ + Y1 log(µ) − log(Y1!)] + [−µ + Y2 log(µ) − log(Y2!)] + [−µ + Y3 log(µ) − log(Y3!)]
       = −3µ + (Y1 + Y2 + Y3) log(µ) − log(Y1!) − log(Y2!) − log(Y3!)
To find the maximum of log(L), find the first derivative with respect to µ, equate to 0, and solve for µ:

∂/∂µ log(L) = −3 + (Y1 + Y2 + Y3)/µ = 0

or (after rearranging the above) we have

µ̂ = (Y1 + Y2 + Y3)/3 = Ȳ
or the sample average. The circumflex over the parameter is the usual convention to indicate that the value is an estimator derived from data rather than a known parameter value, i.e. µ̂ is a best guess for µ.
Whew! This seems like a lot of work to get an “obvious” result, but the advantages of MLE will become clearer in future examples.
So in our example, µ̂ = (4 + 10 + 15)/3 = 9.67.
This doesn’t seem too helpful, as it depends upon the value of µ, so the
observed information is found by substituting in the value of the MLE:
OI = (Y1 + Y2 + Y3)/µ̂^2

which after some arithmetic gives

OI = 3/µ̂
In our case, we have se = √(9.67)/√3 = 1.80.
This looks a bit odd compared to the familiar se(Ȳ) = s/√n for samples taken from a simple random sample. However, one of the properties of the Poisson distribution is that the standard deviation of the data is equal to √µ, so in fact the se of the MLE is not that different.
Once the se is found, then confidence intervals can be found in the usual way, i.e. an approximate 95% confidence interval is µ̂ ± 2se. There are other ways to find confidence intervals using the likelihood function directly (called profile intervals) but these are not covered in this review.
Don’t forget that model assessment must be performed to ensure that our choice of a Poisson distribution is a sensible probability model!
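The se and confidence interval calculations above can be sketched in a few lines (illustrative code, not part of the original notes):

```python
import math

counts = [4, 10, 15]
mu_hat = sum(counts) / len(counts)   # MLE = sample mean = 9.67

# Observed information for the Poisson model: OI = 3 / mu_hat
oi = len(counts) / mu_hat
se = 1 / math.sqrt(oi)               # = sqrt(mu_hat / 3), approx 1.80

# Approximate 95% confidence interval: mu_hat +/- 2 se
ci = (mu_hat - 2 * se, mu_hat + 2 * se)
print(round(se, 2), [round(x, 2) for x in ci])
```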
In more complex problems, a closed-form solution for the MLE may not exist. However, finding the MLE, the information matrix, and the se can be done using numerical methods. For example, consider the PoissonEqual tab in the Excel workbook available at: http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls and a portion reproduced below. While Excel is NOT the best tool for general maximum likelihood work, it will suffice for our simple examples.
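For readers without Excel, the same maximization can be done by solving the score equation −3 + (Y1 + Y2 + Y3)/µ = 0 numerically. The bisection sketch below is one stdlib-only alternative to the Solver (illustrative, not part of the original notes):

```python
counts = [4, 10, 15]

def score(mu):
    # First derivative of the Poisson log-likelihood
    return -len(counts) + sum(counts) / mu

# Bisection: the score is positive below the MLE and negative above it
lo, hi = 1e-6, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid

mu_hat = (lo + hi) / 2
print(round(mu_hat, 4))  # 9.6667, the sample mean
```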
To find the se, find the negative of the second derivative for each probability point. Sum these to get the observed information, and find the square root of the inverse to find the se. It is possible to have Excel compute the information, but this is not demonstrated on the spreadsheets.
The first example considered cases where the sampling units were the same size.
Consider the sample seen on the next page with different sized sampling units:
How should an estimate of density be found now? Simply taking the arith-
metic average seems silly as the sampling units are different sizes. Perhaps we
should standardize each observation to the same size of sampling unit and then
take the average? Or perhaps we would take the total number of dots divided
by the total sampled area? There is no obvious way to decide which is the better
method.
Using the Poisson model and MLE, the problem is not much more complex than the earlier example. Now the data come in pairs (Yi, Ai) where Yi is the number of points in the sampling unit which has area Ai.
As before, let µ be the density per unit area (i.e. a sample unit of size 1).
According to the properties of the Poisson distribution, the count in a larger sampling unit also follows a Poisson distribution, but the Poisson parameter must be adjusted for the area measured:

P(Yi | µ, Ai) = e^(−Ai µ) (Ai µ)^Yi / Yi!

We now proceed as before. The likelihood function for each point is:

L(µ | Yi, Ai) = e^(−Ai µ) (Ai µ)^Yi / Yi!
We find the MLE by finding the first derivative and setting to zero:

∂/∂µ log(L) = −Σ Ai + (Y1 + . . .)/µ = 0

which gives us:

µ̂ = (Y1 + . . .)/(A1 + . . .)

or the total observed objects divided by the total area observed.
The se is then

se = √( µ̂ / (A1 + A2 + . . .) )

which reduces to the earlier form if all the samples are of equal size.
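These formulas are easy to apply directly. The sketch below uses hypothetical (count, area) pairs, since the actual counts for the unequal-area figure are not reproduced in the text:

```python
import math

# Hypothetical (illustrative) pairs of (count, area) -- NOT the data in the notes
pairs = [(4, 1.0), (10, 2.5), (15, 3.0)]

total_y = sum(y for y, a in pairs)
total_a = sum(a for y, a in pairs)

mu_hat = total_y / total_a          # total objects / total sampled area
se = math.sqrt(mu_hat / total_a)    # se = sqrt(mu_hat / sum(A_i))
print(round(mu_hat, 3), round(se, 3))
```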
For many couples, it is a joyful moment when they decide to try to have children. However, not every couple becomes pregnant on the first attempt to conceive a child and it may take many months before the woman becomes pregnant.
Suppose the following data were collected from nine couples on the number of months PRIOR to becoming pregnant:

2; 6; 5; 0; 0; 4; 0; 3; 10+
where the value of 2 indicates that the couple became pregnant on the 3rd
month (i.e. there were two months where the pregnancy did not occur PRIOR
to becoming pregnant on the 3rd month). The value 10+ indicates that it took
longer than 10 months to get pregnant but the exact time is unknown because
the experiment terminated.
If the exact value were known for all couples (i.e. including the last couple),
then the sample average time (in months) PRIOR to becoming pregnant would
be a simple estimator for the average number of months PRIOR to becoming
pregnant, and the intuitive estimator for the probability of becoming pregnant
on each month would be 1/(1 + Ȳ). The extra ‘1’ in the denominator accounts for the fact that you become pregnant in the NEXT month after being unsuccessful for Y months. For example, if the average time PRIOR to becoming pregnant was 4 months, then the probability of becoming pregnant in each month is 1/(4 + 1) = 0.20.
But what should be done if you have incomplete data (the last couple)? This is an example of censored data where there is information, but it is not obvious how to use it. Just using the value of 10 for the last couple will lead to an underestimate of the average time to become pregnant and an overestimate of the probability of becoming pregnant in a month. You could substitute in a value for the last couple before computing an average, but there is no obvious choice for a value to use – should you use 11, 12, 15, 27, etc.?
The number of months PRIOR to becoming pregnant can be modelled with a geometric distribution, where p is the probability of becoming pregnant in a month:

P(Y = y | p) = (1 − p)^y × p

i.e. there are y “failures to get pregnant” followed by a “success”. For censored data, we add together the probability of becoming pregnant over all the months greater than or equal to the censored value:

P(Y ≥ y | p) = Σ_{i=y to ∞} (1 − p)^i × p = (1 − p)^y
We can now construct the likelihood function as the product of the individual terms:

L = ∏_{non-censored} (1 − p)^Yi p × ∏_{censored} (1 − p)^Yi
We take the first derivative and set to zero to find the point where the likelihood is maximized:

∂/∂p log(L) = Σ_{non-censored} [ −Yi/(1 − p) + 1/p ] + Σ_{censored} [ −Yi/(1 − p) ] = 0
The se is found in a similar fashion, i.e. find the second derivative of each contribution to the likelihood, and add them together to give a measure of information. Finally, take the square root of the inverse of the information value to give the se. We obtain se(p̂) = .066.
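Setting the derivative above to zero gives a closed form for p̂, and the observed information gives the se. The sketch below (illustrative, not part of the original notes) reproduces se(p̂) = .066 from the pregnancy data:

```python
import math

# Months PRIOR to pregnancy from the text; the 10+ value is censored
uncensored = [2, 6, 5, 0, 0, 4, 0, 3]
censored = [10]

# Setting the score to zero gives a closed form:
#   n_uncensored / p = (sum of all Y) / (1 - p)
n = len(uncensored)
total = sum(uncensored) + sum(censored)
p_hat = n / (n + total)                  # 8 / 38, about 0.21

# Observed information: sum of negative second derivatives
oi = n / p_hat**2 + total / (1 - p_hat)**2
se = 1 / math.sqrt(oi)
print(round(p_hat, 2), round(se, 3))  # 0.21 0.066
```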
You may be curious to know that the commonly “accepted” value for the
probability of becoming pregnant in a month when attempting to become preg-
nant is about 25%. This value was obtained using methods very similar to what
was shown above.
For information on the current “state of the art” for these types of studies,
see:
Notice how this differs from censoring. In censored data, the actual value
isn’t observed, but there is information on the actual value. For example, be-
cause we can’t see the back-seat of a car very well, if we see two occupants in the
front seat, we know the total number of occupants must be at least 2, i.e. could
be 2 or 3 or 4 etc. In a zero-truncated distribution, the truncated values are
simply not possible – for example, you would never see a car with 0 occupants
on the freeway!
For example, suppose a set of birds was marked and released in year 1 and
recaptures took place in years 2 and 3. The data consists of the capture history
of each bird expressed as a 3 digit vector. For example, the history 101 indicates
a bird that was marked and released in year 1, not seen in year 2, but recaptured
again in year 3. If birds are only marked and released in year 1, there are 4
possible capture histories that could occur: 100, 101, 110, or 111. [Because birds
are only marked in year 1, the capture histories 011, 010, 001 cannot occur.]
Suppose 1000 birds were marked and released with the following summary table of recaptures:

History   nhistory
100       501
101       140
110       250
111       109
How can these capture histories provide information about the population
dynamics?
The basic problem that needs to be overcome is the less-than-100% recapture of marked birds. For example, birds with history 101 were not recaptured in year 2 but must have been alive because they were recaptured alive in year 3. However, a bird with history 100 may have died between years 1 and 2, or survived to year 2 and just wasn’t recaptured, with a similar ambiguity about the status of the bird in year 3.
To begin, we need to define a probability model for the observed data. There are two parameters of interest. Let φ be the yearly survival rate, i.e. the probability that a bird alive in year i will be alive in year i + 1; and p be the yearly capture rate, i.e. the probability that a bird who is alive in year i will be recaptured in year i. For simplicity we will assume that the survival and recapture rates are constant over time.
Consider first birds with capture history 111. We know for sure that this
bird survived between year 1 and year 2, was recaptured in year 2, survived
again between year 2 and year 3, and was recaptured in year 3. Consequently
the probability of this history is found as:
P(111) = φp × φp
Similarly for history 101 we know the bird survived both years, but was only
recaptured in year 3. This gives:
P(101) = φ(1 − p) × φp
Notice the (1 − p) term to account for not being seen in year 2.
The probability of history 110 is more complex. We know the bird survived
between year 1 and year 2 and was seen in year 2. We don’t know the fate of the
bird after year 2. It either died between year 2 and year 3, or survived between year 2 and year 3 but wasn’t seen in year 3. This gives a probability:

P(110) = φp × [(1 − φ) + φ(1 − p)]

The probability of history 100 is more complex yet! The probability of this history is:

P(100) = (1 − φ) + φ(1 − p)(1 − φ) + φ(1 − p)φ(1 − p)

Can you explain the meaning of each of the three terms in the above expression?
Each and every marked bird released in year 1 MUST have one of the above capture histories. A common probability model for this type of experiment is a multinomial model, which is the extension of the binomial model for success/failure (2 outcomes) to this case where there are 4 possible outcomes for each bird. The likelihood function is constructed by taking the product of the probability of each history over all of the birds:

L = ∏_{all birds} P(history)
The capture-recapture tab in the Excel workbook has the sample computations and the Solver can easily maximize the log-likelihood function. The MLEs are φ̂ = .76 and p̂ = .45.
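The same maximization can be sketched with a brute-force grid search over (φ, p). (The notes use the Excel Solver; this grid-search code is merely an illustrative stdlib alternative.)

```python
import math

# Counts of each capture history from the table above
counts = {'111': 109, '110': 250, '101': 140, '100': 501}

def log_lik(phi, p):
    # Probabilities of the four possible capture histories
    p111 = phi * p * phi * p
    p101 = phi * (1 - p) * phi * p
    p110 = phi * p * ((1 - phi) + phi * (1 - p))
    p100 = (1 - phi) + phi * (1 - p) * (1 - phi) + (phi * (1 - p)) ** 2
    return (counts['111'] * math.log(p111) + counts['110'] * math.log(p110)
            + counts['101'] * math.log(p101) + counts['100'] * math.log(p100))

# Coarse grid search over a plausible range; a real analysis would use
# a proper optimizer (e.g. the Excel Solver or Program MARK)
phi_hat, p_hat = max(((f / 1000, q / 1000)
                      for f in range(500, 951) for q in range(200, 701)),
                     key=lambda fp: log_lik(*fp))
print(round(phi_hat, 2), round(p_hat, 2))  # 0.76 0.45
```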
1.1.10 Summary
MLE can deal with a wide range of data anomalies that are difficult to deal
with in any other way such as truncation or censoring. Usually observations
are assumed to be independent of each other which makes the joint likelihood
easy to compute, but likelihood methods can also deal with non-independent
observations.
The examples above were all for discrete data. Likelihood methods for con-
tinuous data (e.g. normal, log-normal, exponential distributions) are similar
with the density function being used in place of the probability function when
constructing the likelihood.
As noted above, likelihood methods may not perform well with small sample
sizes nor with latent (hidden) variables. In these cases, Bayesian methods may
be an alternative method for parameter estimation.
However, all of these methods are less than satisfactory for a number of reasons.
For these reasons, there has been a shift in emphasis in recent years from
finding the “best” model to integrating all models in making predictions.
Under the AIC paradigm, the analyst first starts with a candidate set of
models that are reasonable from a biological viewpoint. Then each model is
fit, and a summary measure that trades off goodness-of-fit and the number of
parameters (the AIC value) is computed for each model. The AIC values are
used to compute a model weight for each model that summarizes how much
weight should be applied to the results of this model in making predictions.
Finally, the predictions from each model are weighted by the AIC weight and
the resulting estimate and standard error incorporates both model uncertainty
and imprecision from sampling in each of the models.
The basic statistical tool for measuring fit of a model to data is the likelihood function

L(Y; θ)

or the logarithm of the likelihood function:

log(L(Y; θ))

where θ is the set of parameters used to describe the data Y. The estimates of the parameters that maximize the likelihood or log-likelihood are known as Maximum Likelihood Estimators (MLEs) and have nice statistical properties.
As more parameters are added to the model (e.g. more variables in a regression problem, or a time-varying capture rate is considered rather than a capture rate that is constant over time), the value of the likelihood function must increase as you can always get a better fit to data with more parameters. However, adding parameters also implies that the total information in the data must be split over more parameters, which gives estimates with worse precision, i.e. the standard errors of estimates get larger as more parameters are fit. Is the improvement in fit substantial enough to justify the additional parameters with the resulting loss in precision?
The AIC (Akaike’s Information Criterion) trades off model fit against the number of parameters:

AIC = −2 log(L) + 2K

where K is the number of parameters in the model and log() is the logarithm to the base e (natural logarithm). [The multiplier 2 is included for historical reasons.]
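A small numeric illustration of the formula (the −2 log(L) values here are illustrative, chosen so the two fits are nearly identical):

```python
def aic(neg2_log_lik, k):
    """AIC = -2 log(L) + 2K."""
    return neg2_log_lik + 2 * k

# Two hypothetical fits: a 3-parameter model vs a 4-parameter model
# whose fit (-2 log L) is virtually the same
m1 = aic(660.10, 3)   # about 666.1
m2 = aic(660.06, 4)   # about 668.1
print(m1, m2)  # the extra parameter costs roughly 2 AIC units
```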
The ‘optimal’ tradeoff between fit and the number of parameters occurs with
the smallest value of AIC among models in our model set.
The above equation for AIC can be modified slightly to account for small
sample sizes (leading to AICc ) or for a general lack of fit of any model (leading
to QAIC or QAICc ). These details are not explored in this overview, but the
same general principles are applicable.
While the ‘optimal’ model (among those in the model set) is the one with the lowest AIC, there may be several models that differ from the lowest by only a small amount. How much support is there for selecting one model over another? Notice the use of the word support, rather than statistical significance. Anderson and Burnham (2002) and Anderson (2008) recommend several rules of thumb to select among models, based on differences in AIC. The difference in AIC between a specific model and the best fitting model is denoted as ∆AIC. By definition, ∆AIC = 0 for the best fitting model. When the difference in AIC between 2 models (∆AIC) is less than 2 units, then one is reasonably safe in saying that both models have approximately equal support in the data. If 2 < ∆AIC < 7, then there is considerable support for a real difference between the models, and if ∆AIC > 7, then there is strong evidence to support the conclusion of differences between the models.
The estimates from the individual models can then be combined together using the AIC weights in a weighted average. For example, suppose there are R models in the candidate set, and let θ̂i represent the estimate from the ith model. Then the model averaged estimate is found as:

θ̂_avg = Σ_{i=1 to R} wi θ̂i
Buckland et al. (1997) also showed how to estimate a standard error for this averaged estimate that includes both the standard error from each of the candidate models (i.e. sampling uncertainty for each model), and the variation in the estimate among the candidate models (i.e. model uncertainty):

se_avg = Σ_{i=1 to R} wi √( se²(θ̂i) + (θ̂i − θ̂_avg)² )

The first component under the √ sign refers to the standard error of each estimate for a particular model; the second component refers to variation in the estimates around the model averaged estimate.
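A small numeric sketch of model averaging (the weights, estimates, and standard errors below are made up for illustration):

```python
import math

# Hypothetical estimates, standard errors, and AIC weights for R = 2 models
weights = [0.6, 0.4]
estimates = [0.5, 0.7]
ses = [0.1, 0.1]

# Weighted average of the estimates
theta_avg = sum(w * t for w, t in zip(weights, estimates))

# Buckland et al. (1997): weighted sum of sqrt(se_i^2 + (theta_i - theta_avg)^2)
se_avg = sum(w * math.sqrt(s**2 + (t - theta_avg)**2)
             for w, t, s in zip(weights, estimates, ses))
print(round(theta_avg, 2), round(se_avg, 3))
```

Note that the unconditional se is larger than either model-specific se because the spread of the estimates across models (model uncertainty) is included.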
What about statistical significance - where is the p-value for this model?
The point Anderson and Burnham (2002) and Anderson (2008) make at this
stage is to suggest that this sort of question represents misplaced focus. Instead,
they suggest we should place greater emphasis on the effect size (the magnitude
of the difference in estimates between models), than on significance levels. This
is analogous to the arguments in favor of using confidence intervals in lieu of
p-values in ecology.
The study was conducted over 7 years with both males and females marked.
Here is a sample of the raw data in capture-history format. There were a total
of 294 birds that were marked.
For example, the history 1101110 indicates the female bird was captured in year 1, had a band applied to its leg, and was released. It was recaptured (alive) in year 2, not seen in year 3, recaptured (alive) in years 4, 5, and 6, and not seen after year 6. However, the fate in year 7 is unknown – the bird either died between years 6 and 7, or it was alive in year 7 and not recaptured – there is no way to know exactly what happened.
The probability model to describe this data has 4 sets of parameters: a set
of yearly survival rates for the male birds, a set of yearly survival rates for
the female birds, a set of yearly recapture rates for males, and a set of yearly
recapture rates for females. There are several biological hypotheses that can be
examined in a series of models, and interest lies in which model best describes
the data with estimates of survival and recapture. For example, perhaps males
and females have the same yearly survival rates but the rates differ across years.
Or males and females have different survival rates, but the survival rate for each
sex is constant over time. Similarly, there are multiple models for the capture
rates.
Model           Interpretation
phi(t), p(g)    Survival is the same for males and females, but the common survival rate varies across time. Recapture rates are equal across time, but vary between the groups.
phi(g*t), p(g)  Survival is different for every combination of group (sex) and time, i.e. no two values are equal. Recapture rates are equal across time, but vary between the groups.
phi(g), p(g*t)  Survival is different for males and females (a group effect) but constant over time. Recapture rates are different for every combination of group (sex) and time.
phi(.), p(g*t)  Survival is the same for both males and females (no group effect) and constant over time. Recapture rates are different for every combination of group (sex) and time.
Model               Interpretation
phi(g*flood), p(g)  Survival is different between males and females, and each sex’s survival rate is equal across the non-flood (or flood) years, but differs between flood and non-flood years. Recapture rates are equal across time, but vary between the groups.
Models considered for the Dipper study
Model
Phi(t) p(t)
Phi(g*t) p(g*t)
Phi(g) p(g)
Phi(.) p(.)
Phi(Flood) p(.)
Phi(Flood) p(Flood)
Phi(.) p(g)
Phi(g) p(.)
Phi(.) p(Flood)
Phi(t) p(.)
Phi(t) p(g)
Phi(.) p(t)
Phi(g) p(t)
Phi(g*t) p(.)
Phi(g*t) p(g)
Phi(.) p(g*t)
Phi(t) p(g*t)
Phi(g*t) p(t)
Phi(g) p(g*t)
We showed earlier how the probability of each capture history can be written in terms of the parameters φ_gt and p_gt – you should select a few models to write out some of the histories. You can also get a sense for the models by constructing
a diagram showing the relationship among the parameters (known among users
of the computer program MARK as the PIMs). This will be demonstrated in
class.
Note that biologically speaking, only the model Phi(g*t) p(g*t) could possibly be true! For example, a phi(g) model would say that the survival rates differ between males and females but are equal across all times. It is logically impossible that the survival rate in year 1 would be equal to the survival rate in year 2 to 40 decimal places! So why would we fit such logically impossible models? The AIC paradigm says that ALL models are wrong, but some models may closely approximate reality. All else being equal, simpler models are preferred over more complex models because the uncertainty in the estimates must be smaller. But even the most complex model is likely wrong, because it (implicitly) assumes that all birds of the same sex have the same survival rate in each year, which likely isn’t true due to innate differences in fitness among individuals. The real world is infinitely complex and can’t possibly be captured by simple models, but we hope that our models are close approximations.
All of the models were fit using Maximum Likelihood using Program MARK. A summary of the results is presented below:
Model                    -2log(L)  Num. Par  AICc    Delta AICc  AICc Weights  Relative Likelihood
{Phi(Flood) p(.) } 660.10 3 666.16 0.00 0.61 1.00
{Phi(Flood) p(Flood) } 660.06 4 668.16 2.00 0.23 0.37
{Phi(.) p(.) } 666.84 2 670.87 4.71 0.06 0.10
{Phi(.) p(g) } 666.19 3 672.25 6.09 0.03 0.05
{Phi(g) p(.) } 666.68 3 672.73 6.57 0.02 0.04
{Phi(.) p(Flood) } 666.82 3 672.88 6.72 0.02 0.03
{Phi(t) p(.) } 659.73 7 674.00 7.84 0.01 0.02
{Phi(g) p(g) } 666.15 4 674.25 8.09 0.01 0.02
{Phi(t) p(g) } 659.16 8 675.50 9.34 0.01 0.01
{Phi(.) p(t) } 664.48 7 678.75 12.59 0.00 0.00
{Phi(t) p(t) } 656.95 11 679.59 13.43 0.00 0.00
{Phi(g) p(t) } 664.30 8 680.65 14.49 0.00 0.00
{Phi(g*t) p(.) } 658.24 13 685.12 18.96 0.00 0.00
{Phi(g*t) p(g) } 657.90 14 686.92 20.76 0.00 0.00
{Phi(.) p(g*t) } 662.25 13 689.13 22.97 0.00 0.00
{Phi(t) p(g*t) } 654.53 17 690.03 23.87 0.00 0.00
{Phi(g*t) p(t) } 655.47 17 690.97 24.81 0.00 0.00
{Phi(g) p(g*t) } 662.25 14 691.27 25.11 0.00 0.00
{Phi(g*t) p(g*t) } 653.95 22 700.46 34.30 0.00 0.00
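The AICc Weights and Relative Likelihood columns can be reproduced from the Delta AICc column using the standard formula w_i = exp(−∆_i/2) / Σ_j exp(−∆_j/2) (this is Burnham and Anderson’s weight formula; it is not written out explicitly in this chapter):

```python
import math

# Delta AICc values from the table above (same order as the rows)
deltas = [0.00, 2.00, 4.71, 6.09, 6.57, 6.72, 7.84, 8.09, 9.34, 12.59,
          13.43, 14.49, 18.96, 20.76, 22.97, 23.87, 24.81, 25.11, 34.30]

# Relative likelihood of each model: exp(-delta/2); weights normalize these
rel = [math.exp(-d / 2) for d in deltas]
weights = [r / sum(rel) for r in rel]
print([round(w, 2) for w in weights[:3]])  # [0.61, 0.23, 0.06]
```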
First consider the number of parameters for each model. For model {phi(.)p(.)}, there are two parameters – the survival rate (φ) that is common to both sexes and equal across all years, and a similar parameter for the recapture rates. For model {phi(.)p(Flood)}, there are three parameters: the common survival rate, and the two recapture rates for flood vs. non-flood years, each of which is the same across both sexes and the respective years. For model {phi(t), p(.)}, there are 7 parameters: 6 survival rates (between years 1 and 2, 2 and 3, . . ., 6 and 7) that are common to both sexes, and a single recapture rate that is common to both sexes and all years.
As models become more complex (i.e. more parameters), they must fit better: the log-likelihood must increase. But the table (for historical reasons) reports −2log(L), which implies that −2log(L) decreases as models get more complex. But as models become more complex (with more parameters), the same amount of information is split among more and more estimates, leading to estimates with worse precision (i.e. larger standard errors). For example, compare the estimates from models with and without the effects of the flood years on the survival rates. Notice that the se of the survival rate estimates in the simpler model are smaller (i.e. more precise) than in the more complex model.
In general, models with a small AICc are “better” in the tradeoff between model fit and model complexity. The AICc (the regular AIC corrected for small sample sizes) incorporates both model complexity (number of parameters) and model fit (likelihood). Look at the AICc for the top two models. The fit is almost the same (−2log(L) is virtually identical), but the second model has one more parameter. Consequently, the AICc “penalizes” the second model for being unnecessarily complex: its −2log(L) is virtually the same, but the 2× the number of parameters term (after the small sample adjustment) results in almost a 2 unit increase in the AICc.
The Delta AICc column measures how much worse the other models in the model set are, relative to the “best” model in the model set. A rule of thumb is that differences of less than 3 or 4 units indicate two models of essentially equivalent fit. The reason for the “essentially equivalent” fit is that the model fits (and rankings) would change if the experiment were to be repeated (as any statistics computed on data must). Accordingly, the first 2 models are not really distinguishable.
The AICc Weights column translates the Delta AICc values into the relative support for each model, relative to the models in the model set. It indicates that most of the weight should be placed on the first two models, with some minor weighting placed on the next 3 or 4 models, but virtually no weight on the remaining models.
In the AIC paradigm there are NO p-values for choosing between models, and NO model selection trying to find the single best model. The AIC paradigm recognizes that ALL of the models presented here are likely wrong (even the most complex model assumes that all animals have the same survival rate within a sex-year combination, which is likely untrue because different animals have different “fitness”). Consequently, trying to select the single “best” model is a fool’s paradise! So “testing” if there is a flood effect isn’t done – after all, do you really expect absolutely NO effect (to 40 decimal places) of flooding on survival rates?
First you need to understand how you can get a survival rate for males in
year 1 from each of the models when sex and/or year does not appear as a model
effect. Consider the first model, where there is a flood effect on survival but no
sex effect. Then the survival rate in year 1 for males will be the estimated
survival rate for non-flood years (which is common between the sexes). In the
model with phi(.), the estimate comes from the estimated survival rate that
is common among all years and both sexes, etc.
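The mapping from a (sex, year) combination to a model parameter can be made concrete with a small sketch. The function names, parameter labels, and survival values below are all invented for illustration; they are not the estimates from the example above.

```python
# Hypothetical sketch: each model maps a (sex, year) combination onto one of
# that model's own parameters. All values here are invented for illustration.
FLOOD_YEARS = {2}  # suppose the flood occurred between years 2 and 3

def phi_dot(sex, year, params):
    # phi(.): a single survival rate shared by all years and both sexes
    return params["phi"]

def phi_flood(sex, year, params):
    # phi(flood): a flood effect but no sex effect, so only two rates total
    key = "phi_flood" if year in FLOOD_YEARS else "phi_nonflood"
    return params[key]

# Under phi(flood), the "male, year 1" estimate equals the "female, year 1"
# estimate, because sex does not appear in the model:
p = {"phi_nonflood": 0.65, "phi_flood": 0.40}
male_y1 = phi_flood("M", 1, p)
female_y1 = phi_flood("F", 1, p)
```

This is why simpler models can still supply an estimate for every sex-year combination: the combination is simply routed to whichever shared parameter the model structure dictates.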
Notice that the estimated survival rates for males between year 1 and year
2 vary considerably among models, as do the standard errors. The AIC
paradigm constructs a weighted average of the estimates from each model, which
is reported at the bottom of the page. Because the model weights for the first
few models total close to 100%, the final estimate must be close to the estimates
from these models.
The model averaged se is not a simple weighted average (see the previous
notes) but follows the same philosophy: models with higher AICc weights
contribute more to the model averaged values. The Unconditional SE line adds
an additional source of uncertainty, that from the models themselves. Notice
that the estimates vary considerably about the model averaged value of .60.
The effect of the different models (after weighting by the AICc weight) is also
incorporated. In this case, the models with vastly different estimates have little
weight, so the unconditional se is very similar to (but slightly larger than) the
simple model averaged se.
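The model averaging just described can be sketched directly from the standard Burnham and Anderson formulas: the model-averaged estimate is the weight-weighted mean, and the unconditional SE adds the between-model spread to each model's own sampling variance. The estimates, standard errors, and weights below are invented for illustration, not the values from the tables above.

```python
import math

def model_average(estimates, ses, weights):
    """Model-averaged estimate and unconditional SE:
       theta_bar = sum_i w_i * theta_i
       se_uncond = sum_i w_i * sqrt(se_i^2 + (theta_i - theta_bar)^2)
    """
    theta_bar = sum(w * t for w, t in zip(weights, estimates))
    se_uncond = sum(
        w * math.sqrt(se ** 2 + (t - theta_bar) ** 2)
        for w, se, t in zip(weights, ses, estimates)
    )
    return theta_bar, se_uncond

# Three models: the two well-supported ones agree, the third is an outlier
# with almost no weight (all numbers invented for illustration).
est = [0.61, 0.59, 0.30]
se = [0.05, 0.05, 0.10]
w = [0.55, 0.40, 0.05]
theta_bar, se_uncond = model_average(est, se, w)
```

Because the outlying model carries only 5% of the weight, the unconditional SE ends up only slightly larger than the simple weighted average of the individual standard errors, mirroring the behaviour described above.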
Now consider the model averaged survival rates for males between year 2
and 3 (a flood year).
Some models give the same estimated survival between year 1 and year 2
(non-flood, previous table) and between year 2 and year 3 (flood). Why? The
model averaging continues in the same way. Notice that the unconditional se is
now much larger than the conditional se because there is more variation among
the top models in the estimated survival rates.
The model averaged values for males and females can be computed for all
time periods:
Note the use of “apparent survival” because a bird that dies or permanently
leaves the study area cannot be distinguished. Also notice the odd results for
the apparent survival rate between years 6 and 7. The se and confidence intervals
are very large because this parameter cannot be estimated in some models (see
above). When using model averaging, you must be careful to average only
estimates that are comparable and identifiable in all models.
The final model averaged values for males and females can be plotted (along
with a 95% confidence interval):
Along with the model fitting above, it is important to conduct thorough model
assessment to ensure that even the best fitting model is reasonably sensible.
This is not covered in this brief review – refer to the vast literature on capture-
recapture models for assistance.
When using the AIC paradigm it is important to specify the model set
BEFORE the analysis begins to avoid data dredging, and the model set should
be comprehensive to include all models of biological interest. Nevertheless, it is
important not to simply throw in all possible models and mechanically use this
procedure – each model in the model set should have a biological justification.
The program MARK has an extensive implementation of the AIC paradigm
for use in capture-recapture studies. The AIC paradigm is the accepted standard
for publishing work that involves capture-recapture experiments. If you submit
a manuscript involving the use of capture-recapture methods and do not use
AIC, it will likely be returned unread and unreviewed.
This paper illustrates the use of AIC methods in modern wildlife manage-
ment research. In this article the authors use capture-recapture methods to
study the survival of juvenile pygmy rabbits in east-central Idaho, US. In their
study, they attached radio-tags to newly born rabbits and then followed rabbits
every 3 or 4 days to see if the rabbit was alive or dead.
The known-fate model that is fit takes into account that each animal is
detected with 100% probability (because of the radio tracking) and that, if the
animal dies, the time of death is known (to within 3 days). Consequently, the
only unknown parameters are the weekly survival rates. This differs from
many capture-recapture experiments where detectability is less than 100% and
you must estimate both survival and detection.
The survival rate for the first week is .94 = 47/50. This means that there
were 47 animals alive at the start of week 2. However, only 45 could be located (2
radios could have died). The animals with radio tags that could not be located
at the start of week 2 are censored; their fate is unknown. The KM method
computes the survival rate over week 2 as 42/45. The cumulative survival rate
over the first 2 weeks is computed as (47/50) × (42/45). Because 3 of the 45
located animals died by the end of week 2, there are 42 animals alive; all radios
were located at the start of week 3, so the number at risk in week 3 is 42. The
cumulative survival to the end of week 3 is computed as (47/50) × (42/45) ×
(40/42). The table and the KM estimates can be extended to the end of the
study in a similar fashion.
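The KM product over the weeks can be sketched directly from the counts in the text: in each week, divide the number surviving by the number located and at risk, and multiply the weekly ratios together. The counts below (50, 47, 45, 42, 40) are the ones given above.

```python
def km_cumulative(at_risk, survived):
    """Kaplan-Meier cumulative survival: the product over weeks of
    (number surviving the week) / (number located and at risk that week).
    Censored animals simply drop out of the at-risk count."""
    s = 1.0
    for n, a in zip(at_risk, survived):
        s *= a / n
    return s

# Week 1: 50 tagged, 47 survive. Week 2: only 45 located (2 censored),
# 42 survive. Week 3: 42 at risk, 40 survive.
s3 = km_cumulative([50, 45, 42], [47, 42, 40])  # (47/50) * (42/45) * (40/42)
```

Note that censoring affects only the denominator of the week in which the animal disappears; the product itself is unchanged by animals that are never at risk.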
There were several potential predictor variables for the weekly survival rates
as outlined in the METHODS section. Based on biologically reasonable grounds,
a set of 14 a priori models was constructed (the model set). Examine the model
set in Table 1 – be sure you understand the differences among the models and
the biological interpretation of the models. In particular, what does a
YEAR × AREA model mean? What does the Constant survival model mean?
What does the model with the effect of BORN mean?
ML estimates were found for each model, and AIC was used to rank the
relative fit and complexity for the models in the model set (Table 1). Be sure
you understand the number of parameters in each model.
What is meant by the sentence in the report “A set of 9 models was included
in the top model set . . . indicated relatively high model uncertainty.” (first
paragraph under Figure 1)?
Understand how Table 2 was computed and how to interpret the table.
An understanding of how the table was computed will have to be conceptual
because the authors have omitted many details from the paper such as the
fact that survival was modelled on the logit scale, the covariate for BORN was
standardized automatically by MARK etc. As such, the actual numbers in Table
2 are pretty much useless for actual hand computations (!).
Look at Figure 3. The bars suggest that the effect of year is about the same
in both areas except translated upwards. What “model” does this suggest? Did
this model rank high in Table 1? [Don’t forget to take into account the size of
the se shown in the plot.]
Notice that there are NO p-values in the entire paper, and every estimate
has an associated measure of precision (a se).
1.3 References
Anderson, D.R. (2008). Model Based Inference in the Life Sciences: A Primer
on Evidence. Springer, New York.
Lebreton, J.-D., Burnham, K.P., Clobert, J. & Anderson, D.R. (1992). Modeling
survival and testing biological hypotheses using marked animals: a unified
approach with case studies. Ecological Monographs, 62, 67-118.