Multiple linear regression can be generalized to handle a response variable that is categorical
or a count variable. This lesson covers the basics of such models, specifically logistic and
Poisson regression, including model fitting and inference.
Multiple linear regression, logistic regression, and Poisson regression are examples of
generalized linear models, which this lesson introduces briefly.
The lesson concludes with some examples of nonlinear regression, specifically exponential
regression and population growth models.
Used when the response is binary (i.e., it has two possible outcomes). The cracking
example given above would utilize binary logistic regression. Other examples of binary
responses could include passing or failing a test, responding yes or no on a survey, and
having high or low blood pressure.
Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).
1 of 32 11-02-2018, 02:54
https://onlinecourses.science.psu.edu/stat501/print/book/export/html/369
Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal.
Examples of ordinal responses could be how students rate the effectiveness of a college
course on a scale of 1-5, levels of flavors for hot wings, and medical condition (e.g., good,
stable, serious, critical).
Particular issues with modelling a categorical response variable include nonnormal error
terms, nonconstant error variance, and constraints on the response function (i.e., the
response is bounded between 0 and 1). We will investigate ways of dealing with these in the
binary logistic regression setting here. There is some discussion of the nominal and ordinal
logistic regression settings in Section 15.2.
where \pi here denotes a probability and not the irrational number 3.14....
For a sample of size n, the likelihood for a binary logistic regression is given by:
\begin{equation*} L(\beta;\textbf{y},\textbf{X})=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}. \end{equation*}
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, \hat{\beta}.
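The iteratively reweighted least squares idea can be sketched in a few lines of Python with numpy. This is a minimal illustration of the algorithm, not Minitab's implementation; the function name and tiny data set are made up for the example.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Estimate binary logistic regression coefficients by IRLS.

    X : (n, p) design matrix whose first column is all ones (intercept).
    y : (n,) array of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor X*beta
        pi = 1.0 / (1.0 + np.exp(-eta))    # fitted probabilities
        w = pi * (1.0 - pi)                # weights: Var(y_i) = pi_i(1 - pi_i)
        z = eta + (y - pi) / w             # working response
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Tiny made-up check: an intercept-only fit recovers the log-odds of the
# sample proportion of successes.
y = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)   # mean 0.25
X = np.ones((len(y), 1))
print(fit_logistic_irls(X, y))   # ≈ [log(0.25/0.75)] ≈ [-1.0986]
```

Each pass is just a weighted least squares fit, which is why the method converges quickly when the model is well behaved.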
The following gives the estimated logistic regression equation and associated significance
tests from Minitab v17:
Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.
Select "REMISS" for the Response (the response event for remission is 1 for this data).
Select all the predictors as Continuous predictors.
Click Options and choose Deviance or Pearson residuals for diagnostic plots.
Click Graphs and select "Residuals versus order."
Click Results and change "Display of results" to "Expanded tables."
Click Storage and select "Coefficients."
Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant 64.3 75.0 ( -82.7, 211.2) 0.86 0.391
CELL 30.8 52.1 ( -71.4, 133.0) 0.59 0.554 62.46
SMEAR 24.7 61.5 ( -95.9, 145.3) 0.40 0.688 434.42
INFIL -25.0 65.3 (-152.9, 103.0) -0.38 0.702 471.10
LI 4.36 2.66 ( -0.85, 9.57) 1.64 0.101 4.43
BLAST -0.01 2.27 ( -4.45, 4.43) -0.01 0.996 4.18
TEMP -100.2 77.8 (-252.6, 52.2) -1.29 0.198 3.01
Wald Test
The Wald test is the test of significance for individual regression coefficients in logistic regression (recall that we use t-tests in linear regression). For maximum likelihood estimates, the ratio
\begin{equation*} Z=\frac{\hat{\beta}_{i}}{\textrm{se}(\hat{\beta}_{i})} \end{equation*}
can be used to test H_{0}: \beta_{i}=0. The standard normal curve is used to determine the p-value of the test. Furthermore, confidence intervals can be constructed as
\begin{equation*} \hat{\beta}_{i}\pm z_{1-\alpha/2}\textrm{se}(\hat{\beta}_{i}). \end{equation*}
Estimates of the regression coefficients, \hat{\beta}, are given in the Minitab output Coefficients table
in the column labeled "Coef." This table also gives coefficient p-values based on Wald tests.
The index of the bone marrow leukemia cells (LI) has the smallest p-value and so appears to
be closest to a significant predictor of remission occurring. After looking at various subsets of
the data, we find that a good model is one which only includes the labeling index as a
predictor:
Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant -3.78 1.38 (-6.48, -1.08) -2.74 0.006
LI 2.90 1.19 ( 0.57, 5.22) 2.44 0.015 1.00
Regression Equation
P(1) = exp(Y')/(1 + exp(Y'))
Y' = -3.78 + 2.90 LI
Since we only have a single predictor in this model we can create a Binary Fitted Line Plot to
visualize the sigmoidal shape of the fitted logistic regression curve:
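To make that sigmoidal shape concrete, the fitted equation can be evaluated directly; a quick Python sketch using the rounded coefficients −3.78 and 2.90 from the output above (the function name is just for illustration):

```python
import math

def p_remission(li):
    """Estimated P(remission): Y' = -3.78 + 2.90*LI, pi = exp(Y')/(1+exp(Y'))."""
    y = -3.78 + 2.90 * li
    return math.exp(y) / (1.0 + math.exp(y))

# Probabilities rise along an S-shaped curve as LI increases.
for li in (0.5, 1.0, 1.5, 2.0):
    print(li, round(p_remission(li), 3))
```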
There are algebraically equivalent ways to write the logistic regression model:
The first is
\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1}), \end{equation*}
which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event is \pi/(1-\pi), where \pi is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then its odds are 0.80 / (1 - 0.80) = 4, or 4:1.
The second is
\begin{equation*} \log\biggl(\frac{\pi}{1-\pi}\biggr)=\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1}, \end{equation*}
which states that the (natural) logarithm of the odds is a linear function of the X variables (and is often called the log odds). This is also referred to as the logit transformation of the probability of success, \pi.
The odds ratio (which we will write as \theta) between the odds for two sets of predictors (say \textbf{X}_{(1)} and \textbf{X}_{(2)}) is given by
\begin{equation*} \theta=\frac{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}
By plugging this into the formula for \pi above and setting \textbf{X}_{(1)} equal to \textbf{X}_{(2)} except in one position (i.e., only one predictor differs by one unit), we can determine the relationship between that predictor and the response. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for higher levels of a continuous predictor (or for the indicated level of a factor). In particular, the odds increase multiplicatively by e^{\beta_{j}} for every one-unit increase in X_{j}. If the odds ratio is less than 1, then the odds of success are lower for higher levels of a continuous predictor (or for the indicated level of a factor). Values farther from 1 represent stronger degrees of association.
For example, when there is just a single predictor, X, the odds of success are:
\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X), \end{equation*}
so increasing X by one unit multiplies the odds by \theta=e^{\beta_{1}}.
To illustrate, the relevant Minitab output from the leukemia example is:
The odds ratio for LI of 18.1245 is calculated as e^{2.89726}=18.1245 (you can view more decimal places for the coefficient estimates in Minitab by clicking "Storage" in the Regression Dialog and selecting "Coefficients"). The 95% confidence interval is calculated as \exp(2.89726\pm z_{0.975}\times 1.19), where z_{0.975}=1.960 is the 97.5^{\textrm{th}} percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia remission are multiplied by 18.1245. However, since LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by e^{0.1\times 2.89726}=1.336.
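These odds-ratio calculations can be checked with a few lines of Python, using the more precise coefficient estimate 2.89726 and its standard error 1.19 from the output:

```python
import math

b_li, se = 2.89726, 1.19
odds_ratio = math.exp(b_li)              # ≈ 18.12
z = 1.959964                             # 97.5th percentile of the standard normal
ci = (math.exp(b_li - z * se), math.exp(b_li + z * se))
per_tenth_unit = math.exp(0.1 * b_li)    # odds multiplier per 0.1-unit LI increase
print(round(odds_ratio, 4), tuple(round(v, 2) for v in ci), round(per_tenth_unit, 3))
```

Note that the interval is just the coefficient's confidence interval exponentiated, which is why it is so wide here.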
The likelihood ratio test is used to test the null hypothesis that any subset of the \beta's is equal to 0. The number of \beta's in the full model is p, while the number of \beta's in the reduced model is r. (Remember that the reduced model is the model that results when the \beta's in the null hypothesis are set to 0.) Thus, the number of \beta's being tested in the null hypothesis is p-r. The likelihood ratio test statistic is then given by:
\begin{equation*} G^{2}=-2(\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})), \end{equation*}
where \ell(\hat{\beta}) is the log likelihood of the fitted (full) model and \ell(\hat{\beta}^{(0)}) is the log likelihood of the (reduced) model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a \chi^{2} distribution with p-r degrees of freedom. Minitab presents results for this test in terms of "deviance," which is defined as -2 times log-likelihood. The notation used for the test statistic is typically G^{2} = deviance (reduced) – deviance (full).
This test procedure is analogous to the general linear F test procedure for multiple linear regression. However, note that when testing a single coefficient, the Wald test and likelihood ratio test will not in general give identical results.
To illustrate, the relevant software output from the leukemia example is:
Deviance Table
Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    8.299     8.299        8.30    0.004
  LI         1    8.299     8.299        8.30    0.004
Since there is only a single predictor for this example, this table simply provides information
on the likelihood ratio test for LI (p-value of 0.004), which is similar but not identical to the
earlier Wald test result (p-value of 0.015). The Deviance Table includes the following:
The null (reduced) model in this case has no predictors, so the fitted probabilities are simply the overall sample proportion of successes. The log-likelihood for the null model is \ell(\hat{\beta}^{(0)})=-17.1859, so the deviance for the null model is -2\times-17.1859=34.372, which is shown in the "Total" row in the Deviance Table.
The log-likelihood for the fitted (full) model is \ell(\hat{\beta})=-13.0365, so the deviance for the fitted model is -2\times-13.0365=26.073, which is shown in the "Error" row in the Deviance Table.
The likelihood ratio test statistic is therefore G^{2}=34.372-26.073=8.299, which is the same as G^{2}=2(\ell(\hat{\beta})-\ell(\hat{\beta}^{(0)}))=2(-13.0365-(-17.1859))=8.299.
The p-value comes from a \chi^{2} distribution with 2-1=1 degrees of freedom.
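This arithmetic is easy to verify in plain Python. For 1 degree of freedom the chi-square tail probability can be written with the complementary error function, so no statistics library is needed; the deviances used are the fitted-model (Error) deviance 26.073 and the null (Total) deviance 34.372 = 26.073 + 8.299 from the tables above:

```python
import math

dev_null, dev_fitted = 34.372, 26.073
g2 = dev_null - dev_fitted              # likelihood ratio statistic G^2
# chi-square(1 df) upper tail: P(X > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c/2))
p_value = math.erfc(math.sqrt(g2 / 2.0))
print(round(g2, 3), round(p_value, 3))  # 8.299 and 0.004, matching the table
```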
When using the likelihood ratio (or deviance) test for more than one regression coefficient, we
can first fit the "full" model to find deviance (full), which is shown in the "Error" row in the
resulting full model Deviance Table. Then fit the "reduced" model (corresponding to the model
that results if the null hypothesis is true) to find deviance (reduced), which is shown in the
"Error" row in the resulting reduced model Deviance Table. For example, the relevant
Deviance Tables for the Disease Outbreak [2] example on pages 581-582 of Applied Linear
Regression Models (4th ed) by Kutner et al are:
Full model:
Reduced model:
Here the full model includes four single-factor predictor terms and five two-factor interaction terms, while the reduced model excludes the interaction terms. The test statistic for testing the interaction terms is G^{2} = deviance (reduced) – deviance (full), which is compared to a chi-square distribution with 5 degrees of freedom (one for each interaction term) to find the p-value > 0.05 (meaning the interaction terms are not significant).
Alternatively, select the corresponding predictor terms last in the full model and request Minitab to output Sequential (Type I) Deviances. Then add the corresponding Sequential Deviances in the resulting Deviance Table to calculate G^{2}. For example, the relevant Deviance Table for the Disease Outbreak [2] example is:
Goodness-of-Fit Tests
Overall performance of the fitted model can be measured by several different goodness-of-fit
tests. Two tests that require replicated data (multiple observations with the same values for all
the predictors) are the Pearson chi-square goodness-of-fit test and the deviance
goodness-of-fit test (analogous to the multiple linear regression lack-of-fit F-test). Both of
these tests have statistics that are approximately chi-square distributed with c - p degrees of
freedom, where c is the number of distinct combinations of the predictor variables. When a
test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of
lack of fit.
To illustrate, the relevant Minitab output from the leukemia example is:
Goodness-of-Fit Tests
Test DF Chi-Square P-Value
Deviance 25 26.07 0.404
Pearson 25 23.93 0.523
Hosmer-Lemeshow 7 6.87 0.442
Since there is no replicated data for this example, the deviance and Pearson goodness-of-fit
tests are invalid, so the first two rows of this table should be ignored. However, the Hosmer-
Lemeshow test does not require replicated data so we can interpret its high p-value as
indicating no evidence of lack-of-fit.
R2
The calculation of R2 used in linear regression does not extend directly to logistic regression. One version of R2 used in logistic regression is defined as
\begin{equation*} R^{2}=\frac{\ell(\hat{\beta}_{0})-\ell(\hat{\beta})}{\ell(\hat{\beta}_{0})-\ell_{S}}, \end{equation*}
where \ell(\hat{\beta}_{0}) is the log likelihood of the model when only the intercept is included and \ell_{S} is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R2 does go from 0 to 1, with 1 being a perfect fit. With unreplicated data, \ell_{S}=0, so the formula simplifies to:
\begin{equation*} R^{2}=\frac{\ell(\hat{\beta}_{0})-\ell(\hat{\beta})}{\ell(\hat{\beta}_{0})}=1-\frac{D(\hat{\beta})}{D(\hat{\beta}_{0})}. \end{equation*}
To illustrate, the relevant Minitab output from the leukemia example is:
Model Summary
Deviance R-Sq  Deviance R-Sq(adj)    AIC
      24.14%              21.23%  30.07
Note that we can obtain the same result by simply using deviances instead of log-likelihoods, since the factor of -2 cancels out:
\begin{equation*} R^{2}=1-\frac{D(\hat{\beta})}{D(\hat{\beta}_{0})}=1-\frac{26.073}{34.372}=0.2414. \end{equation*}
Raw Residual
The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is
\begin{equation*} r_{i}=y_{i}-\hat{\pi}_{i}. \end{equation*}
Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is
\begin{equation*} p_{i}=\frac{y_{i}-\hat{\pi}_{i}}{\sqrt{\hat{\pi}_{i}(1-\hat{\pi}_{i})}}. \end{equation*}
Deviance Residuals
Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is
\begin{equation*} d_{i}=\textrm{sgn}(y_{i}-\hat{\pi}_{i})\sqrt{-2[y_{i}\log(\hat{\pi}_{i})+(1-y_{i})\log(1-\hat{\pi}_{i})]}. \end{equation*}
Here are the plots of the Pearson residuals and deviance residuals for the leukemia example.
There are no alarming patterns in these plots to suggest a major problem with the model.
Hat Values
The hat matrix serves a similar purpose as in the case of linear regression – to measure the influence of each observation on the overall fit of the model – but the interpretation is not as clear due to its more complicated form. The hat values (leverages), h_{i,i}, are the diagonal entries of the hat matrix
\begin{equation*} H=\textbf{W}^{1/2}\textbf{X}(\textbf{X}^{\textrm{T}}\textbf{W}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}\textbf{W}^{1/2}, \end{equation*}
where \textbf{W} is an n\times n diagonal matrix with the values of \hat{\pi}_{i}(1-\hat{\pi}_{i}) on the diagonal.
Studentized Residuals
We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by
\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}}. \end{equation*}
C and \bar{C}
C and \bar{C} are extensions of Cook's distance for logistic regression. C measures the overall change in fitted logits due to deleting the i^{\textrm{th}} observation for all points including the one deleted, while \bar{C} excludes the deleted point. They are defined by:
\begin{equation*} C_{i}=\frac{p_{i}^{2}h_{i,i}}{(1-h_{i,i})^{2}} \end{equation*}
and
\begin{equation*} \bar{C}_{i}=\frac{p_{i}^{2}h_{i,i}}{1-h_{i,i}}. \end{equation*}
To illustrate, the relevant Minitab output from the leukemia example is:
The default residuals in this output (set under Minitab's Regression Options) are deviance
residuals, so observation 8 has a deviance residual of , a studentized deviance
residual of , a leverage (h) of , and a Cook's distance (C) of 0.58.
DFDEV and DFCHI are statistics that measure the change in deviance and in Pearson's chi-square, respectively, that occurs when an observation is deleted from the data set. Large values of these statistics indicate observations that have not been fitted well. The formulas for these statistics are
\begin{equation*} \textrm{DFDEV}_{i}=d_{i}^{2}+\bar{C}_{i} \end{equation*}
and
\begin{equation*} \textrm{DFCHI}_{i}=\frac{\bar{C}_{i}}{h_{i,i}}. \end{equation*}
In binary logistic regression, we only had two possible outcomes. For polytomous logistic
regression, we will consider the possibility of having k > 2 possible outcomes. (Note: The
word polychotomous is sometimes used, but note that this is not actually a word!)
The multiple nominal logistic regression model (sometimes called the multinomial logistic regression model) is given by the following:
\begin{equation*} \pi_{j}=\frac{\exp(\textbf{X}\beta_{(j)})}{\sum_{l=1}^{k}\exp(\textbf{X}\beta_{(l)})},\quad j=1,\ldots,k, \end{equation*}
where again \pi denotes a probability and not the irrational number. Notice that k - 1 of the groups have their own set of \beta values. Furthermore, since \sum_{j=1}^{k}\pi_{j}=1, we set the \beta values for group 1 to be 0 (this is what we call the reference group). Notice that when k = 2, we are back to binary logistic regression.
\pi_{j} is the probability that an observation is in one of k categories. The likelihood for the nominal logistic regression model is given by:
\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}(1-\pi_{i,j})^{1-y_{i,j}}, \end{align*}
where the subscript (i, j) means the i^{\textrm{th}} observation belongs to the j^{\textrm{th}} group. This yields the log likelihood:
\begin{align*} \ell(\beta)&=\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl[y_{i,j}\log(\pi_{i,j})+(1-y_{i,j})\log(1-\pi_{i,j})\bigr]. \end{align*}
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, \hat{\beta}.
Many of the procedures discussed in binary logistic regression can be extended to nominal
logistic regression with the appropriate modifications.
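The reference-group construction above (group 1's \beta values fixed at 0) guarantees the k probabilities sum to 1. A minimal Python sketch, with made-up linear predictors for groups 2 and 3:

```python
import math

def multinomial_probs(eta):
    """Nominal logistic probabilities with group 1 as the reference.

    eta : linear predictors X*beta_(j) for groups 2..k (group 1's is 0).
    """
    exps = [1.0] + [math.exp(e) for e in eta]   # leading 1.0 = exp(0) for group 1
    total = sum(exps)
    return [v / total for v in exps]

p = multinomial_probs([0.5, -1.2])   # hypothetical predictors for groups 2 and 3
print([round(v, 3) for v in p])      # three probabilities that sum to 1
```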
For ordinal logistic regression, we again consider k possible outcomes as in nominal logistic regression, except that the order matters. The multiple ordinal logistic regression model (a cumulative, proportional odds model) is the following:
\begin{equation*} \log\biggl(\frac{\sum_{l=1}^{j}\pi_{l}}{1-\sum_{l=1}^{j}\pi_{l}}\biggr)=\beta_{0,j}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1},\quad j=1,\ldots,k-1. \end{equation*}
\pi_{j} is still the probability that an observation is in one of k categories, but we are
constrained by the model written in the equation above. The likelihood for the ordinal logistic
regression model is given by:
\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}
(1-\pi_{i,j})^{1-y_{i,j}}, \end{align*}
where the subscript (i, j) means the i^{\textrm{th}} observation belongs to the j^{\textrm{th}}
group. This yields the log likelihood:
Notice that this is identical to the nominal logistic regression likelihood. Thus, maximization
again has no closed-form solution, so we defer to a procedure like iteratively reweighted least
squares.
For ordinal logistic regression, a proportional odds model is used to determine the odds ratio. Again, an odds ratio (\theta) of 1 serves as the baseline for comparison between the two predictor levels, say \textbf{X}_{(1)} and \textbf{X}_{(2)}. Only one parameter and one odds ratio is calculated for each predictor. Suppose we are interested in calculating the odds of being in one of the first k^{*} categories. Then the odds ratio is given by:
\begin{equation*} \theta=\frac{\sum_{j=1}^{k^{*}}\pi_{j}|_{\textbf{X}=\textbf{X}_{(1)}}/(1-\sum_{j=1}^{k^{*}}\pi_{j})|_{\textbf{X}=\textbf{X}_{(1)}}}{\sum_{j=1}^{k^{*}}\pi_{j}|_{\textbf{X}=\textbf{X}_{(2)}}/(1-\sum_{j=1}^{k^{*}}\pi_{j})|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}
Students in STAT 200 at Penn State were asked if they have ever driven after drinking
(dataset unfortunately no longer available). They also were asked, “How many days per
month do you drink at least two beers?” In the following discussion, \pi = the probability a
student says “yes” they have driven after drinking. This is modeled using X = days per month
of drinking two beers. Results from Minitab were as follows.
We see that in the sample 122/249 students said they have driven after drinking.
(Yikes!)
Parameter estimates, given under Coef, are \hat{\beta}_0 = −1.5514 and \hat{\beta}_1 = 0.19031.
The model for estimating \pi = the probability of ever having driven after drinking is
\hat{\pi}=\frac{\exp(-1.5514+0.19031X)}{1+\exp(-1.5514+0.19031X)}
We can also obtain a plot of the estimated probability of ever having driven under the
influence (\pi) versus days per month of drinking at least two beers.
The vertical axis shows the probability of ever having driven after drinking. For example, if X =
4 days per month of drinking beer, then the estimated probability is calculated as:
\hat{\pi}=\frac{\exp(-1.5514+0.19031(4))}{1+\exp(-1.5514+0.19031(4))}=\frac{\exp(-0.79016)}
{1+\exp(-0.79016)}=0.312
In the results given above, we see that the estimate of the odds ratio is 1.21 for DaysBeer. This is given under Odds Ratio in the table of coefficients, standard errors, and so on. The sample odds ratio was calculated as e^{0.19031}. The interpretation of the odds ratio is that for each increase of one day of drinking beer per month, the predicted odds of having ever driven after drinking are multiplied by 1.21.
Above we found that at X = 4, the predicted probability of ever driving after drinking is \hat{\pi}
= 0.312. Thus when X = 4, the predicted odds of ever driving after drinking is 0.312/(1 −
0.312) = 0.453. To find the odds when X = 5, one method would be to multiply the odds at X =
4 by the sample odds ratio. The calculation is 1.21 × 0.453 = 0.549. (Another method is to just
do the calculation as we did for X = 4.)
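The X = 4 and X = 5 calculations can be reproduced in Python, using the fitted coefficients −1.5514 and 0.19031 (a small sketch; the helper name is for illustration):

```python
import math

b0, b1 = -1.5514, 0.19031

def prob(x):
    """Estimated probability of ever having driven after drinking."""
    return math.exp(b0 + b1 * x) / (1.0 + math.exp(b0 + b1 * x))

p4 = prob(4)
odds4 = p4 / (1.0 - p4)           # odds at X = 4; note this equals exp(b0 + b1*4)
odds5 = math.exp(b1) * odds4      # scale by the odds ratio to move to X = 5
print(round(p4, 3), round(odds4, 3), round(odds5, 3))
```

The small differences from the hand calculation (0.453 vs 0.454) come only from rounding 0.312 before dividing.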
Notice also that the results give a 95% confidence interval estimate of the odds ratio (1.14 to 1.28).
We now include Gender (male or female) as an x-variable (along with DaysBeer). Some
Minitab results are given below. Under Gender, the row for male is explaining that the
program created an indicator variable with a value of 1 if the student is male and a value of 0
if the student is female.
The p-values are less than 0.05 for both DaysBeer and Gender. This is evidence that
both x-variables are useful for predicting the probability of ever having driven after
drinking.
For DaysBeer, the odds ratio is still estimated to equal 1.21 to two decimal places (calculated as e^{0.18693}).
For Gender, the odds ratio is 1.85 (calculated as e^{0.6172}). For males, the odds of ever having driven after drinking are 1.85 times the odds for females, assuming DaysBeer is held constant.
Finally, the results for testing H_{0}: \beta_{1}=\beta_{2}=0 with respect to the multiple logistic regression model are as follows:
Since we have a p-value of 0.000 for this chi-square test, we reject the null hypothesis that all of the slopes are equal to 0.
A binary logistic regression model is used to describe the connection between the observed probabilities of death as a function of dose level. Since the data are in event/trial format, the procedure in Minitab v17 is a little different from before:
Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model
Select "Response is in event/trial format"
Select "Deaths" for Number of events, "SampSize" for Number of trials (and type
"Death" for Event name if you like)
Select Dose as a Continuous predictor
Click Results and change "Display of results" to "Expanded tables"
Click Storage and select "Fits (event probabilities)"
Thus
\hat{\pi}=\frac{\exp(-2.644+0.674X)}{1+\exp(-2.644+0.674X)}
where X = Dose and \hat{\pi} is the estimated probability the insect dies (based on the
model).
Predicted probabilities of death (based on the logistic model) for the six dose levels are given
below (FITS1). These probabilities closely agree with the observed values (Observed p)
reported.
\hat{\pi}=\frac{\exp(-2.644+0.674(1))}{1+\exp(-2.644+0.674(1))}=0.1224
The odds ratio for Dose is 1.9621, the value under Odds Ratio in the output. It was calculated as e^{0.674}. The interpretation of the odds ratio is that for every increase of 1 unit in dose level, the estimated odds of insect death are multiplied by 1.9621.
A property of the binary logistic regression model is that the odds ratio is the same for any
increase of one unit in X, regardless of the specific values of X.
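This "constant odds ratio" property is easy to demonstrate numerically with the fitted dose coefficients (a sketch; the helper name is for illustration):

```python
import math

b0, b1 = -2.644, 0.674             # fitted intercept and Dose coefficient

def odds(x):
    """Fitted odds of death at dose x: exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

# The ratio odds(x+1)/odds(x) is the same for every x, namely e**b1.
ratios = [odds(x + 1) / odds(x) for x in range(1, 6)]
print([round(r, 4) for r in ratios])
```

The constancy is exact because the log odds is linear in X, so a one-unit step always adds b1 on the log scale.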
A Poisson random variable Y has probability mass function
\begin{equation*} \mbox{P}(Y=y)=\frac{e^{-\lambda}\lambda^{y}}{y!} \end{equation*}
for y=0,1,2,\ldots. Notice that the Poisson distribution is characterized by the single parameter \lambda, which is the mean rate of occurrence for the event being measured. For the Poisson distribution, it is assumed that large counts (with respect to the value of \lambda) are rare.
In Poisson regression the dependent variable (Y) is an observed count that follows the Poisson distribution. The rate \lambda is determined by a set of p-1 predictors \textbf{X}=(X_{1},\ldots,X_{p-1}). The expression relating these quantities is
\begin{equation*} \lambda=\exp\{\textbf{X}\beta\}, \end{equation*}
so that
\begin{equation*} \mbox{P}(Y_{i}=y_{i}|\textbf{X}_{i},\beta)=\frac{e^{-\exp\{\textbf{X}_{i}\beta\}}\exp\{\textbf{X}_{i}\beta\}^{y_{i}}}{y_{i}!}. \end{equation*}
That is, for a given set of predictors, the count outcome follows a Poisson distribution with rate \exp\{\textbf{X}\beta\}. For a sample of size n, the likelihood for a Poisson regression is given by:
\begin{equation*} L(\beta;\textbf{y},\textbf{X})=\prod_{i=1}^{n}\frac{e^{-\exp\{\textbf{X}_{i}
\beta\}}\exp\{\textbf{X}_{i}\beta\}^{y_{i}}}{y_{i}!}. \end{equation*}
This yields the log likelihood:
\begin{equation*} \ell(\beta)=\sum_{i=1}^{n}y_{i}\textbf{X}_{i}\beta-\sum_{i=1}^{n}\exp\{\textbf{X}_{i}\beta\}-\sum_{i=1}^{n}\log(y_{i}!). \end{equation*}
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, \hat{\beta}. Once this value of \hat{\beta} has been obtained, we may proceed to define various goodness-of-fit measures and calculate residuals. The residuals we present serve the same purpose as in linear regression: when plotted versus the response, they help identify suspect data points.
The following gives the analysis of the Poisson regression data in Minitab v17:
Select Stat > Regression > Poisson Regression > Fit Poisson Model.
Select "y" for the Response.
Select "x" as a Continuous predictor.
Click Results and change "Display of results" to "Expanded tables."
Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant 0.308 0.289 (-0.259, 0.875) 1.06 0.287
x 0.0764 0.0173 (0.0424, 0.1103) 4.41 0.000 1.00
Regression Equation
y = exp(Y')
Y' = 0.308 + 0.0764 x
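The fitted equation can be used for prediction directly; a quick Python sketch (the x values here are arbitrary, chosen only to show the exponential growth of the fitted mean):

```python
import math

def fitted_count(x):
    """Fitted Poisson mean: y-hat = exp(0.308 + 0.0764*x)."""
    return math.exp(0.308 + 0.0764 * x)

for x in (5, 10, 20):
    print(x, round(fitted_count(x), 2))
```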
As you can see, the Wald test p-value for x of 0.000 indicates that the predictor is highly
significant.
Deviance Test
Changes in the deviance can be used to test the null hypothesis that any subset of the \beta's
is equal to 0. The deviance, D(\hat{\beta}), is -2 times the difference between the log-
likelihood evaluated at the maximum likelihood estimate and the log-likelihood for a "saturated
model" (a theoretical model with a separate parameter for each observation and thus a
perfect fit). Suppose we test that r < p of the \beta's are equal to 0. Then the deviance test statistic is given by:
\begin{equation*} G^{2}=D(\hat{\beta}^{(0)})-D(\hat{\beta}), \end{equation*}
where D(\hat{\beta}) is the deviance of the fitted (full) model and D(\hat{\beta}^{(0)}) is the
deviance of the model specified by the null hypothesis evaluated at the maximum likelihood
estimate of that reduced model. This test statistic has a \chi^{2} distribution with p-r degrees of freedom. This test procedure is analogous to the general linear F test procedure for multiple linear regression.
To illustrate, the relevant Minitab output from the simulated example is:
Deviance Table
Source      DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    20.47        42.37%    20.47   20.4677       20.47    0.000
  x          1    20.47        42.37%    20.47   20.4677       20.47    0.000
Error       28    27.84        57.63%    27.84    0.9944
Total       29    48.31       100.00%
Since there is only a single predictor for this example, this table simply provides information
on the deviance test for x (p-value of 0.000), which matches the earlier Wald test result (p-
value of 0.000). (Note that the Wald test and deviance test will not in general give identical
results.) The Deviance Table includes the following:
The null model in this case has no predictors, so the fitted values are simply the sample
mean, 4.233. The deviance for the null model is D(\hat{\beta}^{(0)})=48.31, which is
shown in the "Total" row in the Deviance Table.
The deviance for the fitted model is D(\hat{\beta})=27.84, which is shown in the "Error"
row in the Deviance Table.
The deviance test statistic is therefore G^2=48.31-27.84=20.47.
The p-value comes from a \chi^{2} distribution with 2-1=1 degrees of freedom.
Goodness-of-Fit
Overall performance of the fitted model can be measured by two different chi-square tests.
There is the Pearson statistic
\begin{equation*} X^2=\sum_{i=1}^{n}\frac{(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})^{2}}{\exp\
{\textbf{X}_{i}\hat{\beta}\}} \end{equation*}
and the deviance statistic
\begin{equation*} D=2\sum_{i=1}^{n}\biggl[y_{i}\log\biggl(\frac{y_{i}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}\biggr)-(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})\biggr]. \end{equation*}
To illustrate, the relevant Minitab output from the simulated example is:
Goodness-of-Fit Tests
Test DF Estimate Mean Chi-Square P-Value
Deviance 28 27.84209 0.99436 27.84 0.473
Pearson 28 26.09324 0.93190 26.09 0.568
Overdispersion means that the actual covariance matrix for the observed data exceeds that
for the specified model for Y|\textbf{X}. For a Poisson distribution, the mean and the variance
are equal. In practice, the data almost never reflects this fact and we have overdispersion in
the Poisson regression model if (as is often the case) the variance is greater than the mean.
In addition to testing goodness-of-fit, the Pearson statistic can also be used as a test of
overdispersion. Note that overdispersion can also be measured in the logistic regression
models that were discussed earlier.
Pseudo R2
The value of R2 used in linear regression also does not extend to Poisson regression. One
commonly used measure is the pseudo R2, defined as
\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}=1-
\frac{D(\hat{\beta})}{D(\hat{\beta_{0}})}, \end{equation*}
where \ell(\hat{\beta_{0}}) is the log likelihood of the model when only the intercept is
included. The pseudo R2 goes from 0 to 1 with 1 being a perfect fit.
To illustrate, the relevant Minitab output from the simulated example is:
Model Summary
Deviance R-Sq  Deviance R-Sq(adj)     AIC
      42.37%              40.30%  124.50
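The Deviance R-Sq entry can be reproduced from the "Error" and "Total" deviances in the Deviance Table (a two-line check):

```python
dev_fitted, dev_null = 27.84, 48.31    # "Error" and "Total" deviances
pseudo_r2 = 1.0 - dev_fitted / dev_null
print(f"{pseudo_r2:.2%}")              # matches the 42.37% Deviance R-Sq
```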
Raw Residual
The raw residual is the difference between the actual response and the estimated value from the model. Remember that the variance is equal to the mean for a Poisson random variable. Therefore, we expect that the variances of the residuals are unequal. This can lead to difficulties in the interpretation of the raw residuals, yet they are still used. The formula for the raw residual is
\begin{equation*} r_{i}=y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\}. \end{equation*}
Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is
\begin{equation*} p_{i}=\frac{y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\}}{\sqrt{\hat{\phi}\exp\{\textbf{X}_{i}\hat{\beta}\}}}, \end{equation*}
where
\begin{equation*} \hat{\phi}=\frac{1}{n-p}\sum_{i=1}^{n}\frac{(y_{i}-\exp\{\textbf{X}_{i}
\hat{\beta}\})^{2}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}. \end{equation*}
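Numerically, \hat{\phi} is just the Pearson statistic divided by its error degrees of freedom; with the values from the goodness-of-fit table above:

```python
x2 = 26.09324      # Pearson chi-square statistic from the goodness-of-fit table
df = 28            # error degrees of freedom, n - p
phi_hat = x2 / df
print(round(phi_hat, 4))   # close to 1, so little sign of overdispersion here
```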
Deviance Residuals
Deviance residuals are also popular because the sum of squares of these residuals is the
deviance statistic. The formula for the deviance residual is
\begin{equation*} d_{i}=\texttt{sgn}(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})\sqrt{2\biggl\{y_{i}
\log\biggl(\frac{y_{i}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}\biggr)-(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}
\})\biggr\}}. \end{equation*}
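The deviance residual formula above can be coded directly; the only care needed is the y_{i}\log(y_{i}/\hat{\mu}_{i}) term, which is taken to be 0 when y_{i}=0. A sketch:

```python
import numpy as np

def poisson_deviance_residuals(y, mu):
    """Deviance residuals for a Poisson fit; y*log(y/mu) is taken as 0 when y == 0."""
    ratio = np.where(y > 0, y / mu, 1.0)         # avoid log(0) when y == 0
    d2 = 2.0 * (y * np.log(ratio) - (y - mu))    # squared deviance contribution
    return np.sign(y - mu) * np.sqrt(np.maximum(d2, 0.0))
```

Summing the squares of these residuals recovers the deviance statistic.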
The plots below show the Pearson residuals and deviance residuals versus the fitted values
for the simulated example.
These plots appear to be good for a Poisson fit. Further diagnostic plots can also be produced
and model selection techniques can be employed when faced with multiple predictors.
Hat Values
The hat values measure the leverage, or potential influence, of each observation on the overall fit of the model. The hat values, h_{i,i}, are the diagonal entries of the hat matrix
\begin{equation*} H=\textbf{W}^{1/2}\textbf{X}(\textbf{X}^{\textrm{T}}\textbf{W}\textbf{X})^{-1}\textbf{X}^{\textrm{T}}\textbf{W}^{1/2}, \end{equation*}
where \textbf{W} is the diagonal weight matrix from the iteratively reweighted least squares fit.
Studentized Residuals
Finally, we can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by
\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}}. \end{equation*}
To illustrate, the relevant Minitab output from the simulated example is:
The default residuals in this output (set under Minitab's Regression Options) are deviance
residuals, so observation 8 has a deviance residual of 1.974 and a studentized deviance
residual of 2.02, while observation 21 has a leverage (h) of 0.233132.
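The hat values and the Studentized Pearson residuals can be sketched as below. This assumes the standard IRLS weights for Poisson regression with a log link, W = diag(\hat{\mu}); the studentization p_{i}/\sqrt{1-h_{i,i}} is the usual definition:

```python
import numpy as np

def poisson_hat_values(X, beta_hat):
    """Diagonal of H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}, taking W = diag(mu)
    (the IRLS weights for Poisson regression with a log link)."""
    mu = np.exp(X @ beta_hat)
    Xw = np.sqrt(mu)[:, None] * X                  # W^{1/2} X
    H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)      # hat matrix
    return np.diag(H).copy()

def studentized_pearson(pearson, h):
    """Studentized Pearson residuals: p_i / sqrt(1 - h_ii)."""
    return pearson / np.sqrt(1.0 - h)
```

As in linear regression, the hat values sum to the number of estimated parameters, which gives a quick sanity check on the computation.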
The complementary log-log link,
\begin{equation*} g(\mu)=\log(-\log(1-\mu)), \end{equation*}
can also be used in logistic regression. This link function is sometimes called the gompit link.
The power link, g(\mu)=\mu^{\lambda} for \lambda\neq 0, is used in other regressions which we do not explore (such as gamma regression and inverse Gaussian regression).
Also, the variance is typically a function of the mean and is often written as
\begin{equation*} \text{Var}(Y)=\phi V(\mu). \end{equation*}
The random variable Y is assumed to belong to an exponential family distribution where the
density can be expressed in the form
\begin{equation*} q(y;\theta,\phi)=\exp\biggl\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\biggr\},
\end{equation*}
where a(\cdot), b(\cdot), and c(\cdot) are specified functions, \theta is a parameter related to
the mean of the distribution, and \phi is called the dispersion parameter. Many probability
distributions belong to the exponential family. For example, the normal distribution is used for
traditional linear regression, the binomial distribution is used for logistic regression, and the
Poisson distribution is used for Poisson regression. Other exponential family distributions lead
to gamma regression, inverse Gaussian (normal) regression, and negative binomial
regression, just to name a few.
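As a quick check of the exponential family form above, the Poisson distribution used throughout this lesson can be written as

\begin{equation*} q(y;\theta,\phi)=\frac{\lambda^{y}e^{-\lambda}}{y!}=\exp\{y\log\lambda-\lambda-\log y!\}, \end{equation*}

so that \theta=\log\lambda, b(\theta)=e^{\theta}, a(\phi)=1, and c(y,\phi)=-\log y!. The mean is b'(\theta)=e^{\theta}=\lambda, which is why the log link is the natural choice for Poisson regression.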
The unknown parameters, \beta, are typically estimated with maximum likelihood techniques
(in particular, using iteratively reweighted least squares), Bayesian methods, or quasi-
likelihood methods. The quasi-likelihood is a function which possesses similar properties to
the log-likelihood function and is most often used with count or binary data. Specifically, for a
realization y of the random variable Y, it is defined as
\begin{equation*} Q(\mu;y)=\int_{y}^{\mu}\frac{y-t}{\sigma^{2}V(t)}\,dt, \end{equation*}
where \sigma^{2} is a scale parameter and V(\cdot) is the variance function. There are also tests using likelihood ratio statistics for
model development to determine if any predictors may be dropped from the model.
Some examples of nonlinear regression models are:
\begin{align*} y_{i}&=\frac{e^{\beta_{0}+\beta_{1}x_{i}}}{1+e^{\beta_{0}+\beta_{1}x_{i}}}+\epsilon_{i} \\ y_{i}&=\frac{\beta_{0}+\beta_{1}x_{i}}{1+\beta_{2}e^{\beta_{3}x_{i}}}+\epsilon_{i} \\ y_{i}&=\beta_{0}+(0.4-\beta_{0})e^{-\beta_{1}(x_{i}-5)}+\epsilon_{i}. \end{align*}
However, there are some nonlinear models which are actually called intrinsically linear
because they can be made linear in the parameters by a simple transformation. For example, the model
\begin{equation*} Y=\frac{\beta_{0}X}{\beta_{1}+X} \end{equation*}
can be rewritten as
\begin{align*} \frac{1}{Y}&=\frac{1}{\beta_{0}}+\frac{\beta_{1}}{\beta_{0}}\frac{1}{X}\\
&=\theta_{0}+\theta_{1}\frac{1}{X}, \end{align*}
which is linear in the transformed parameters \theta_{0} and \theta_{1}. In such cases,
transforming a model to its linear form often provides better inference procedures and
confidence intervals, but one must be cognizant of the effects that the transformation has on
the distribution of the errors.
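The reciprocal transformation above can be carried out numerically. A sketch, assuming the model Y = \beta_{0}X/(\beta_{1}+X) and hypothetical, noise-free data, fits a straight line to 1/Y versus 1/X and back-transforms the coefficients:

```python
import numpy as np

# Hypothetical noise-free data generated from Y = b0 * X / (b1 + X)
b0_true, b1_true = 10.0, 2.0
X = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
Y = b0_true * X / (b1_true + X)

# Linear fit of 1/Y on 1/X: intercept = 1/b0, slope = b1/b0
slope, intercept = np.polyfit(1.0 / X, 1.0 / Y, 1)
b0_hat = 1.0 / intercept
b1_hat = slope * b0_hat
```

With real (noisy) data the transformation also transforms the errors, so the back-transformed estimates need not coincide with the nonlinear least squares estimates, as the text cautions.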
Returning to cases in which it is not possible to transform the model to a linear form, consider the setting
\begin{equation*} y_{i}=f(\textbf{X}_{i},\beta)+\epsilon_{i}, \end{equation*}
where the \epsilon_{i} are iid normal with mean 0 and constant variance \sigma^{2}. For this setting, we can rely on some of the least squares theory we have developed over the course.
setting, we can rely on some of the least squares theory we have developed over the course.
For other nonnormal error terms, different techniques need to be employed.
First, let
\begin{equation*} Q=\sum_{i=1}^{n}(y_{i}-f(\textbf{X}_{i},\beta))^{2}. \end{equation*}
In order to find
\begin{equation*} \hat{\beta}=\arg\min_{\beta}Q, \end{equation*}
we first find each of the partial derivatives of Q with respect to \beta_{j}. Then, we set each of the partial derivatives equal to 0 and the parameters \beta_{k} are each replaced by
\hat{\beta}_{k}. The functions to be solved are nonlinear in the parameter estimates
\hat{\beta}_{k} and are often difficult to solve, even in the simplest cases. Hence, iterative
numerical methods are often employed. Even more difficulty arises in that multiple solutions
may be possible!
Newton's method, a classical method based on a gradient approach but which can be
computationally challenging and heavily dependent on good starting values.
The Gauss-Newton algorithm, a modification of Newton's method that gives a good
approximation of the solution that Newton's method should have arrived at but which is
not guaranteed to converge.
The Levenberg-Marquardt method, which can take care of computational difficulties
arising with the other methods but can require a tedious search for the optimal value of
a tuning parameter.
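The Gauss-Newton idea from the list above can be sketched in a few lines: linearize f around the current estimate, solve the resulting least squares problem for an update, and repeat. This is a minimal illustration with a numerical Jacobian and no step control, so (as noted above) convergence depends heavily on good starting values:

```python
import numpy as np

def gauss_newton(f, x, y, beta0, tol=1e-10, max_iter=100):
    """Minimal Gauss-Newton sketch: beta <- beta + (J'J)^{-1} J'r,
    with a forward-difference Jacobian and no step-size control."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(x, beta)                         # current residuals
        J = np.empty((len(x), len(beta)))
        for j in range(len(beta)):                 # numerical Jacobian of f
            step = np.zeros_like(beta)
            step[j] = 1e-6
            J[:, j] = (f(x, beta + step) - f(x, beta)) / 1e-6
        delta = np.linalg.solve(J.T @ J, J.T @ r)  # Gauss-Newton update
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:
            break
    return beta

# Noise-free exponential example: y = 5 * exp(-0.3 x), started nearby
x = np.linspace(0.0, 10.0, 20)
y = 5.0 * np.exp(-0.3 * x)
beta_hat = gauss_newton(lambda x, b: b[0] * np.exp(b[1] * x), x, y, [4.0, -0.2])
```

Production implementations (such as the Levenberg-Marquardt method mentioned above) add damping to the update to handle the computational difficulties this bare version can run into.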
\begin{equation*} y_{i}=\beta_{0}+\beta_{1}\exp(\beta_{2}x_{i,1}+\ldots+\beta_{p+1}x_{i,p})+\epsilon_{i}, \end{equation*}
where the \epsilon_{i} are iid normal with mean 0 and constant variance \sigma^{2}. Notice
that if \beta_{0}=0, then the above is intrinsically linear by taking the natural logarithm of both
sides.
To illustrate, consider the example on long-term recovery after discharge from hospital [5] from
page 514 of Applied Linear Regression Models (4th ed) by Kutner, Nachtsheim, and Neter.
The response variable, Y, is the prognostic index for long-term recovery and the predictor
variable, X, is the number of days of hospitalization. The proposed model is the two-parameter exponential model:
\begin{equation*} y_{i}=\theta_{0}\exp(\theta_{1}x_{i})+\epsilon_{i}. \end{equation*}
We'll use Minitab's nonlinear regression routine to apply the Gauss-Newton algorithm to
estimate \theta_0 and \theta_1. Before we do this, however, we have to find initial values for
\theta_0 and \theta_1. One way to do this is to note that we can linearize the response function by taking the natural logarithm:
\begin{equation*} \log(\theta_{0}\exp(\theta_{1}X))=\log(\theta_{0})+\theta_{1}X. \end{equation*}
Thus we can fit a simple linear regression model with response, \log(Y), and predictor, X, and
the intercept (4.0372) gives us an estimate of \log(\theta_{0}) while the slope (-0.03797) gives
us an estimate of \theta_{1}. (We then calculate \exp(4.0372)=56.7 to estimate \theta_0.)
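The back-transformation of the linearized fit is simple arithmetic; using the intercept and slope quoted above:

```python
import math

# Intercept and slope from the linearized (log-scale) fit quoted in the text
intercept, slope = 4.0372, -0.03797

theta0_start = math.exp(intercept)   # roughly 56.7
theta1_start = slope                 # roughly -0.038
```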
Select Stat > Regression > Nonlinear Regression, select prog for the
response, and click "Use Catalog" under "Expectation Function."
Select the "Exponential" function with 1 predictor and 2 parameters in the Catalog dialog
box and click OK to go to the "Choose Predictors" dialog.
Select days to be the "Actual predictor" and click OK to go back to the Catalog dialog
box, where you should see "Theta1 * exp( Theta2 * days )" in the "Expectation
Function" box.
Click "Parameters" and type "56.7" next to "Theta1" and "-0.038" next to "Theta2" and
click OK to go back to the Nonlinear Regression dialog box.
Click "Options" to confirm that Minitab will use the Gauss-Newton algorithm (the other
choice is Levenberg-Marquardt) and click OK to go back to the Nonlinear Regression
dialog box.
Click "Graphs" to confirm that Minitab will produce a plot of the fitted curve with data and
click OK to go back to the Nonlinear Regression dialog box.
Click OK to obtain the following output:
Method
Algorithm Gauss-Newton
Max iterations 200
Tolerance 0.00001
Equation
prog = 58.6066 * exp(-0.0395865 * days)
Parameter Estimates
Parameter Estimate SE Estimate
Theta1 58.6066 1.47216
Theta2 -0.0396 0.00171
Summary
Iterations 5
Final SSE 49.4593
DFE 13
MSE 3.80456
S 1.95053
\begin{equation*} y_{i}=\frac{\beta_{1}}
{1+\exp(\beta_{2}+\beta_{3}x_{i})}+\epsilon_{i},
\end{equation*}
Year    Population (millions)
1790 3.929
1800 5.308
1810 7.240
1820 9.638
1830 12.866
1840 17.069
1850 23.192
1860 31.443
1870 39.818
1880 50.156
1890 62.948
1900 75.995
1910 91.972
1920 105.711
1930 122.775
1940 131.669
1950 150.697
1960 179.323
1970 203.302
1980 226.542
1990 248.710
The data are graphed (see below) and the line represents the fit of the logistic population
growth model.
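A fit like the one in the plot can be reproduced outside Minitab; a sketch using SciPy's general-purpose nonlinear least squares routine, with the starting values (350, 4.5, -0.3) that the text derives next:

```python
import numpy as np
from scipy.optimize import curve_fit

# Decennial U.S. census population (millions), 1790-1990, from the table above
pop = np.array([3.929, 5.308, 7.240, 9.638, 12.866, 17.069, 23.192,
                31.443, 39.818, 50.156, 62.948, 75.995, 91.972, 105.711,
                122.775, 131.669, 150.697, 179.323, 203.302, 226.542, 248.710])
x = np.arange(len(pop), dtype=float)   # decades since 1790

def logistic(x, b1, b2, b3):
    """Logistic growth curve: beta1 / (1 + exp(beta2 + beta3 * x))."""
    return b1 / (1.0 + np.exp(b2 + b3 * x))

beta, _ = curve_fit(logistic, x, pop, p0=[350.0, 4.5, -0.3])
```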
To fit the logistic model to the U. S. Census data, we need starting values for the parameters.
It is often important in nonlinear least squares estimation to choose reasonable starting
values, which generally requires some insight into the structure of the model. We know that
\beta_{1} represents asymptotic population. The data in the plot above show that in 1990 the
U. S. population stood at about 250 million and did not appear to be close to an asymptote; so
as not to extrapolate too far beyond the data, let us set the starting value of \beta_{1} to 350.
It is convenient to scale time so that x_{1}=0 in 1790, and so that the unit of time is 10 years.
Then substituting \beta_{1}=350 and x=0 into the model, using the value y_{1}=3.929 from the data, and assuming that the error is 0, we have
\begin{equation*} 3.929=\frac{350}{1+e^{\beta_{2}}}. \end{equation*}
Solving for \beta_{2} gives us a plausible start value for this parameter:
\begin{equation*} \beta_{2}=\log\biggl(\frac{350}{3.929}-1\biggr)\approx 4.5. \end{equation*}
Finally, returning to the data, at time x = 1 (i.e., at the second Census performed in 1800), the
population was y_{2}=5.308. Using this value, along with the previously determined start
values for \beta_{1} and \beta_{2}, and again setting the error to 0, we have
\begin{equation*} 5.308=\frac{350}{1+e^{4.5+\beta_{3}}}, \end{equation*}
which gives the start value
\begin{equation*} \beta_{3}=\log\biggl(\frac{350}{5.308}-1\biggr)-4.5\approx -0.3. \end{equation*}
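The starting-value arithmetic described above is easy to verify directly:

```python
import math

# beta2 from the 1790 value: 3.929 = 350 / (1 + exp(beta2))
beta2 = math.log(350.0 / 3.929 - 1.0)           # about 4.48, rounded to 4.5

# beta3 from the 1800 value: 5.308 = 350 / (1 + exp(beta2 + beta3))
beta3 = math.log(350.0 / 5.308 - 1.0) - beta2   # about -0.30
```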
So now we have starting values for the nonlinear least squares algorithm that we use. Below is the output from fitting the model in Minitab v17 using Gauss-Newton:
Click "Parameters" and type in the values specified above (350, 4.5, and -0.3).
As you can see, the starting values resulted in convergence with values not too far from our
guess.
As you can see, the logistic functional form that we chose captures the gross characteristics of these data, but some of the finer features are not as well characterized. Since there are indications of some cyclical behavior, a model incorporating correlated errors or, perhaps, trigonometric functions could be investigated.
Links:
[1] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt
[2] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/DiseaseOutbreak.txt
[3] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/toxicity.txt
[4] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/poisson_simulated.txt
[5] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/recovery.txt
[6] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/us_census.txt