
Published on STAT 501 (https://onlinecourses.science.psu.edu/stat501)



Lesson 15: Logistic, Poisson & Nonlinear Regression
Introduction

Multiple linear regression can be generalized to handle a response variable that is categorical
or a count variable. This lesson covers the basics of such models, specifically logistic and
Poisson regression, including model fitting and inference.

Multiple linear regression, logistic regression, and Poisson regression are examples of
generalized linear models, which this lesson introduces briefly.

The lesson concludes with some examples of nonlinear regression, specifically exponential
regression and population growth models.

Learning objectives and outcomes

Apply logistic regression techniques to datasets with a binary response variable.
Apply Poisson regression techniques to datasets with a count response variable.
Understand the basics of fitting and inference for nonlinear regression methods when the regression function acting on the predictors is not linear in the parameters.

15.1 - Logistic Regression


Logistic regression models a relationship between predictor variables and a categorical
response variable. For example, we could use logistic regression to model the relationship
between various measurements of a manufactured specimen (such as dimensions and
chemical composition) to predict if a crack greater than 10 mils will occur (a binary variable:
either yes or no). Logistic regression helps us estimate a probability of falling into a certain
level of the categorical response given a set of predictors. We can choose from three types of
logistic regression, depending on the nature of the categorical response variable:

Binary Logistic Regression:

Used when the response is binary (i.e., it has two possible outcomes). The cracking
example given above would utilize binary logistic regression. Other examples of binary
responses could include passing or failing a test, responding yes or no on a survey, and
having high or low blood pressure.

Nominal Logistic Regression:

Used when there are three or more categories with no natural ordering to the levels.
Examples of nominal responses could include departments at a business (e.g.,
marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color
(black, red, blue, orange).

Ordinal Logistic Regression:

Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal.
Examples of ordinal responses could be how students rate the effectiveness of a college
course on a scale of 1-5, levels of flavors for hot wings, and medical condition (e.g., good,
stable, serious, critical).

Particular issues with modelling a categorical response variable include nonnormal error
terms, nonconstant error variance, and constraints on the response function (i.e., the
response is bounded between 0 and 1). We will investigate ways of dealing with these in the
binary logistic regression setting here. There is some discussion of the nominal and ordinal
logistic regression settings in Section 15.2.

The multiple binary logistic regression model is the following:

\begin{equation*} \pi=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1})}, \end{equation*}

where \pi here denotes a probability and not the irrational number 3.14....

\pi is the probability that an observation is in a specified category of the binary Y variable, generally called the "success probability."
Notice that the model describes the probability of an event happening as a function of X variables. For instance, it might provide estimates of the probability that an older person has heart disease.
With the logistic model, estimates of \pi from equations like the one above will always be between 0 and 1. The reasons are:
The numerator \exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1}) must be positive, because it is a power of a positive value (e).
The denominator of the model is (1 + numerator), so the answer will always be less than 1.
With one X variable, the theoretical model for \pi has an elongated "S" shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this "S" shape if the range of the X variable is limited.

For a sample of size n, the likelihood for a binary logistic regression is given by:

\begin{equation*} L(\beta;\textbf{y},\textbf{X})=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}. \end{equation*}

This yields the log likelihood:

\begin{equation*} \ell(\beta)=\sum_{i=1}^{n}\left[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})\right]. \end{equation*}

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, \hat{\beta}.

To illustrate, consider data published on n = 27 leukemia patients. The data (leukemia_remission.txt [1]) has a response variable of whether leukemia remission occurred (REMISS), which is given by a 1.

The predictor variables are cellularity of the marrow clot section (CELL), smear differential percentage of blasts (SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL), percentage labeling index of the bone marrow leukemia cells (LI), absolute number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP).

The following gives the estimated logistic regression equation and associated significance
tests from Minitab v17:

Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.
Select "REMISS" for the Response (the response event for remission is 1 for this data).
Select all the predictors as Continuous predictors.
Click Options and choose Deviance or Pearson residuals for diagnostic plots.
Click Graphs and select "Residuals versus order."
Click Results and change "Display of results" to "Expanded tables."
Click Storage and select "Coefficients."

This results in the following output:

Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant 64.3 75.0 ( -82.7, 211.2) 0.86 0.391
CELL 30.8 52.1 ( -71.4, 133.0) 0.59 0.554 62.46
SMEAR 24.7 61.5 ( -95.9, 145.3) 0.40 0.688 434.42
INFIL -25.0 65.3 (-152.9, 103.0) -0.38 0.702 471.10
LI 4.36 2.66 ( -0.85, 9.57) 1.64 0.101 4.43
BLAST -0.01 2.27 ( -4.45, 4.43) -0.01 0.996 4.18
TEMP -100.2 77.8 (-252.6, 52.2) -1.29 0.198 3.01

Wald Test

The Wald test is the test of significance for individual regression coefficients in logistic
regression (recall that we use t-tests in linear regression). For maximum likelihood estimates, the ratio

\begin{equation*} Z=\frac{\hat{\beta}_{j}}{\textrm{se}(\hat{\beta}_{j})} \end{equation*}

can be used to test H_{0}\colon\beta_{j}=0. The standard normal curve is used to determine the p-value of the test. Furthermore, confidence intervals can be constructed as

\begin{equation*} \hat{\beta}_{j}\pm z_{1-\alpha/2}\,\textrm{se}(\hat{\beta}_{j}). \end{equation*}

Estimates of the regression coefficients, \hat{\beta}_{j}, are given in the Minitab output Coefficients table in the column labeled "Coef." This table also gives coefficient p-values based on Wald tests.
The index of the bone marrow leukemia cells (LI) has the smallest p-value and so appears to
be closest to a significant predictor of remission occurring. After looking at various subsets of
the data, we find that a good model is one which only includes the labeling index as a
predictor:

Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant -3.78 1.38 (-6.48, -1.08) -2.74 0.006
LI 2.90 1.19 ( 0.57, 5.22) 2.44 0.015 1.00

Regression Equation
P(1) = exp(Y')/(1 + exp(Y'))
Y' = -3.78 + 2.90 LI
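
For readers who want to reproduce this fit outside of Minitab, here is a minimal Python sketch using statsmodels; it assumes the leukemia_remission.txt file [1] has been saved locally with whitespace-separated columns, including REMISS and LI.

import pandas as pd
import statsmodels.api as sm

# Assumed local copy of leukemia_remission.txt [1] with columns REMISS and LI
data = pd.read_csv("leukemia_remission.txt", sep=r"\s+")
X = sm.add_constant(data[["LI"]])            # intercept plus the labeling index
fit = sm.Logit(data["REMISS"], X).fit()      # maximum likelihood fit (Newton/IRLS)
print(fit.summary())                         # coefficients near -3.78 and 2.90 with Wald z-tests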

Since we only have a single predictor in this model we can create a Binary Fitted Line Plot to visualize the sigmoidal shape of the fitted logistic regression curve.

Odds, Log Odds, and Odds Ratio

There are algebraically equivalent ways to write the logistic regression model:
The first is

\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1}), \end{equation*}

which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event is \pi / (1 - \pi) such that \pi is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then his odds are 0.80 / (1 - 0.80) = 4, or 4:1.

The second is

\begin{equation*} \log\biggl(\frac{\pi}{1-\pi}\biggr)=\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p-1}X_{p-1}, \end{equation*}

which states that the (natural) logarithm of the odds is a linear function of the X variables (and is often called the log odds). This is also referred to as the logit transformation of the probability of success, \pi.

The odds ratio (which we will write as \theta) between the odds for two sets of predictors (say \textbf{X}_{(1)} and \textbf{X}_{(2)}) is given by

\begin{equation*} \theta=\frac{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}

For binary logistic regression, the odds of success are:

\begin{equation*} \frac{\pi}{1-\pi}=\exp(\textbf{X}\beta). \end{equation*}

By plugging this into the formula for \theta above and setting \textbf{X}_{(1)} equal to \textbf{X}_{(2)} except in one position (i.e., only one predictor differs by one unit), we can determine the relationship between that predictor and the response. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for higher levels of a continuous predictor (or for the indicated level of a factor). In particular, the odds increase multiplicatively by \exp(\beta_{j}) for every one-unit increase in X_{j}. If the odds ratio is less than 1, then the odds of success are less for higher levels of a continuous predictor (or for the indicated level of a factor). Values farther from 1 represent stronger degrees of association.

For example, when there is just a single predictor, X, the odds of success are:

\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X). \end{equation*}

If we increase X by one unit, the odds ratio is

\begin{equation*} \theta=\frac{\exp(\beta_{0}+\beta_{1}(X+1))}{\exp(\beta_{0}+\beta_{1}X)}=\exp(\beta_{1}). \end{equation*}
To illustrate, the relevant Minitab output from the leukemia example is:

Odds Ratios for Continuous Predictors


Odds Ratio 95% CI
LI 18.1245 (1.7703, 185.5617)

The odds ratio for LI of 18.1245 is calculated as \exp(2.89726) (you can view more decimal places for the coefficient estimates in Minitab by clicking "Storage" in the Regression Dialog and selecting "Coefficients"). The 95% confidence interval is calculated as \exp(2.89726\pm z_{0.975}\,\textrm{se}(\hat{\beta}_{1})), where z_{0.975}=1.960 is the 97.5^{\textrm{th}} percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia remission are multiplied by 18.1245. However, since the LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by \exp(0.1\times 2.89726)=1.34. Then

At LI=0.9, the estimated odds of leukemia remission is \exp(-3.78+2.90(0.9))\approx 0.310.
At LI=0.8, the estimated odds of leukemia remission is \exp(-3.78+2.90(0.8))\approx 0.232.
The resulting odds ratio is 0.310/0.232\approx 1.34, which is the ratio of the odds of remission when LI=0.9 compared to the odds when LI=0.8.

Notice that 1.34\approx\exp(0.1\times 2.90), which demonstrates the multiplicative effect of \exp(0.1\hat{\beta}_{1}) on the odds.
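
The same arithmetic can be checked with a few lines of Python; the slope 2.89726 is the expanded-precision coefficient implied by the reported odds ratio, and the intercept is the rounded table value.

import numpy as np

b0, b1 = -3.78, 2.89726          # intercept (table value) and LI slope with extra decimals

def odds(li):
    return np.exp(b0 + b1 * li)  # estimated odds of remission at a given LI

print(np.exp(b1))                                    # odds ratio per 1-unit increase in LI: ~18.12
print(np.exp(0.1 * b1))                              # odds ratio per 0.1-unit increase: ~1.34
print(odds(0.9), odds(0.8), odds(0.9) / odds(0.8))   # ~0.31, ~0.23, ~1.34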

Likelihood Ratio (or Deviance) Test

The likelihood ratio test is used to test the null hypothesis that any subset of the \beta's is equal to 0. The number of \beta's in the full model is p, while the number of \beta's in the reduced model is r. (Remember the reduced model is the model that results when the \beta's in the null hypothesis are set to 0.) Thus, the number of \beta's being tested in the null hypothesis is p-r. Then the likelihood ratio test statistic is given by:

\begin{equation*} G^{2}=-2\left[\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})\right], \end{equation*}

where \ell(\hat{\beta}) is the log likelihood of the fitted (full) model and \ell(\hat{\beta}^{(0)}) is the log likelihood of the (reduced) model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a \chi^{2} distribution with p-r degrees of freedom. Minitab presents results for this test in terms of "deviance," which is defined as -2 times log-likelihood. The notation used for the test statistic is typically G^{2} = deviance (reduced) – deviance (full).

This test procedure is analogous to the general linear F test procedure for multiple linear regression. However, note that when testing a single coefficient, the Wald test and likelihood ratio test will not in general give identical results.

To illustrate, the relevant software output from the leukemia example is:

Deviance Table
Source  DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression  1  8.299  8.299  8.30  0.004
LI  1  8.299  8.299  8.30  0.004
Error  25  26.073  1.043
Total  26  34.372

Since there is only a single predictor for this example, this table simply provides information
on the likelihood ratio test for LI (p-value of 0.004), which is similar but not identical to the
earlier Wald test result (p-value of 0.015). The Deviance Table includes the following:

The null (reduced) model in this case has no predictors, so the fitted probabilities are simply the sample proportion of successes, 9/27 = 0.333. The log-likelihood for the null model is \ell(\hat{\beta}^{(0)})=-17.186, so the deviance for the null model is -2\times(-17.186)=34.372, which is shown in the "Total" row in the Deviance Table.
The log-likelihood for the fitted (full) model is \ell(\hat{\beta})=-13.0365, so the deviance for the fitted model is -2\times(-13.0365)=26.073, which is shown in the "Error" row in the Deviance Table.
The likelihood ratio test statistic is therefore G^{2}=-2\left[-17.186-(-13.0365)\right]=8.299, which is the same as 34.372-26.073=8.299.
The p-value comes from a \chi^{2} distribution with 2-1=1 degrees of freedom, as the short sketch below verifies.
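
A quick way to verify this arithmetic is to compare the change in deviance to a chi-square distribution directly, for example with SciPy:

from scipy.stats import chi2

dev_null, dev_full = 34.372, 26.073   # "Total" and "Error" deviances from the table
G2 = dev_null - dev_full              # likelihood ratio test statistic, 8.299
print(G2, chi2.sf(G2, df=1))          # p-value ~0.004, matching the Deviance Table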

When using the likelihood ratio (or deviance) test for more than one regression coefficient, we
can first fit the "full" model to find deviance (full), which is shown in the "Error" row in the
resulting full model Deviance Table. Then fit the "reduced" model (corresponding to the model
that results if the null hypothesis is true) to find deviance (reduced), which is shown in the
"Error" row in the resulting reduced model Deviance Table. For example, the relevant
Deviance Tables for the Disease Outbreak [2] example on pages 581-582 of Applied Linear
Regression Models (4th ed) by Kutner et al are:

Full model:

Source DF Adj Dev Adj Mean Chi-Square P-Value
Regression 9 28.322 3.14686 28.32 0.001
Error 88 93.996 1.06813
Total 97 122.318

Reduced model:

Source DF Adj Dev Adj Mean Chi-Square P-Value
Regression 4 21.263 5.3159 21.26 0.000
Error 93 101.054 1.0866
Total 97 122.318

Here the full model includes four single-factor predictor terms and five two-factor interaction
terms, while the reduced model excludes the interaction terms. The test statistic for testing the interaction terms is G^{2}=101.054-93.996=7.058, which is compared to a chi-square distribution with 93-88=5 degrees of freedom to find the p-value > 0.05 (meaning the interaction terms are not significant).
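
The p-value for this comparison can be computed directly from the two Error deviances, for example:

from scipy.stats import chi2

G2 = 101.054 - 93.996                 # deviance(reduced) - deviance(full) = 7.058
print(G2, chi2.sf(G2, df=93 - 88))    # 5 df; p ~0.22 > 0.05, so the interactions are not significant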

Alternatively, select the corresponding predictor terms last in the full model and request
Minitab to output Sequential (Type I) Deviances. Then add the corresponding Sequential
Deviances in the resulting Deviance Table to calculate G^{2}. For example, the relevant Deviance Table for the Disease Outbreak [2] example is:

Source DF Seq Dev Seq Mean Chi-Square P-Value
Regression 9 28.322 3.1469 28.32 0.001
Age 1 7.405 7.4050 7.40 0.007
Middle 1 1.804 1.8040 1.80 0.179
Lower 1 1.606 1.6064 1.61 0.205
Sector 1 10.448 10.4481 10.45 0.001
Age*Middle 1 4.570 4.5697 4.57 0.033
Age*Lower 1 1.015 1.0152 1.02 0.314
Age*Sector 1 1.120 1.1202 1.12 0.290
Middle*Sector 1 0.000 0.0001 0.00 0.993
Lower*Sector 1 0.353 0.3531 0.35 0.552
Error 88 93.996 1.0681
Total 97 122.318

The test statistic for testing the interaction terms is G^{2}=4.570+1.015+1.120+0.000+0.353=7.058, the same as in the first calculation.

Goodness-of-Fit Tests

Overall performance of the fitted model can be measured by several different goodness-of-fit
tests. Two tests that require replicated data (multiple observations with the same values for all
the predictors) are the Pearson chi-square goodness-of-fit test and the deviance
goodness-of-fit test (analogous to the multiple linear regression lack-of-fit F-test). Both of
these tests have statistics that are approximately chi-square distributed with c - p degrees of
freedom, where c is the number of distinct combinations of the predictor variables. When a
test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of
lack of fit.

By contrast, the Hosmer-Lemeshow goodness-of-fit test is useful for unreplicated datasets or for datasets that contain just a few replicated observations. For this test the observations are grouped based on their estimated probabilities. The resulting test statistic is approximately chi-square distributed with c - 2 degrees of freedom, where c is the number of groups (generally chosen to be between 5 and 10, depending on the sample size).

To illustrate, the relevant Minitab output from the leukemia example is:

Goodness-of-Fit Tests
Test DF Chi-Square P-Value
Deviance 25 26.07 0.404
Pearson 25 23.93 0.523
Hosmer-Lemeshow 7 6.87 0.442

Since there is no replicated data for this example, the deviance and Pearson goodness-of-fit
tests are invalid, so the first two rows of this table should be ignored. However, the Hosmer-
Lemeshow test does not require replicated data so we can interpret its high p-value as
indicating no evidence of lack-of-fit.

R2

The calculation of R2 used in linear regression does not extend directly to logistic regression.

One version of R2 used in logistic regression is defined as

\begin{equation*} R^{2}=\frac{\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})}{\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta}_{S})}, \end{equation*}

where \ell(\hat{\beta}^{(0)}) is the log likelihood of the model when only the intercept is included and \ell(\hat{\beta}_{S}) is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R2 does go from 0 to 1 with 1 being a perfect fit. With unreplicated data, \ell(\hat{\beta}_{S})=0, so the formula simplifies to:

\begin{equation*} R^{2}=\frac{\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})}{\ell(\hat{\beta}^{(0)})}. \end{equation*}
To illustrate, the relevant Minitab output from the leukemia example is:

Model Summary
Deviance Deviance
R-Sq R-Sq(adj) AIC
24.14% 21.23% 30.07

Recall from above that \ell(\hat{\beta}^{(0)})=-17.186 and \ell(\hat{\beta})=-13.0365, so:

\begin{equation*} R^{2}=\frac{-17.186-(-13.0365)}{-17.186}=0.2414. \end{equation*}

Note that we can obtain the same result by simply using deviances instead of log-likelihoods since the factor -2 cancels out:

\begin{equation*} R^{2}=1-\frac{26.073}{34.372}=0.2414. \end{equation*}
Raw Residual

The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is

\begin{equation*} r_{i}=y_{i}-\hat{\pi}_{i}. \end{equation*}
Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is

\begin{equation*} p_{i}=\frac{r_{i}}{\sqrt{\hat{\pi}_{i}(1-\hat{\pi}_{i})}}. \end{equation*}
Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is

\begin{equation*} d_{i}=\texttt{sgn}(y_{i}-\hat{\pi}_{i})\sqrt{-2\left[y_{i}\log(\hat{\pi}_{i})+(1-y_{i})\log(1-\hat{\pi}_{i})\right]}. \end{equation*}
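
As a sketch of how these three residual types relate, the following assumes NumPy arrays y (0/1 responses) and pi_hat (fitted probabilities) from any binary logistic fit:

import numpy as np

def logistic_residuals(y, pi_hat):
    raw = y - pi_hat                                 # raw residuals
    pearson = raw / np.sqrt(pi_hat * (1 - pi_hat))   # divide by the estimated standard deviation
    # deviance residuals carry the sign of the raw residual
    dev = np.sign(raw) * np.sqrt(-2 * (y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat)))
    return raw, pearson, dev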


Here are the plots of the Pearson residuals and deviance residuals for the leukemia example.
There are no alarming patterns in these plots to suggest a major problem with the model.

Hat Values

The hat matrix serves a similar purpose as in the case of linear regression – to measure the
influence of each observation on the overall fit of the model – but the interpretation is not as
clear due to its more complicated form. The hat values (leverages) are given by the diagonal entries of

\begin{equation*} H=\textbf{W}^{1/2}\textbf{X}(\textbf{X}^{T}\textbf{W}\textbf{X})^{-1}\textbf{X}^{T}\textbf{W}^{1/2}, \end{equation*}

where W is an n\times n diagonal matrix with the values of \hat{\pi}_{i}(1-\hat{\pi}_{i}) for i=1,\ldots,n on the diagonal. As before, we should investigate any observations with h_{i,i}>3p/n or, failing this, any observations with h_{i,i}>2p/n and very isolated.

Studentized Residuals

We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by

\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}} \end{equation*}

and the Studentized deviance residuals are given by

\begin{equation*} sd_{i}=\frac{d_{i}}{\sqrt{1-h_{i,i}}}. \end{equation*}
C and \bar{C}

C and \bar{C} are extensions of Cook's distance for logistic regression. C measures the overall change in fitted logits due to deleting the i^{\textrm{th}} observation for all points including the one deleted, while \bar{C} excludes the deleted point. They are defined by:

\begin{equation*} C_{i}=\frac{p_{i}^{2}h_{i,i}}{p(1-h_{i,i})^{2}} \end{equation*}

and

\begin{equation*} \bar{C}_{i}=\frac{p_{i}^{2}h_{i,i}}{p(1-h_{i,i})}. \end{equation*}
To illustrate, the relevant Minitab output from the leukemia example is:

Fits and Diagnostics for Unusual Observations

     Observed
Obs  Probability  Fit    SE Fit  95% CI          Resid   Std Resid  Del Resid  HI
8    0.000        0.849  0.139   (0.403, 0.979)  -1.945  -2.11      -2.19      0.149840

Obs  Cook's D  DFITS
8    0.58      -1.08011  R

R  Large residual

The default residuals in this output (set under Minitab's Regression Options) are deviance residuals, so observation 8 has a deviance residual of -1.945, a studentized deviance residual of -2.11, a leverage (h) of 0.149840, and a Cook's distance (C) of 0.58.

DFDEV and DFCHI

DFDEV and DFCHI are statistics that measure the change in deviance and in Pearson's chi-square, respectively, that occurs when an observation is deleted from the data set. Large values of these statistics indicate observations that have not been fitted well. The formulas for these statistics are

\begin{equation*} \textrm{DFDEV}_{i}=d_{i}^{2}+\frac{p_{i}^{2}h_{i,i}}{1-h_{i,i}} \end{equation*}

and

\begin{equation*} \textrm{DFCHI}_{i}=\frac{p_{i}^{2}}{1-h_{i,i}}. \end{equation*}
15.2 - Polytomous Regression


Note that the material in this section is more technical than preceding Lessons. It is
offered as an introduction to more advanced topics and, given the technical nature of
the material, it could be considered optional in the context of this course.

In binary logistic regression, we only had two possible outcomes. For polytomous logistic
regression, we will consider the possibility of having k > 2 possible outcomes. (Note: The
word polychotomous is sometimes used, but note that this is not actually a word!)

Nominal Logistic Regression

The multiple nominal logistic regression model (sometimes called the multinomial logistic regression model) is given by the following:

\begin{equation*} \pi_{j}=\frac{\exp(\textbf{X}\beta_{j})}{\sum_{l=1}^{k}\exp(\textbf{X}\beta_{l})},\quad j=1,\ldots,k, \end{equation*}

where \pi_{j} again denotes a probability and not the irrational number. Notice that k - 1 of the groups have their own set of \beta values. Furthermore, since \sum_{j=1}^{k}\pi_{j}=1, we set the \beta values for group 1 to be 0 (this is what we call the reference group). Notice that when k = 2, we are back to binary logistic regression.

\pi_{j} is the probability that an observation is in one of k categories. The likelihood for the nominal logistic regression model is given by:

\begin{equation*} L(\beta;\textbf{y},\textbf{X})=\prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}(1-\pi_{i,j})^{1-y_{i,j}}, \end{equation*}

where the subscript (i, j) means the i^{\textrm{th}} observation belongs to the j^{\textrm{th}} group. This yields the log likelihood:

\begin{equation*} \ell(\beta)=\sum_{i=1}^{n}\sum_{j=1}^{k}\left[y_{i,j}\log(\pi_{i,j})+(1-y_{i,j})\log(1-\pi_{i,j})\right]. \end{equation*}

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like
iteratively reweighted least squares is used to find an estimate of the regression coefficients, \hat{\beta}.

An odds ratio (\theta) of 1 serves as the baseline for comparison. If \theta=1, then there is no association between the response and predictor. If \theta>1, then the odds of success are higher for the indicated level of the factor (or for higher levels of a continuous predictor). If \theta<1, then the odds of success are less for the indicated level of the factor (or for higher levels of a continuous predictor). Values farther from 1 represent stronger degrees of association. For nominal logistic regression, the odds of success (at two different levels of the predictors, say \textbf{X}_{(1)} and \textbf{X}_{(2)}) are compared through the odds ratio:

\begin{equation*} \theta=\frac{(\pi_{j}/(1-\pi_{j}))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi_{j}/(1-\pi_{j}))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}

Many of the procedures discussed in binary logistic regression can be extended to nominal
logistic regression with the appropriate modifications.
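
As an illustration of the model structure only (the course data are not used here, and the predictors and labels below are synthetic), statsmodels fits a nominal model with sm.MNLogit, returning one column of \beta estimates for each non-reference category:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept plus two synthetic predictors
y = rng.integers(0, 3, size=200)                 # three unordered categories (synthetic labels)
fit = sm.MNLogit(y, X).fit(disp=0)
print(fit.params)                                # k - 1 = 2 coefficient vectors; category 0 is the reference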

Ordinal Logistic Regression

For ordinal logistic regression, we again consider k possible outcomes as in nominal logistic
regression, except that the order matters. The multiple ordinal logistic regression model is the
following:

\begin{equation*} \sum_{l=1}^{j}\pi_{l}=\frac{\exp(\beta_{0,j}+\textbf{X}\beta)}{1+\exp(\beta_{0,j}+\textbf{X}\beta)}, \end{equation*}

such that j=1,\ldots,k-1, \beta_{0,1}\leq\beta_{0,2}\leq\ldots\leq\beta_{0,k-1}, and \pi_{j} again denotes a probability. Notice that this model is a cumulative sum of probabilities which involves just changing the intercept of the linear regression portion (so \beta is now (p - 1)-dimensional and X is such that the first column of this matrix is not a column of 1's). Also, it still holds that \sum_{j=1}^{k}\pi_{j}=1.

\pi_{j} is still the probability that an observation is in one of k categories, but we are
constrained by the model written in the equation above. The likelihood for the ordinal logistic
regression model is given by:

\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}
(1-\pi_{i,j})^{1-y_{i,j}}, \end{align*}

where the subscript (i, j) means the i^{\textrm{th}} observation belongs to the j^{\textrm{th}}
group. This yields the log likelihood:

\begin{equation*} \ell(\beta)=\sum_{i=1}^{n}\sum_{j=1}^{k}\left[y_{i,j}\log(\pi_{i,j})+(1-y_{i,j})\log(1-\pi_{i,j})\right]. \end{equation*}

Notice that this is identical to the nominal logistic regression likelihood. Thus, maximization
again has no closed-form solution, so we defer to a procedure like iteratively reweighted least
squares.

For ordinal logistic regression, a proportional odds model is used to determine the odds ratio.
Again, an odds ratio (\theta) of 1 serves as the baseline for comparison between the two
predictor levels, say \textbf{X}_{(1)} and \textbf{X}_{(2)}. Only one parameter and one odds
ratio is calculated for each predictor. Suppose we are interested in calculating the odds of

\textbf{X}_{(1)} to \textbf{X}_{(2)}. If \theta=1, then there is no association between the


response and these two predictors. If \theta>1, then the odds of success are higher for the
predictor \textbf{X}_{(1)}. If \theta<1, then the odds of success are less for the predictor
\textbf{X}_{(1)}. Values farther from 1 represent stronger degrees of association. For ordinal
logistic regression, the odds ratio utilizes cumulative probabilities and their complements and
is given by:

\begin{equation*} \theta=\frac{\sum_{j=1}^{k^{*}}\pi_{j}|_{\textbf{X}=\textbf{X}_{(1)}}/(1-\sum_{j=1}^{k^{*}}\pi_{j})|_{\textbf{X}=\textbf{X}_{(1)}}}{\sum_{j=1}^{k^{*}}\pi_{j}|_{\textbf{X}=\textbf{X}_{(2)}}/(1-\sum_{j=1}^{k^{*}}\pi_{j})|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}

15.3 - Further Logistic Regression Examples

Example 1: STAT 200 Dataset

Students in STAT 200 at Penn State were asked if they have ever driven after drinking
(dataset unfortunately no longer available). They also were asked, “How many days per
month do you drink at least two beers?” In the following discussion, \pi = the probability a
student says “yes” they have driven after drinking. This is modeled using X = days per month
of drinking two beers. Results from Minitab were as follows.

Some things to note from the results are:

We see that in the sample 122/249 students said they have driven after drinking.
(Yikes!)
Parameter estimates, given under Coef are \hat{\beta}_0 = −1.5514, and \hat{\beta}_1 =
0.19031.
The model for estimating \pi = the probability of ever having driven after drinking is

\hat{\pi}=\frac{\exp(-1.5514+0.19031X)}{1+\exp(-1.5514+0.19031X)}

The variable X = DaysBeer is a statistically significant predictor (Z = 6.46, P = 0.000).

We can also obtain a plot of the estimated probability of ever having driven under the
influence (\pi) versus days per month of drinking at least two beers.


The vertical axis shows the probability of ever having driven after drinking. For example, if X =
4 days per month of drinking beer, then the estimated probability is calculated as:

\hat{\pi}=\frac{\exp(-1.5514+0.19031(4))}{1+\exp(-1.5514+0.19031(4))}=\frac{\exp(-0.79016)}
{1+\exp(-0.79016)}=0.312

A few of these estimated probabilities are given in the following table:

DaysBeer 4 12 20 28

\hat{\pi} 0.312 0.675 0.905 0.97

In the results given above, we see that the estimate of the odds ratio is 1.21 for DaysBeer.
This is given under Odds Ratio in the table of coefficients, standard errors and so on. The
sample odds ratio was calculated as e^{0.19031}. The interpretation of the odds ratio is that for
each increase of one day of drinking beer per month, the predicted odds of having ever driven
after drinking are multiplied by 1.21.

Above we found that at X = 4, the predicted probability of ever driving after drinking is \hat{\pi}
= 0.312. Thus when X = 4, the predicted odds of ever driving after drinking is 0.312/(1 −
0.312) = 0.453. To find the odds when X = 5, one method would be to multiply the odds at X =
4 by the sample odds ratio. The calculation is 1.21 × 0.453 = 0.549. (Another method is to just
do the calculation as we did for X = 4.)
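
These probability and odds calculations are easy to script; here is a short sketch using the fitted coefficients:

import numpy as np

b0, b1 = -1.5514, 0.19031

def pi_hat(x):
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

def odds(x):
    return pi_hat(x) / (1 - pi_hat(x))

print(pi_hat(4))              # estimated probability at X = 4: ~0.312
print(odds(4))                # odds at X = 4: ~0.453
print(np.exp(b1))             # sample odds ratio: ~1.21
print(np.exp(b1) * odds(4))   # odds at X = 5: ~0.549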

Notice also that the results give a 95% confidence interval estimate of the odds ratio (1.14 to 1.28).

We now include Gender (male or female) as an x-variable (along with DaysBeer). Some
Minitab results are given below. Under Gender, the row for male is explaining that the
program created an indicator variable with a value of 1 if the student is male and a value of 0
if the student is female.


Some things to note from the results are:

The p-values are less than 0.05 for both DaysBeer and Gender. This is evidence that
both x-variables are useful for predicting the probability of ever having driven after
drinking.
For DaysBeer, the odds ratio is still estimated to equal 1.21 to two decimal places (calculated as e^{0.18693}).
For Gender, the odds ratio is 1.85 (calculated as e^{0.6172}). For males, the odds of ever
having driven after drinking is 1.85 times the odds for females, assuming DaysBeer is
held constant.

Finally, the results for testing H_{0}\colon\beta_{1}=\beta_{2}=0 with respect to the multiple logistic regression model are as follows:

Notice that since we have a p-value of 0.000 for this chi-square test, we therefore reject the
null hypothesis that all of the slopes are equal to 0.

Example 2: Toxicity Dataset

An experiment is done to test the effect of a toxic substance on insects. The data originate from the textbook, Applied Linear Statistical Models by Kutner, Nachtsheim, Neter, & Li.

At each of six dose levels, 250 insects are exposed to the substance and the number of insects that die is counted (toxicity.txt [3]). We can use Minitab to calculate the observed probabilities as the number of observed deaths out of 250 for each dose level.

A binary logistic regression model is used to describe the connection between the observed
probabilities of death as a function of dose level. Since the data is in event/trial format the
procedure in Minitab v17 is a little different from before:

Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model
Select "Response is in event/trial format"
Select "Deaths" for Number of events, "SampSize" for Number of trials (and type
"Death" for Event name if you like)
Select Dose as a Continuous predictor
Click Results and change "Display of results" to "Expanded tables"
Click Storage and select "Fits (event probabilities)"

The Minitab output is as follows:


Thus

\hat{\pi}=\frac{\exp(-2.644+0.674X)}{1+\exp(-2.644+0.674X)}

where X = Dose and \hat{\pi} is the estimated probability the insect dies (based on the
model).

Predicted probabilities of death (based on the logistic model) for the six dose levels are given
below (FITS1). These probabilities closely agree with the observed values (Observed p)
reported.

As an example of calculating the estimated probabilities, for Dose 1, we have

\hat{\pi}=\frac{\exp(-2.644+0.674(1))}{1+\exp(-2.644+0.674(1))}=0.1224

The odds ratio for Dose is 1.9621, the value under Odds Ratio in the output. It was calculated
as e^{0.674}. The interpretation of the odds ratio is that for every increase of 1 unit in dose level,
the estimated odds of insect death are multiplied by 1.9621.

As an example of odds and odds ratio:

At Dose = 1, the estimated odds of death is \hat{\pi}/(1− \hat{\pi}) = 0.1224/(1−0.1224) =


0.1395.
At Dose = 2, the estimated odds of death is \hat{\pi}/(1− \hat{\pi}) = 0.2149/(1−0.2149) =
0.2737.
The Odds Ratio = \frac{0.2737}{0.1395}, which is the ratio of the odds of death when
Dose = 2 compared to the odds when Dose = 1.

A property of the binary logistic regression model is that the odds ratio is the same for any
increase of one unit in X, regardless of the specific values of X.
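
A short sketch of these calculations with the fitted coefficients:

import numpy as np

b0, b1 = -2.644, 0.674

def pi_hat(dose):
    return np.exp(b0 + b1 * dose) / (1 + np.exp(b0 + b1 * dose))

def odds(dose):
    return pi_hat(dose) / (1 - pi_hat(dose))

for dose in range(1, 7):
    print(dose, round(pi_hat(dose), 4))   # ~0.1224 at Dose 1, ~0.2149 at Dose 2, and so on
print(np.exp(b1), odds(2) / odds(1))      # both ~1.962, the odds ratio for a one-unit dose increase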

15.4 - Poisson Regression


The Poisson distribution for a random variable Y has the following probability mass function for a given value Y = y:

\begin{equation*} \mbox{P}(Y=y|\lambda)=\frac{e^{-\lambda}\lambda^{y}}{y!}, \end{equation*}

for y=0,1,2,\ldots. Notice that the Poisson distribution is characterized by the single parameter
\lambda, which is the mean rate of occurrence for the event being measured. For the Poisson
distribution, it is assumed that large counts (with respect to the value of \lambda) are rare.

In Poisson regression the dependent variable (Y) is an observed count that follows the
Poisson distribution. The rate \lambda is determined by a set of p-1 predictors \textbf{X}=
(X_{1},\ldots,X_{p-1}). The expression relating these quantities is

\begin{equation*} \lambda=\exp\{\textbf{X}\beta\}. \end{equation*}

Thus, the fundamental Poisson regression model for observation i is given by

\begin{equation*} \mbox{P}(Y_{i}=y_{i}|\textbf{X}_{i},\beta)=\frac{e^{-\exp\{\textbf{X}_{i}\beta\}}
\exp\{\textbf{X}_{i}\beta\}^{y_{i}}}{y_{i}!}. \end{equation*}

That is, for a given set of predictors, the count outcome follows a Poisson distribution
with rate \exp\{\textbf{X}\beta\}. For a sample of size n, the likelihood for a Poisson regression
is given by:

\begin{equation*} L(\beta;\textbf{y},\textbf{X})=\prod_{i=1}^{n}\frac{e^{-\exp\{\textbf{X}_{i}
\beta\}}\exp\{\textbf{X}_{i}\beta\}^{y_{i}}}{y_{i}!}. \end{equation*}

This yields the log likelihood:

\begin{equation*} \ell(\beta)=\sum_{i=1}^{n}y_{i}\textbf{X}_{i}\beta-\sum_{i=1}^{n}
\exp\{\textbf{X}_{i}\beta\}-\sum_{i=1}^{n}\log(y_{i}!). \end{equation*}

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like
iteratively reweighted least squares is used to find an estimate of the regression coefficients,
\hat{\beta}. Once this value of \hat{\beta} has been obtained, we may proceed to define
various goodness-of-fit measures and calculated residuals. For the residuals we present, they
serve the same purpose as in linear regression. When plotted versus the response, they will
help identify suspect data points.

To illustrate, consider this example (poisson_simulated.txt [4]), which consists of a simulated data set of size n = 30 such that the response (Y) follows a Poisson distribution with rate \lambda=\exp\{0.50+0.07X\}. A plot of the response versus the predictor is given below.


The following gives the analysis of the Poisson regression data in Minitab v17:

Select Stat > Regression > Poisson Regression > Fit Poisson Model.
Select "y" for the Response.
Select "x" as a Continuous predictor.
Click Results and change "Display of results" to "Expanded tables."

This results in the following output:

Coefficients
Term Coef SE Coef 95% CI Z-Value P-Value VIF
Constant 0.308 0.289 (-0.259, 0.875) 1.06 0.287
x 0.0764 0.0173 (0.0424, 0.1103) 4.41 0.000 1.00

Regression Equation
y = exp(Y')
Y' = 0.308 + 0.0764 x

As you can see, the Wald test p-value for x of 0.000 indicates that the predictor is highly
significant.
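
A comparable fit can be obtained with Python's statsmodels; this sketch assumes poisson_simulated.txt [4] is saved locally with whitespace-separated columns x and y.

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("poisson_simulated.txt", sep=r"\s+")
X = sm.add_constant(data[["x"]])
fit = sm.GLM(data["y"], X, family=sm.families.Poisson()).fit()   # log link by default
print(fit.summary())        # coefficients near 0.308 and 0.0764
print(fit.deviance)         # ~27.84, the "Error" deviance reported below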

Deviance Test

Changes in the deviance can be used to test the null hypothesis that any subset of the \beta's
is equal to 0. The deviance, D(\hat{\beta}), is -2 times the difference between the log-
likelihood evaluated at the maximum likelihood estimate and the log-likelihood for a "saturated
model" (a theoretical model with a separate parameter for each observation and thus a
perfect fit). Suppose we test that r < p of the \beta's are equal to 0. Then the deviance test
statistic is given by:

\begin{equation*} G^2=D(\hat{\beta}^{(0)})-D(\hat{\beta}), \end{equation*}

where D(\hat{\beta}) is the deviance of the fitted (full) model and D(\hat{\beta}^{(0)}) is the
deviance of the model specified by the null hypothesis evaluated at the maximum likelihood
estimate of that reduced model. This test statistic has a \chi^{2} distribution with p-r degrees
of freedom. This test procedure is analogous to the general linear F test procedure for multiple linear regression.

To illustrate, the relevant Minitab output from the simulated example is:

Deviance Table
Source  DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression  1  20.47  42.37%  20.47  20.4677  20.47  0.000
x  1  20.47  42.37%  20.47  20.4677  20.47  0.000
Error  28  27.84  57.63%  27.84  0.9944
Total  29  48.31  100.00%

Since there is only a single predictor for this example, this table simply provides information
on the deviance test for x (p-value of 0.000), which matches the earlier Wald test result (p-
value of 0.000). (Note that the Wald test and deviance test will not in general give identical
results.) The Deviance Table includes the following:

The null model in this case has no predictors, so the fitted values are simply the sample
mean, 4.233. The deviance for the null model is D(\hat{\beta}^{(0)})=48.31, which is
shown in the "Total" row in the Deviance Table.
The deviance for the fitted model is D(\hat{\beta})=27.84, which is shown in the "Error"
row in the Deviance Table.
The deviance test statistic is therefore G^2=48.31-27.84=20.47.
The p-value comes from a \chi^{2} distribution with 2-1=1 degrees of freedom.
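
As with the logistic example, the deviance test arithmetic can be verified directly:

from scipy.stats import chi2

dev_null, dev_full = 48.31, 27.84     # "Total" and "Error" deviances from the table
G2 = dev_null - dev_full              # 20.47
print(G2, chi2.sf(G2, df=1))          # p ~6e-06, reported as 0.000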

Goodness-of-Fit

Overall performance of the fitted model can be measured by two different chi-square tests.
There is the Pearson statistic

\begin{equation*} X^2=\sum_{i=1}^{n}\frac{(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})^{2}}{\exp\
{\textbf{X}_{i}\hat{\beta}\}} \end{equation*}

and the deviance statistic

\begin{equation*} D=2\sum_{i=1}^{n}\biggl[y_{i}\log\biggl(\frac{y_{i}}{\exp\{\textbf{X}_{i}
\hat{\beta}\}}\biggr)-(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})\biggr]. \end{equation*}

Both of these statistics are approximately chi-square distributed with n - p degrees of freedom. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack-of-fit.
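
A minimal sketch of both statistics, assuming NumPy arrays y (observed counts) and mu (fitted means \exp\{\textbf{X}_{i}\hat{\beta}\}) and p estimated parameters:

import numpy as np
from scipy.stats import chi2

def poisson_gof(y, mu, p):
    X2 = np.sum((y - mu) ** 2 / mu)                       # Pearson statistic
    safe_y = np.where(y > 0, y, 1)                        # avoid log(0); those terms are zeroed below
    term = np.where(y > 0, y * np.log(safe_y / mu), 0.0)  # 0*log(0) treated as 0
    D = 2 * np.sum(term - (y - mu))                       # deviance statistic
    df = len(y) - p
    return X2, D, chi2.sf(X2, df), chi2.sf(D, df)         # statistics and p-values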

To illustrate, the relevant Minitab output from the simulated example is:

Goodness-of-Fit Tests
Test DF Estimate Mean Chi-Square P-Value
Deviance 28 27.84209 0.99436 27.84 0.473
Pearson 28 26.09324 0.93190 26.09 0.568

The high p-values indicate no evidence of lack-of-fit.



Overdispersion means that the actual covariance matrix for the observed data exceeds that
for the specified model for Y|\textbf{X}. For a Poisson distribution, the mean and the variance
are equal. In practice, the data almost never reflects this fact and we have overdispersion in
the Poisson regression model if (as is often the case) the variance is greater than the mean.
In addition to testing goodness-of-fit, the Pearson statistic can also be used as a test of
overdispersion. Note that overdispersion can also be measured in the logistic regression
models that were discussed earlier.

Pseudo R2

The value of R2 used in linear regression also does not extend to Poisson regression. One
commonly used measure is the pseudo R2, defined as

\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}=1-
\frac{D(\hat{\beta})}{D(\hat{\beta_{0}})}, \end{equation*}

where \ell(\hat{\beta_{0}}) is the log likelihood of the model when only the intercept is
included. The pseudo R2 goes from 0 to 1 with 1 being a perfect fit.

To illustrate, the relevant Minitab output from the simulated example is:

Model Summary
Deviance Deviance
R-Sq R-Sq(adj) AIC
42.37% 40.30% 124.50

Recall from above that D(\hat{\beta})=27.84 and D(\hat{\beta}^{(0)})=48.31, so:

\begin{equation*} R^{2}=1-\frac{27.84}{48.31}=0.4237. \end{equation*}

Raw Residual

The raw residual is the difference between the actual response and the estimated value from
the model. Remember that the variance is equal to the mean for a Poisson random variable.
Therefore, we expect that the variances of the residuals are unequal. This can lead to
difficulties in the interpretation of the raw residuals, yet it is still used. The formula for the raw
residual is

\begin{equation*} r_{i}=y_{i}-\exp\{\textbf{X}_{i}\beta\}. \end{equation*}

Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by
the standard deviation. The formula for the Pearson residuals is

\begin{equation*} p_{i}=\frac{r_{i}}{\sqrt{\hat{\phi}\exp\{\textbf{X}_{i}\beta\}}}, \end{equation*}

where

\begin{equation*} \hat{\phi}=\frac{1}{n-p}\sum_{i=1}^{n}\frac{(y_{i}-\exp\{\textbf{X}_{i}
\hat{\beta}\})^{2}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}. \end{equation*}

\hat{\phi} is a dispersion parameter to help control overdispersion.



Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the
deviance statistic. The formula for the deviance residual is

\begin{equation*} d_{i}=\texttt{sgn}(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}\})\sqrt{2\biggl\{y_{i}
\log\biggl(\frac{y_{i}}{\exp\{\textbf{X}_{i}\hat{\beta}\}}\biggr)-(y_{i}-\exp\{\textbf{X}_{i}\hat{\beta}
\})\biggr\}}. \end{equation*}

The plots below show the Pearson residuals and deviance residuals versus the fitted values
for the simulated example.

These plots appear to be good for a Poisson fit. Further diagnostic plots can also be produced
and model selection techniques can be employed when faced with multiple predictors.

Hat Values

The hat matrix serves a similar purpose as in the case of linear regression – to measure the influence of each observation on the overall fit of the model. The hat values, h_{i,i}, are the
diagonal entries of the Hat matrix

\begin{equation*} H=\textbf{W}^{1/2}\textbf{X}(\textbf{X}^{T}\textbf{W}\textbf{X})^{-1}\textbf{X}^{T}\textbf{W}^{1/2}, \end{equation*}

where W is an n\times n diagonal matrix with the values of \exp\{\textbf{X}_{i}\hat{\beta}\} on the diagonal. As before, a hat value (leverage) is large if h_{i,i}>2p/n.

Studentized Residuals

Finally, we can also report Studentized versions of some of the earlier residuals. The
Studentized Pearson residuals are given by

\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}} \end{equation*}

and the Studentized deviance residuals are given by

\begin{equation*} sd_{i}=\frac{d_{i}}{\sqrt{1-h_{i, i}}}. \end{equation*}

To illustrate, the relevant Minitab output from the simulated example is:

Fits and Diagnostics for Unusual Observations

Obs  y       Fit    SE Fit  95% CI           Resid   Std Resid  Del Resid  HI        Cook's D
8    10.000  4.983  0.452   (4.171, 5.952)   1.974   2.02       2.03       0.040969  0.11
21   6.000   8.503  1.408   (6.147, 11.763)  -0.907  -1.04      -1.02      0.233132  0.15

Obs  DFITS
8    0.474408   R
21   -0.540485  X

R  Large residual
X  Unusual X

The default residuals in this output (set under Minitab's Regression Options) are deviance
residuals, so observation 8 has a deviance residual of 1.974 and a studentized deviance
residual of 2.02, while observation 21 has a leverage (h) of 0.233132.

15.5 - Generalized Linear Models


All of the regression models we have considered (including multiple linear, logistic, and
Poisson) actually belong to a family of models called generalized linear models. (In fact, a
more "generalized" framework for regression models is called general regression models,
which includes any parametric regression model.) Generalized linear models provide a
generalization of ordinary least squares regression that relates the random term (the
response Y) to the systematic term (the linear predictor \textbf{X}\beta) via a link function
(denoted by g(\cdot)). Specifically, we have the relation

\begin{equation*} \mbox{E}(Y)=\mu=g^{-1}(\textbf{X}\beta), \end{equation*}

so g(\mu)=\textbf{X}\beta. Some common link functions are:



The identity link:

\begin{equation*} g(\mu)=\mu=\textbf{X}\beta, \end{equation*}

which is used in traditional linear regression.

The logit link:

\begin{align*} &g(\mu)=\log\biggl(\frac{\mu}{1-\mu}\biggr)=\textbf{X}\beta\\ &\Rightarrow\mu=\frac{e^{\textbf{X}\beta}}{1+e^{\textbf{X}\beta}}, \end{align*}

which is used in logistic regression.

The log link:

\begin{align*} &g(\mu)=\log(\mu)=\textbf{X}\beta\\ &\Rightarrow\mu=e^{\textbf{X}\beta}, \end{align*}

which is used in Poisson regression.

The probit link:

\begin{align*} &g(\mu)=\Phi^{-1}(\mu)=\textbf{X}\beta\\ &\Rightarrow\mu=\Phi(\textbf{X}\beta), \end{align*}

where \Phi(\cdot) is the cumulative distribution function of the standard normal


distribution. This link function is also sometimes called the normit link. This also can be
used in logistic regression.

The complementary log-log link:

\begin{align*} &g(\mu)=\log(-\log(1-\mu))=\textbf{X}\beta\\ &\Rightarrow\mu=1-\exp\{-e^{\textbf{X}\beta}\}, \end{align*}

which can also be used in logistic regression. This link function is also sometimes called
the gompit link.

The power link:

\begin{align*} &g(\mu)=\mu^{\lambda}=\textbf{X}\beta\\ &\Rightarrow\mu=(\textbf{X}\beta)^{1/\lambda}, \end{align*}

where \lambda\neq 0. This is used in other regressions which we do not explore (such as
gamma regression and inverse Gaussian regression).
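
In Python's statsmodels, the link functions listed above are selected through a family object passed to GLM; the mapping below is a sketch only (X and y stand for any design matrix and compatible response, not course data).

import statsmodels.api as sm
from statsmodels.genmod.families import links

families = {
    "identity": sm.families.Gaussian(links.Identity()),   # traditional linear regression
    "logit":    sm.families.Binomial(links.Logit()),      # logistic regression
    "log":      sm.families.Poisson(links.Log()),         # Poisson regression
    "probit":   sm.families.Binomial(links.Probit()),     # probit (normit) regression
    "cloglog":  sm.families.Binomial(links.CLogLog()),    # complementary log-log (gompit)
}
# A fit with any of these would look like: sm.GLM(y, X, family=families["logit"]).fit()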

Also, the variance is typically a function of the mean and is often written as

\begin{equation*} \mbox{Var}(Y)=V(\mu)=V(g^{-1}(\textbf{X}\beta)). \end{equation*}

The random variable Y is assumed to belong to an exponential family distribution where the
density can be expressed in the form

\begin{equation*} q(y;\theta,\phi)=\exp\biggl\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\biggr\},
\end{equation*}

where a(\cdot), b(\cdot), and c(\cdot) are specified functions, \theta is a parameter related to
the mean of the distribution, and \phi is called the dispersion parameter. Many probability
distributions belong to the exponential family. For example, the normal distribution is used for
traditional linear regression, the binomial distribution is used for logistic regression, and the
Poisson distribution is used for Poisson regression. Other exponential family distributions lead
to gamma regression, inverse Gaussian (normal) regression, and negative binomial
regression, just to name a few.

The unknown parameters, \beta, are typically estimated with maximum likelihood techniques
(in particular, using iteratively reweighted least squares), Bayesian methods, or quasi-
likelihood methods. The quasi-likelihood is a function which possesses similar properties to
the log-likelihood function and is most often used with count or binary data. Specifically, for a
realization y of the random variable Y, it is defined as

\begin{equation*} Q(\mu;y)=\int_{y}^{\mu}\frac{y-t}{\sigma^{2}V(t)}dt, \end{equation*}

where \sigma^{2} is a scale parameter. There are also tests using likelihood ratio statistics for
model development to determine if any predictors may be dropped from the model.

15.6 - Nonlinear Regression


All of the models we have discussed thus far have been linear in the parameters (i.e., linear in
the beta's). For example, polynomial regression was used to model curvature in our data by
using higher-ordered values of the predictors. However, the final regression model was just a
linear combination of higher-ordered predictors.

Now we are interested in studying the nonlinear regression model:

\begin{equation*} Y=f(\textbf{X},\beta)+\epsilon, \end{equation*}

where X is a vector of p predictors, \beta is a vector of k parameters, f(\cdot) is some known regression function, and \epsilon is an error term whose distribution may or may not be normal. Notice that we no longer necessarily have the dimension of the parameter vector simply one greater than the number of predictors. Some examples of nonlinear regression models are:

\begin{align*} y_{i}&=\frac{e^{\beta_{0}+\beta_{1}x_{i}}}{1+e^{\beta_{0}+\beta_{1}x_{i}}}+
\epsilon_{i} \\ y_{i}&=\frac{\beta_{0}+\beta_{1}x_{i}}{1+\beta_{2}e^{\beta_{3}x_{i}}}+\epsilon_{i}
\\ y_{i}&=\beta_{0}+(0.4-\beta_{0})e^{-\beta_{1}(x_{i}-5)}+\epsilon_{i}. \end{align*}

However, there are some nonlinear models which are actually called intrinsically linear
because they can be made linear in the parameters by a simple transformation. For example:

\begin{equation*} Y=\frac{\beta_{0}X}{\beta_{1}+X} \end{equation*}

can be rewritten as

\begin{align*} \frac{1}{Y}&=\frac{1}{\beta_{0}}+\frac{\beta_{1}}{\beta_{0}}\frac{1}{X}\\
&=\theta_{0}+\theta_{1}\frac{1}{X}, \end{align*}

which is linear in the transformed parameters \theta_{0} and \theta_{1}. In such cases,
transforming a model to its linear form often provides better inference procedures and
confidence intervals, but one must be cognizant of the effects that the transformation has on
the distribution of the errors.

Nonlinear Least Squares

Returning to cases in which it is not possible to transform the model to a linear form, consider
the setting

\begin{equation*} Y_{i}=f(\textbf{X}_{i},\beta)+\epsilon_{i}, \end{equation*}

where the \epsilon_{i} are iid normal with mean 0 and constant variance \sigma^{2}. For this
setting, we can rely on some of the least squares theory we have developed over the course.
For other nonnormal error terms, different techniques need to be employed.

First, let

\begin{equation*} Q=\sum_{i=1}^{n}(y_{i}-f(\textbf{X}_{i},\beta))^{2}. \end{equation*}

In order to find

\begin{equation*} \hat{\beta}=\arg\min_{\beta}Q, \end{equation*}

we first find each of the partial derivatives of Q with respect to \beta_{j}. Then, we set each of
the partial derivatives equal to 0 and the parameters \beta_{k} are each replaced by
\hat{\beta}_{k}. The functions to be solved are nonlinear in the parameter estimates
\hat{\beta}_{k} and are often difficult to solve, even in the simplest cases. Hence, iterative
numerical methods are often employed. Even more difficulty arises in that multiple solutions
may be possible!

Algorithms for nonlinear least squares estimation include:

Newton's method, a classical method based on a gradient approach but which can be
computationally challenging and heavily dependent on good starting values.
The Gauss-Newton algorithm, a modification of Newton's method that gives a good approximation of the solution that Newton's method should have arrived at but which is not guaranteed to converge (a bare-bones version of the iteration is sketched after this list).
The Levenberg-Marquardt method, which can take care of computational difficulties
arising with the other methods but can require a tedious search for the optimal value of
a tuning parameter.
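
To make the idea concrete, here is a bare-bones Gauss-Newton iteration in Python; it assumes you supply the mean function f(X, beta) and its Jacobian jac(X, beta) with respect to the parameters, and it is a sketch rather than production code (no step-halving or other convergence safeguards).

import numpy as np

def gauss_newton(f, jac, X, y, beta0, tol=1e-8, max_iter=200):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(X, beta)                               # residuals at the current estimate
        J = jac(X, beta)                                 # n x k Jacobian of f with respect to beta
        step, *_ = np.linalg.lstsq(J, r, rcond=None)     # linearized least squares update
        beta = beta + step
        if np.linalg.norm(step) < tol:                   # stop when the update is negligible
            break
    return beta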

15.7 - Exponential Regression Example


One simple nonlinear model is the exponential regression model

\begin{equation*} y_{i}=\beta_{0}+\beta_{1}\exp(\beta_{2}x_{i,1}+\ldots+\beta_{p+1}x_{i,p})+\epsilon_{i}, \end{equation*}

where the \epsilon_{i} are iid normal with mean 0 and constant variance \sigma^{2}. Notice
that if \beta_{0}=0, then the above is intrinsically linear by taking the natural logarithm of both
sides.

Exponential regression is probably one of the simplest nonlinear regression models. An example where an exponential regression is often utilized is when relating the concentration of a substance (the response) to elapsed time (the predictor).

To illustrate, consider the example on long-term recovery after discharge from hospital [5] from

page 514 of Applied Linear Regression Models (4th ed) by Kutner, Nachtsheim, and Neter.
The response variable, Y, is the prognostic index for long-term recovery and the predictor
variable, X, is the number of days of hospitalization. The proposed model is the two-
parameter exponential model:

\begin{equation*} Y_{i}=\theta_{0}\exp(\theta_{1}X_i)+\epsilon_{i}, \end{equation*}

where the \epsilon_i are independent normal with constant variance.

We'll use Minitab's nonlinear regression routine to apply the Gauss-Newton algorithm to
estimate \theta_0 and \theta_1. Before we do this, however, we have to find initial values for
\theta_0 and \theta_1. One way to do this is to note that we can linearize the response
function by taking the natural logarithm:

\begin{equation*} \log(\theta_{0}\exp(\theta_{1}X_i)) = \log(\theta_{0}) + \theta_{1}X_i. \end{equation*}

Thus we can fit a simple linear regression model with response, \log(Y), and predictor, X, and
the intercept (4.0372) gives us an estimate of \log(\theta_{0}) while the slope (-0.03797) gives
us an estimate of \theta_{1}. (We then calculate \exp(4.0372)=56.7 to estimate \theta_0.)

Now we can fit the nonlinear regression model:

Select Stat > Regression > Nonlinear Regression, select prog for the
response, and click "Use Catalog" under "Expectation Function."
Select the "Exponential" function with 1 predictor and 2 parameters in the Catalog dialog
box and click OK to go to the "Choose Predictors" dialog.
Select days to be the "Actual predictor" and click OK to go back to the Catalog dialog
box, where you should see "Theta1 * exp( Theta2 * days )" in the "Expectation
Function" box.
Click "Parameters" and type "56.7" next to "Theta1" and "-0.038" next to "Theta2" and
click OK to go back to the Nonlinear Regression dialog box.
Click "Options" to confirm that Mintab will use the Gauss-Newton algorithm (the other
choice is Levenberg-Marquardt) and click OK to go back to the Nonlinear Regression
dialog box.
Click "Graphs" to confirm that Mintab will produce a plot of the fitted curve with data and
click OK to go back to the Nonlinear Regression dialog box.
Click OK to obtain the following output:

Nonlinear Regression: prog = Theta1 * exp(Theta2 * days)

Method
Algorithm Gauss-Newton
Max iterations 200
Tolerance 0.00001

Starting Values for Parameters


Parameter Value
Theta1 56.7
Theta2 -0.038

Equation
prog = 58.6066 * exp(-0.0395865 * days)

Parameter Estimates
Parameter Estimate SE Estimate
Theta1 58.6066 1.47216
Theta2 -0.0396 0.00171

prog = Theta1 * exp(Theta2 * days)

Summary
Iterations 5
Final SSE 49.4593
DFE 13
MSE 3.80456
S 1.95053
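
The same fit can be reproduced with SciPy's nonlinear least squares routine; this sketch assumes the recovery data [5] are saved locally with whitespace-separated columns days and prog.

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

data = pd.read_csv("recovery.txt", sep=r"\s+")

def expo(x, theta1, theta2):
    return theta1 * np.exp(theta2 * x)

theta, cov = curve_fit(expo, data["days"], data["prog"], p0=[56.7, -0.038])
print(theta)                    # approximately [58.61, -0.0396]
print(np.sqrt(np.diag(cov)))    # standard errors approximately [1.47, 0.0017]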

15.8 - Population Growth Example


Census Data

A simple model for population growth towards an asymptote is the logistic model

\begin{equation*} y_{i}=\frac{\beta_{1}}
{1+\exp(\beta_{2}+\beta_{3}x_{i})}+\epsilon_{i},
\end{equation*}

where y_{i} is the population size at time x_{i}, \beta_{1} is the asymptote towards which the population grows, \beta_{2} reflects the size of the population at time x = 0 (relative to its asymptotic size), and \beta_{3} controls the growth rate of the population.

We fit this model to Census population data (us_census.txt [6]) for the United States (in millions) ranging from 1790 through 1990 (see below).

year population
1790 3.929
1800 5.308
1810 7.240
1820 9.638
1830 12.866
1840 17.069
1850 23.192
1860 31.443
1870 39.818
1880 50.156
1890 62.948
1900 75.995
1910 91.972
1920 105.711
1930 122.775
1940 131.669
1950 150.697
1960 179.323
1970 203.302
1980 226.542
1990 248.710

The data are graphed (see below) and the line represents the fit of the logistic population
growth model.


To fit the logistic model to the U. S. Census data, we need starting values for the parameters.
It is often important in nonlinear least squares estimation to choose reasonable starting
values, which generally requires some insight into the structure of the model. We know that
\beta_{1} represents asymptotic population. The data in the plot above show that in 1990 the
U. S. population stood at about 250 million and did not appear to be close to an asymptote; so
as not to extrapolate too far beyond the data, let us set the starting value of \beta_{1} to 350.
It is convenient to scale time so that x_{1}=0 in 1790, and so that the unit of time is 10 years.

Then substituting \beta_{1}=350 and x=0 into the model, using the value y_{1}=3.929 from the
data, and assuming that the error is 0, we have

\begin{equation*} 3.929=\frac{350}{1+\exp(\beta_{2}+\beta_{3}(0))}. \end{equation*}

Solving for \beta_{2} gives us a plausible start value for this parameter:

\begin{align*} \exp(\beta_{2})&=\frac{350}{3.929}-1\\ \beta_{2}&=\log\biggl(\frac{350}{3.929}-1\biggr)\approx 4.5. \end{align*}

Finally, returning to the data, at time x = 1 (i.e., at the second Census performed in 1800), the
population was y_{2}=5.308. Using this value, along with the previously determined start
values for \beta_{1} and \beta_{2}, and again setting the error to 0, we have

\begin{equation*} 5.308=\frac{350}{1+\exp(4.5+\beta_{3}(1))}. \end{equation*}

Solving for \beta_{3} we get

\begin{align*} \exp(4.5+\beta_{3})&=\frac{350}{5.308}-1\\ \beta_{3}&=\log\biggl(\frac{350}{5.308}-1\biggr)-4.5\approx -0.3. \end{align*}

So now we have starting values for the nonlinear least squares algorithm that we use. Below is the output from fitting the model in Minitab v17 using Gauss-Newton:

Select Stat > Regression > Nonlinear Regression


Select "population" for the Response
Type the following into the "Edit directly" box under Expectation Function:
beta1/(1+exp(beta2+beta3*(year-1790)/10))

Click Parameters and type in the values specified above (350, 4.5, and –0.3)

As you can see, the starting values resulted in convergence with values not too far from our
guess.
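
An equivalent fit with SciPy, assuming the Census data [6] are saved locally with whitespace-separated columns year and population, and rescaling time to decades since 1790 as above:

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

data = pd.read_csv("us_census.txt", sep=r"\s+")
x = (data["year"] - 1790) / 10.0                 # decades since 1790

def logistic_growth(x, beta1, beta2, beta3):
    return beta1 / (1 + np.exp(beta2 + beta3 * x))

beta, cov = curve_fit(logistic_growth, x, data["population"], p0=[350.0, 4.5, -0.3])
print(beta)                                      # fitted asymptote, scale, and growth-rate parameters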

Here is a plot of the residuals versus the year.

As you can see, the logistic functional form that we chose did catch the gross characteristics
of this data, but some of the nuances appear to not be as well characterized. Since there are
indications of some cyclical behavior, a model incorporating correlated errors or, perhaps,
trigonometric functions could be investigated.

Source URL: https://onlinecourses.science.psu.edu/stat501/node/369

Links:
[1] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt
[2] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/DiseaseOutbreak.txt
[3] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/toxicity.txt
[4] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/poisson_simulated.txt
[5] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/recovery.txt
[6] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/us_census.txt
