Section V
16
Understanding Relationships: Numerical Data Part 2
Preview
Chapter Learning Objectives
16.1 The Simple Linear Regression Model
16.2 Inferences Concerning the Slope of the Population Regression Line
16.3 Checking Model Adequacy
Are You Ready to Move On? Chapter 16 Review Exercises
Technology Notes
AP* Review Questions for Chapter 16
Preview
In Chapter 4, you learned how to describe relationships between two numerical
variables. When the relationship was judged to be linear, you found the equation
of the least squares regression line and assessed the quality of the fit using the
scatterplot, the residual plot, and the values of the coefficient of determination ($r^2$)
and the standard deviation about the least squares line ($s_e$). In this chapter you will
learn how to make inferences about the slope of the population regression line.
85241_ch16_ptg01.indd 734
20/12/12 6:39 PM
linear relationships.
M3 Know the conditions for appropriate use of methods for making inferences about $\beta$.
M4 Compute the margin of error when the sample slope $b$ is used to estimate a population slope $\beta$.
M5 Use the five-step process for estimation problems (EMC$^3$) and computer output to construct and
interpret a confidence interval estimate for the slope of a population regression line.
M6 Use the five-step process (HMC$^3$) to test hypotheses about the slope of the population
regression line.
M7 Use graphs to identify potential outliers and influential points.
Preview Example
Premature Babies
Babies born prematurely (before the 37th week of pregnancy) often have low birth
weights. Is a low birth weight related to factors that affect brain function? The
authors of the paper "Intrauterine Growth Restriction Affects the Preterm Infant's
Hippocampus" (Pediatric Research [2008]: 438-43) hoped to use data from a study of
premature babies to answer this question. They measured $x$ = birth weight (in grams)
and $y$ = hippocampus volume (in mL) for 26 premature babies. The hippocampus
is a part of the brain that is important in the development of both short- and long-term memory. The sample correlation coefficient for their data is $r = 0.4722$ and the
Figure 16.1: Scatterplot of hippocampus volume versus birth weight.
In this chapter, you will learn methods that will help you determine if there is a
real and useful linear relationship between two variables or if the pattern in the data
could be simply due to chance differences that occur when a sample is selected from a
population.
Section 16.1 The Simple Linear Regression Model
In a scatterplot of $y$ versus $x$, some of the data points will fall above the graph of $f(x)$
and some will fall below. Thinking geometrically, if $e > 0$, the corresponding point in the
scatterplot will lie above the graph of the function $y = f(x)$. If $e < 0$, the corresponding
point will fall below the graph of $f(x)$.
For example, consider the probabilistic model
$$y = \underbrace{50 - 10x + x^2}_{f(x)} + e$$
The graph of the function $y = 50 - 10x + x^2$ is shown as the orange curve in Figure 16.2.
The observed point (4, 30) is also shown in the figure. Because $f(4) = 50 - 10(4) + 4^2 = 50 - 40 + 16 = 26$ for this point, you can write $y = f(x) + e$, where $e = 4$. The point
(4, 30) falls 4 above the graph of the function $y = 50 - 10x + x^2$.
Unless otherwise noted, all content on this page is Cengage Learning.
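The deviation for the observed point (4, 30) can be checked with a few lines of Python. This is only a minimal sketch: the names `f`, `x_obs`, `y_obs`, and `e` are illustrative choices that mirror the notation of the text.

```python
def f(x):
    """Deterministic part of the probabilistic model: f(x) = 50 - 10x + x^2."""
    return 50 - 10 * x + x ** 2


# Observed point discussed in the text
x_obs, y_obs = 4, 30

# The random deviation is the vertical distance from the curve: e = y - f(x)
e = y_obs - f(x_obs)

print(f(x_obs))  # 26
print(e)         # 4
```

A positive value of `e` corresponds to a point above the curve, and a negative value to a point below it.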
Figure 16.2: Graph of $y = 50 - 10x + x^2$ and the observed point (4, 30).
Figure 16.3 shows two observations in relation to the population regression line.
Figure 16.3: Two observations in relation to the population regression line (slope $\beta$, vertical intercept $\alpha$). The observation when $x = x_1$ has a positive deviation $e_1$; the observation when $x = x_2$ has a negative deviation $e_2$.
Before you actually observe a value of y for any particular value of x, you are
uncertain about the value of e. It could be negative, positive, or even 0. Also, e might
be quite large in magnitude (resulting in a point far from the population regression line)
or quite small (resulting in a point very close to the line). The simple linear regression
model makes some assumptions about the distribution of e at any particular x value in
the population.
Basic Assumptions of the Simple Linear Regression Model
1. The distribution of $e$ at any particular $x$ value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of $e$ (which describes the spread of its distribution)
is the same for any particular value of $x$. This standard deviation is denoted
by $\sigma_e$.
3. The distribution of $e$ at any particular $x$ value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are
independent of one another.
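One way to get a feel for these four assumptions is to simulate data that satisfy them. The sketch below is illustrative only: the parameter values `ALPHA`, `BETA`, and `SIGMA_E` are hypothetical and do not come from the text, and the function name `simulate_y` is an arbitrary choice.

```python
import random
import statistics


def simulate_y(alpha, beta, sigma_e, x, n, seed=0):
    """Draw n independent y values at a fixed x under the model y = alpha + beta*x + e,
    where e is normal with mean 0 and standard deviation sigma_e."""
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0, sigma_e) for _ in range(n)]


# Hypothetical population values, chosen only for illustration
ALPHA, BETA, SIGMA_E = 10.0, 2.0, 1.5

for x in (1, 5, 9):
    ys = simulate_y(ALPHA, BETA, SIGMA_E, x, n=20000, seed=x)
    # The sample mean should be close to alpha + beta*x, and the sample
    # standard deviation close to sigma_e, whatever the value of x
    print(x, round(statistics.mean(ys), 2), round(statistics.stdev(ys), 2))
```

In each simulated subpopulation the mean tracks the height of the line at that `x`, while the spread stays roughly the same.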
The simple linear regression model assumptions about the variability in the values
of $e$ in the population imply that there is also variability in the $y$ values observed at any
particular value of $x$. Consider $y$ when $x$ has some fixed value $x^*$, so that
$$y = \alpha + \beta x^* + e.$$
Because $\alpha$ and $\beta$ are fixed (they are unknown population values), $\alpha + \beta x^*$ is also a
fixed number. The sum of a fixed number and a normally distributed variable ($e$) is
also a normally distributed variable (the bell-shaped curve is simply shifted), so $y$
itself has a normal distribution. Furthermore, $\mu_e = 0$ implies that the mean value of $y$
is $\alpha + \beta x^*$, the height of the population regression line for the value $x = x^*$. Finally,
because there is no variability in the fixed number $\alpha + \beta x^*$, the standard deviation of
$y$ is the same as the standard deviation of $e$. These properties are summarized in the
following box.
$$(\text{mean } y \text{ value for } x^*) = (\text{height of the population regression line above } x^*) = \alpha + \beta x^*$$
and
$$(\text{standard deviation of } y \text{ for a fixed value } x^*) = \sigma_e$$
The slope $\beta$ of the population regression line is the mean or expected change
in $y$ associated with a 1-unit increase in $x$. The $y$ intercept $\alpha$ is the height of
the population line when $x = 0$.
The value of $\sigma_e$ determines how much the $(x, y)$ observations deviate vertically
from the population line; when $\sigma_e$ is small, most observations will be close to
the line, but when $\sigma_e$ is large, the observations will tend to deviate more from
the line.
The key features of the model are illustrated in Figures 16.4 and 16.5. Notice that
the three normal curves in Figure 16.4 have identical spreads. This is a consequence of
$\sigma_e$ being the same at any value of $x$, which implies that the variability in the $y$ values at a
particular value of $x$ is constant: the variability does not depend on the value of $x$.
Figure 16.4: The population regression line $y = \alpha + \beta x$ (the line of mean values), with normal curves centered at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$.
Figure 16.5: Two panels, (a) and (b), each showing the population regression line.
Suppose that the actual model equation has $\alpha = 0$, $\beta = 0.001$, and $\sigma_e = 0.09$ (these
values are consistent with the findings in the article). The population regression line is
shown in Figure 16.6.
Figure 16.6: The population regression line $y = 0.001x$; the mean $y$ value when $x = 190$ is 0.19.
If the distribution of the random errors at any fixed weight ($x$ value) is normal, then
the variable $y$ = weight loss is normally distributed with
$$\mu_y = 0 + 0.001x$$
$$\sigma_y = 0.09$$
For example, when $x = 190$ (corresponding to a 190-pound wrestler), weight loss has
mean value
$$\mu_y = 0 + 0.001(190) = 0.19 \text{ pounds}$$
Because the standard deviation of $y$ is $\sigma_y = 0.09$, the interval $0.19 \pm 2(0.09) = (0.01, 0.37)$ includes $y$ values that are within 2 standard deviations of the mean value for $y$ when
$x = 190$. Roughly 95% of the weight loss observations made for 190-lb wrestlers will be in
this range. The slope $\beta = 0.001$ can be interpreted as the mean change in weight loss associated
with each additional pound of body weight.
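The arithmetic for the 190-pound wrestler can be reproduced as follows. This is a small sketch that uses the parameter values quoted above; the function names `mean_y` and `two_sd_interval` are illustrative and not part of the text.

```python
# Population values assumed in the wrestler example
ALPHA, BETA, SIGMA_Y = 0.0, 0.001, 0.09


def mean_y(x):
    """Mean weight loss for wrestlers of body weight x: the height of the population line."""
    return ALPHA + BETA * x


def two_sd_interval(x):
    """Interval of y values within 2 standard deviations of the mean at body weight x."""
    m = mean_y(x)
    return (m - 2 * SIGMA_Y, m + 2 * SIGMA_Y)


m = mean_y(190)
low, high = two_sd_interval(190)
print(round(m, 2), round(low, 2), round(high, 2))  # 0.19 0.01 0.37
```

Because $\sigma_y$ does not depend on $x$, the interval has the same width for any body weight; only its center moves along the line.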
More insight into model properties can be gained by thinking of the population of all
$(x, y)$ pairs as consisting of many smaller subpopulations. Each subpopulation contains
pairs for which $x$ has a fixed value. Suppose, for example, that in a large population of
college students the variables
$x$ = grade point average in major courses
and
$y$ = starting salary after graduation
are related according to the simple linear regression model. Then you can think about the
subpopulation of all pairs with $x = 3.20$ (corresponding to all students with a grade point
average of 3.20 in major courses), the subpopulation of all pairs having $x = 2.75$, and so
on. The model assumes that for each of these subpopulations, $y$ is normally distributed
with the same standard deviation, and that the mean $y$ value (rather than $y$ itself) is linearly
related to $x$.
In practice, the judgment of whether the simple linear regression model is
appropriate (that is, the judgment about the credibility of the assumptions underlying the
linear model) must be based on knowledge of how the data were collected, as well as an
inspection of various plots of the data and the residuals. The sample observations should be
independent of one another, which will be the case if the data are from a random sample.
In addition, the scatterplot should show a linear rather than a curved pattern, and the vertical spread of points should be very similar throughout the range of $x$ values. Figure 16.7
shows plots with three different patterns; only the first pattern is consistent with the simple
linear regression model assumptions.
Figure 16.7: Three scatterplots, (a), (b), and (c), showing different patterns.
$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$
The values of $a$ and $b$ are usually obtained using statistical software or a graphing
calculator. If the slope and intercept are calculated by hand, you can use the following computational formula:
$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
$$\hat{y} = a + bx$$
Let $x^*$ denote a specified value of the independent variable $x$. Then $a + bx^*$ has
two different interpretations:
1. It is a point estimate of the mean $y$ value when $x = x^*$.
2. It is a point prediction of an individual $y$ value to be observed when $x = x^*$.
Example 16.2 Mother's Age and Baby's Birth Weight
Medical researchers have noted that adolescent females are much more likely to deliver
low-birth-weight babies than are adult females. (Low birth weight in humans is generally
defined as a weight below 2,500 grams.) Because low-birth-weight babies have higher
mortality rates, a number of studies have examined the relationship between birth weight
and mother's age for babies born to young mothers.
One such study is described in the article "Body Size and Intelligence in 6-Year-Olds:
Are Offspring of Teenage Mothers at Risk?" (Maternal and Child Health Journal [2009]:
847-856). The following data on
$x$ = mother's age (in years)
and
$y$ = baby's birth weight (in grams)
are consistent with summary values given in the article and also with data published by the
National Center for Health Statistics.
Observation   1      2      3      4      5      6      7      8      9      10
x             15     17     18     15     16     19     17     16     18     19
y             2,289  3,393  3,271  2,648  2,897  3,327  2,970  2,535  3,138  3,573
A scatterplot of the data is given in Figure 16.8. The scatterplot shows a linear pattern,
and the spread in the y values appears to be similar across the range of x values. This
supports the appropriateness of the simple linear regression model.
Figure 16.8: Scatterplot of baby's weight (g) versus mother's age (yr).
For these data, the equation of the estimated regression line was found using statistical
software, resulting in
$$\hat{y} = a + bx = -1163.45 + 245.15x$$
An estimate of the mean birth weight of babies born to 18-year-old mothers results
from substituting $x = 18$ into the estimated equation:
$$a + bx = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$
Similarly, you would predict the birth weight of a baby to be born to a particular
18-year-old mother to be
$$\hat{y} = (\text{predicted } y \text{ value when } x = 18) = a + b(18) = 3249.25 \text{ grams}$$
The estimate of the mean weight and the prediction of an individual baby weight are
identical, because the same $x$ value was used in each calculation. However, their interpretations differ. One is the prediction of the weight of a single baby whose mother is 18, whereas
the other is an estimate of the mean weight of all babies born to 18-year-old mothers.
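The software result can be reproduced directly from the least squares formulas. The sketch below assumes the ten (age, weight) pairs are read from the data table of Example 16.2 in observation order; the function names `least_squares` and `slope_computational` are illustrative choices.

```python
# Data from Example 16.2: mother's age (yr) and baby's birth weight (g)
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]


def least_squares(xs, ys):
    """Return (a, b) using the definitional formulas for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def slope_computational(xs, ys):
    """Return b using the hand-computation formula based on sums."""
    n = len(xs)
    numerator = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    denominator = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return numerator / denominator


a, b = least_squares(ages, weights)
print(round(a, 2), round(b, 2))                    # -1163.45 245.15
print(round(slope_computational(ages, weights), 2))  # 245.15
print(round(a + b * 18, 2))                        # 3249.25
```

Both slope formulas give the same value, and substituting $x = 18$ returns the estimate (and prediction) of 3,249.25 grams.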
In Example 16.2, the x values in the sample ranged from 15 to 19. The estimated
regression equation should not be used to make an estimate or prediction for any x value
much outside this range. Without sample data for such values, or some clear theoretical
reason for expecting the relationship to be linear outside the observed range of x values,
you have no reason to believe that the estimated linear relationship continues outside the
range from 15 to 19. Making predictions outside this range can be misleading, and statisticians refer to this as the danger of extrapolation.
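One simple safeguard against extrapolation is to have a prediction routine refuse $x$ values outside the observed range. The sketch below is only an illustration of that idea: the function name `predict_birth_weight` and the choice to raise an error are assumptions, while the coefficients and the range 15 to 19 come from Example 16.2.

```python
# Estimated regression line and observed x range from Example 16.2
A, B = -1163.45, 245.15
X_MIN, X_MAX = 15, 19


def predict_birth_weight(age):
    """Predict birth weight (g) from mother's age (yr), refusing to extrapolate."""
    if not (X_MIN <= age <= X_MAX):
        raise ValueError(
            f"age {age} is outside the observed range {X_MIN} to {X_MAX}; "
            "the linear relationship may not hold there"
        )
    return A + B * age


print(round(predict_birth_weight(18), 2))  # 3249.25

try:
    predict_birth_weight(30)
except ValueError as err:
    print("Refused:", err)
```

Whether a hard error or a warning is more appropriate depends on the application; the point is that the fitted line by itself carries no information about ages far from 15 to 19.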
The value of $\sigma_e$ determines the extent to which observed points $(x, y)$ tend to fall close to
or far away from the population regression line. A point estimate of $\sigma_e$ is based on
$$\text{SSResid} = \sum (y - \hat{y})^2$$
where $\hat{y}_1 = a + bx_1, \ldots, \hat{y}_n = a + bx_n$ are the fitted or predicted $y$ values and the residuals
are $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$. SSResid is a measure of the extent to which the sample data spread
out around the estimated regression line.
Definition
The statistic for estimating the variance $\sigma_e^2$ is
$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$
where
$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$
The subscript $e$ in $\sigma_e^2$ and $s_e^2$ is a reminder that you are estimating the variance of the
errors or residuals.
The estimate of $\sigma_e$ is the estimated standard deviation
$$s_e = \sqrt{s_e^2}$$
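As an illustration of the definition, the sketch below computes SSResid and $s_e$ for the mother's age and birth weight data of Example 16.2. The text does not report $s_e$ for that example, so the numerical result here is a calculation made for illustration; the variable names are arbitrary.

```python
import math

# Data from Example 16.2: mother's age (yr) and baby's birth weight (g)
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
n = len(ages)

# Least squares estimates of slope and intercept
x_bar = sum(ages) / n
y_bar = sum(weights) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights)) / sum(
    (x - x_bar) ** 2 for x in ages
)
a = y_bar - b * x_bar

# SSResid from the definition: sum of squared residuals
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, weights))

# SSResid from the computational formula; it agrees only when a and b
# are carried at full precision (rounded coefficients can distort it badly)
ss_resid_shortcut = (
    sum(y * y for y in weights)
    - a * sum(weights)
    - b * sum(x * y for x, y in zip(ages, weights))
)

# Estimated standard deviation about the line, using n - 2 in the denominator
s_e = math.sqrt(ss_resid / (n - 2))

print(round(ss_resid, 2), round(ss_resid_shortcut, 2), round(s_e, 2))
```

For these data a typical deviation of a birth weight from the estimated line is roughly 205 grams.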
resulting data (from a scatterplot in the paper) is given in the accompanying table. The table
also includes the predicted values and residuals for the estimated regression line.
Girth (cm)   Weight (kg)   Predicted y Value   Residual
96           87            136.266             -38.2661
105          196           161.069             34.9314
108          163           169.336             -6.3361
109          196           172.092             23.9080
110          183           174.848             8.1522
114          171           185.871             -14.8711
121          230           205.162             24.8380
124          225           213.429             11.5705
131          211           232.720             -21.7203
135          231           243.744             -12.7436
137          225           249.255             -24.2553
138          266           252.011             13.9889
140          241           257.523             -16.5228
142          264           263.034             0.9655
157          284           304.372             -20.3720
157          292           304.372             -12.3720
159          300           309.884             -9.8837
155          337           298.860             38.1397
162          339           318.151             20.8488
The scatterplot (Figure 16.9) gives evidence of a strong positive linear relationship between
$x$ = chest girth (in cm)
and
$y$ = weight (in kg)
Figure 16.9: Scatterplot of weight (kg) versus girth (cm).
Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000
R-Sq = 86.5%   R-Sq(adj) = 85.7%
$$\hat{y} = -136 + 2.81x$$
$$r^2 = 0.865$$
$$s_e = 23.6626$$
Approximately 86.5% of the observed variation in elk weight y can be attributed to the
linear relationship between weight and chest girth. The magnitude of a typical deviation
from the least-squares line is about 23.6626 kg, which is relatively small in comparison to
the y values themselves.
Another important assumption of the simple linear regression model is that the
random deviations at any particular x value are normally distributed. In Section 16.3,
you will see how the residuals can be used to determine whether this assumption is
plausible.
Section 16.1 Exercises
Each exercise set assesses the following chapter learning objectives: C1, M1
Section 16.1 Exercise Set 1
16.1 Identify the following relationships as deterministic
or probabilistic:
a. The relationship between the length of the sides of a
square and its perimeter.
b. The relationship between the height and weight of an adult.
c. The relationship between SAT score and college freshman
GPA.
d. The relationship between tree height in centimeters and
tree height in inches.
16.2 Let x be the size of a house (in square feet) and y be
the amount of natural gas used (therms) during a specified
period. Suppose that for a particular community, x and y are
related according to the simple linear regression model with
house price (in dollars) and $x$ = house size (in square feet)
for houses in a large city. The population regression line is
$y = 23,000 + 47x$ and $\sigma_e = 5000$.
a. What is the average change in price associated with one
extra square foot of space? With an additional 100 sq. ft.
of space?
b. Approximately what proportion of 1800 sq. ft. homes
would be priced over $110,000? Under $100,000?
Section 16.1 Exercise Set 2
16.4 Identify the following relationships as deterministic
or probabilistic:
a. The relationship between height at birth and height at one
year of age.
b. The relationship between a positive number and its
square root.
c. The relationship between temperature in degrees
Fahrenheit and degrees centigrade.
d. The relationship between adult shoe size and shirt size.
16.5 The flow rate in a device used for air quality measurement depends on the pressure drop $x$ (inches of water) across
the device's filter. Suppose that for $x$ values between 5 and
20, these two variables are related according to the simple
linear regression model with population regression line
$y = -0.12 + 0.095x$.
a. What is the mean flow rate for a pressure drop of
10 inches? A drop of 15 inches?
b. What is the average change in flow rate associated with
a 1 inch increase in pressure drop? Explain.
16.6 The paper Predicting Yolk Height, Yolk Width,
Albumen Length, Eggshell Weight, Egg Shape Index, Eggshell
Thickness, Egg Surface Area of Japanese Quails Using
Various Egg Traits as Regressors (International Journal of
HRT Use
46.30    40.60    39.50    36.60    30.00
103.30   105.00   100.00   93.80    83.50
Research [1994]: 1089-1096) studied a number of
variables they thought might be related to bone mineral
density (BMD). The accompanying data on $x$ = weight
at age 13 and $y$ = bone mineral density at age 27 are
consistent with summary quantities for women given in the
paper.
Weight (kg)   BMD (g/cm^2)
54.4          1.15
59.3          1.26
74.6          1.42
62.0          1.06
73.7          1.44
70.8          1.02
66.8          1.26
66.7          1.35
64.7          1.02
71.8          0.91
69.7          1.28
64.7          1.17
62.1          1.12
68.5          1.24
58.3          1.00
Figure: Scatterplot of Emb Ves Diam (cm) versus Gest Age (days), with linear fit.
Figure: Scatterplot of BMD (g/cm^2) versus Weight (kg), with linear fit.

Linear Fit
BMD (g/cm^2) = 0.5584011 + 0.0094363*Weight (kg)

Summary of Fit (BMD versus Weight)
RSquare                      0.121081
RSquare Adj                  0.053472
Root Mean Square Error       0.155141
Mean of Response             1.18
Observations (or Sum Wgts)   15

Summary of Fit (Emb Ves Diam versus Gest Age)
RSquare                      0.792803
RSquare Adj                  0.780615
Root Mean Square Error       0.450587
Mean of Response             2.482526
Observations (or Sum Wgts)   19

Parameter Estimates
Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         -3.497279   0.748605    -4.67     0.0002*
Gest Age (days)   0.1903121   0.023597    8.07      <.0001*
Section 16.2 Inferences Concerning the Slope of the Population Regression Line
$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
The fact that $b$ is unbiased tells you that the sampling distribution is centered at the right
place, but it gives no information about variability. If $\sigma_b$ is large, the sampling distribution of
$b$ will be quite spread out around $\beta$ and an estimate far from the value of $\beta$ could result. For
$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
to be small, the numerator $\sigma_e$ should be small (little variability about the
population line) and/or the denominator $\sqrt{\sum (x - \bar{x})^2}$ should be large. Because $\sum (x - \bar{x})^2$ is a
measure of how much the observed $x$ values spread out, $\beta$ tends to be more precisely estimated
when the $x$ values in the sample are spread out rather than when they are close together. The
normality of the sampling distribution of $b$ implies that the standardized variable
$$z = \frac{b - \beta}{\sigma_b}$$
has a standard normal distribution. However, inferential methods cannot be based on this
statistic, because the value of $\sigma_b$ is not known (because the unknown $\sigma_e$ appears in the
numerator of $\sigma_b$). One way to proceed is to estimate $\sigma_e$ with $s_e$ to obtain an estimate of $\sigma_b$.
The estimated standard deviation of the statistic $b$ is
$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$
When the four basic assumptions of the simple linear regression model are satisfied,
the probability distribution of the standardized variable
$$t = \frac{b - \beta}{s_b}$$
is the $t$ distribution with df = $(n - 2)$.
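The effect of the spread of the $x$ values on the precision of $b$ can be seen numerically. In the sketch below, the value of `S_E` and both sets of $x$ values are hypothetical and chosen only for illustration; the function name `estimated_sd_of_slope` is likewise an assumption.

```python
import math


def estimated_sd_of_slope(s_e, xs):
    """Estimated standard deviation of b: s_e divided by sqrt(sum of (x - x_bar)^2)."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return s_e / math.sqrt(sxx)


# Hypothetical estimate of sigma_e, the same for both designs
S_E = 2.0

# Two hypothetical sets of x values with the same mean (10)
clustered = [9, 10, 10, 11, 10]  # x values close together
spread = [2, 6, 10, 14, 18]      # x values spread out

s_b_clustered = estimated_sd_of_slope(S_E, clustered)
s_b_spread = estimated_sd_of_slope(S_E, spread)

print(round(s_b_clustered, 3), round(s_b_spread, 3))
```

With the same variability about the line, the spread-out design gives a much smaller $s_b$, so the slope is estimated more precisely.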
In the same way that $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ was used in Chapter 12 to develop a confidence interval for $\mu$, the $t$ variable in the preceding box can be used to obtain a confidence interval for $\beta$.
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has
the form
$$b \pm (t \text{ critical value})s_b$$
where the $t$ critical value is based on df = $n - 2$. Appendix Table 3 gives critical
values corresponding to the most frequently used confidence levels.
The interval estimate of $\beta$ is centered at $b$ and extends out from the center by an amount
that depends on the sampling variability of $b$. When $s_b$ is small, the interval is narrow, implying that the investigator has relatively precise knowledge of the value of $\beta$. Calculation of a
confidence interval for the slope of a population regression line is illustrated in Example 16.4.
In Section 7.2, you learned four key questions that guide the decision about what statistical inference method to consider in any particular situation. In Section 7.3, a five-step
process for estimation problems was introduced.
The four key questions of Section 7.2 were
Q  Question Type
S  Study Type
T  Type of Data
N  Number of Samples or Treatments
researchers studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence
reproduction is stress due to accumulated snow, which makes foraging more difficult for
the pregnant bison. Data from 1981-1997 on
$y$ = spring calf ratio (SCR)
and
$x$ = previous fall snow-water equivalent (SWE)
are shown in the accompanying table. Spring calf ratio is the ratio of calves to adults, a
measure of reproductive success. The researchers were interested in estimating the mean
change in spring calf ratio associated with each additional cm in snow-water equivalent.
Let's answer the four key questions for this problem.
SCR    SWE      SCR    SWE
0.19   1,933    0.22   3,317
0.14   4,906    0.22   3,332
0.21   3,072    0.18   3,511
0.23   2,543    0.21   3,907
0.26   3,509    0.25   2,533
0.19   3,908    0.19   4,611
0.29   2,214    0.22   6,237
0.23   2,816    0.17   7,279
0.16   4,128
The answers are estimation, sample data, two numerical variables, one sample.
Q  Question Type                      Estimation
S  Study Type                         Sample data
T  Type of Data                       Two numerical variables
N  Number of Samples or Treatments    One sample
This combination of answers suggests considering a confidence interval for the slope of a population regression line. You can now use the five-step process (EMC$^3$) to estimate the slope
of the population regression line.
Step
Estimate: In this example, the value of $\beta$, the mean increase in spring calf ratio for each
additional 1 cm of snow-water equivalent, will be estimated.
Method: Because the answers to the four key questions are estimation, sample data,
two numerical variables, one sample, a confidence interval for $\beta$, the slope of
the population regression line, will be considered.
For this example, a 95% confidence level will be used.
Check: The four basic assumptions of the simple linear regression model need to be
met in order to use the confidence interval.
Figure: Scatterplot of SCR versus SWE, and a plot of the residuals.
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.2606561   0.023885    10.91     <.0001*
SWE         -0.013664   0.005989    -2.28     0.0375*
df = $n - 2 = 17 - 2 = 15$
The $t$ critical value for a 95% confidence level and df = 15 is 2.13.
$$b \pm (t \text{ critical value})s_b = -0.0137 \pm (2.13)(0.00599) = (-0.0265, -0.0009)$$
Communicate Results
Confidence interval:
You can be 95% confident that the true average change in spring calf
ratio associated with an increase of 1 cm in the snow-water equivalent is
between -0.0265 and -0.0009.
Confidence level:
The method used to construct this interval estimate is successful in
capturing the actual value of the slope of the population regression line about
95% of the time.
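The interval calculation in this example can be reproduced as follows. This is a minimal sketch: the function name `slope_confidence_interval` is an illustrative choice, and the $t$ critical value 2.13 is taken from the text rather than computed.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Confidence interval for the population slope: b +/- (t critical value) * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin


# Values from the bison example: b and s_b from the computer output,
# and the t critical value for 95% confidence with df = 15
low, high = slope_confidence_interval(b=-0.0137, s_b=0.00599, t_crit=2.13)

print(round(low, 4), round(high, 4))  # -0.0265 -0.0009
```

Both endpoints are negative, which is consistent with spring calf ratio tending to decrease as snow-water equivalent increases.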
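$$t = \frac{b - \beta_0}{s_b}$$
where $\beta_0$ is the hypothesized value from the null hypothesis.
Form of the null hypothesis: $H_0: \beta = \beta_0$
When the assumptions of the simple linear regression model are reasonable and
the null hypothesis is true, the $t$ test statistic has a $t$ distribution with df = $n - 2$.
Associated P-value:
When the alternative hypothesis is    The P-value is
$H_a: \beta > \beta_0$                Area under the $t$ curve to the right of the calculated value of the test statistic
$H_a: \beta < \beta_0$                Area under the $t$ curve to the left of the calculated value of the test statistic
$H_a: \beta \neq \beta_0$             2(area to the right of $t$) if $t$ is positive, or 2(area to the left of $t$) if $t$ is negative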
This test is a method you should consider when the answers to the four key questions
are hypothesis testing, sample data, two numerical variables, one sample. You would carry
out this test using the five-step process for hypothesis testing problems (HMC$^3$).
Inference for a population slope generally focuses on two questions:
1. Is the population slope different from zero?
2. What are plausible values for the population slope?
The question of plausible values can be addressed by calculating a confidence interval for
the population slope. The question of whether a population slope is equal to zero can be
answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$. This
test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression.
The default computer output for inference for a regression slope is for the model utility test.
When the null hypothesis of the model utility test is true, the population regression
line is a horizontal line, and the value of $y$ in the simple linear regression model does not
depend on $x$. That is,
$$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$$
If $\beta$ is in fact equal to 0, knowledge of $x$ will be of no use (it will have no "utility")
for predicting $y$. On the other hand, if $\beta$ is different from 0, there is a useful linear relationship between $x$ and $y$, and knowledge of $x$ is useful for predicting $y$. This is illustrated by
the scatterplots in Figure 16.10.
Figure 16.10: (a) $\beta = 0$ (slope = 0); (b) $\beta \neq 0$ (nonzero slope)
relationship between $x$ and $y$. If $H_0$ is rejected, you can conclude that the simple
linear regression model is useful for predicting $y$.
The test statistic is the $t$ ratio
$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$
It is recommended that the model utility test be carried out before using the estimated
regression line to make inferences.
Actual    Judged     Actual    Judged     Actual    Judged     Actual    Judged
Release   Release    Release   Release    Release   Release    Release   Release
1998      1997.2     1976      1983.3     1976      1988.0     1970      1985.4
1967      1973.7     2008      1995.0     2006      1996.7     1975      1985.9
1998      1996.3     1971      1979.8     1974      1985.4     1991      1993.3
1999      1993.3     1965      1976.8     2007      1999.8     2008      1995.4
1983      1985.4     1967      1975.0     1976      1987.2     1965      1977.6
1982      1988.0     1971      1978.0     1974      1977.6     1987      1990.7
1965      1970.2     1967      1978.0     1970      1982.8     1975      1986.3
1991      1992.8     1984      1983.3     1971      1976.3     1968      1986.7
1983      1984.1     1984      1989.8     1999      1988.5     1987      1988.0
1976      1979.3     1968      1976.7     1997      1994.1     2008      1990.2
1971      1975.4     1965      1978.5     2006      1995.4     1982      1991.1
1981      1984.6     1965      1977.2     1981      1989.3     1979      1983.7
1967      1973.7     1979      1986.7     2008      1993.7     2000      1989.8
2007      1997.2     1997      1996.3     1965      1981.1     2000      1991.1
actual release year and the average of the release years given by the students. The actual
release years ranged from 1965 (The Beatles, "Help") to 2008 (Katy Perry, "I Kissed a Girl").
Is there a relationship between the judged and actual release year for these songs? A
scatterplot of the data (Figure 16.11) suggests that there is a linear relation between these
two variables, but this can be confirmed using the model utility test.
With $x$ = actual release year and $y$ = judged release year, the equation of the estimated regression line is $\hat{y} = 1095 + 0.449x$. The five-step process for hypothesis testing
can be used to carry out the model utility test.
Figure 16.11: Scatterplot of judged release year versus actual release year.
Process Step
H Hypotheses
In the model utility test, the null hypothesis is that there is no useful relationship between the actual and the judged
release year: $H_0: \beta = 0$.
The alternative hypothesis specifies that there is a useful relationship: $\beta \neq 0$.
Hypotheses:
Null hypothesis: $H_0: \beta = 0$
Alternative hypothesis: $H_a: \beta \neq 0$
M Method
Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a
regression setting and one sample, a hypothesis test for the slope of a population regression line will be considered.
The test statistic for this test is
$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$
The value of 0 in the test statistic is the hypothesized value from the null hypothesis.
For this example, a significance level of 0.05 will be used.
Significance level:
$\alpha = 0.05$
C Check
In Section 16.3, you will see how to check to see if the four assumptions of the simple linear regression model
are reasonable. For this example, you can assume that these assumptions are reasonable and proceed with the
model utility test.
C Calculate
Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*
Test statistic:
$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$
Associated P-value:
From the computer output, P-value < 0.0001.
Because the P-value is less than the selected significance level, the null hypothesis is rejected.
Decision: Reject $H_0$.
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the
actual release year and the judged release year.
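The $t$ ratios reported in the computer output can be checked from the estimates and their standard errors. The sketch below uses the unrounded values shown in the two Parameter Estimates tables of this section; the function name `slope_t_statistic` is an illustrative choice, and the P-value is not computed here because that requires a $t$ distribution table or statistical software.

```python
def slope_t_statistic(b, s_b, beta_0=0.0):
    """Test statistic t = (b - beta_0) / s_b; beta_0 = 0 gives the model utility test."""
    return (b - beta_0) / s_b


# Song release year example (Actual Release row of the output)
t_songs = slope_t_statistic(b=0.449281, s_b=0.033321)

# Bison example (SWE row of the output)
t_bison = slope_t_statistic(b=-0.013664, s_b=0.005989)

print(round(t_songs, 2))  # 13.48
print(round(t_bison, 2))  # -2.28
```

A $t$ ratio far from 0, as in the song example, corresponds to a very small P-value and strong evidence of a useful linear relationship.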
Because the model utility test confirms that there is a useful linear relationship
between judged release year and actual release year, it would be reasonable to use the
estimated regression model to predict the judged release year for a given song based on
its actual release year. Of course, before you do this, you would also want to evaluate the
accuracy of predictions by looking at the value of $s_e$.
When $H_0: \beta = 0$ cannot be rejected using the model utility test at a reasonably small significance level, the search for a useful model must continue. One possibility is to relate $y$
to $x$ using a nonlinear model, an appropriate strategy if the scatterplot shows curvature.
Section 16.2 Exercises
Each exercise set assesses the following chapter learning objectives: C2, M3, M4, M5, M6, P1, P2
Section 16.2 Exercise Set 1
16.13 The standard deviation of the errors, $\sigma_e$, is an important part of the linear regression model.
a. What is the relationship between the value of $\sigma_e$ and the
value of the test statistic in a test of hypotheses about $\beta$?
b. What is the relationship between the value of $\sigma_e$ and the
width of a confidence interval for $\beta$?
2024
1.90
5038
3.96
905
2.44
3572
0.88
1157
0.37
327
20.90
378
0.49
191
1.01
Coef      SE Coef   T        P
60.0286   0.2466    243.45   0.000
0.00357   0.03823   0.09     0.931

DF   SS       MS       F      P
1    0.0023   0.0023   0.01   0.931
3    0.7857   0.2619
4    0.7880
linear regression model to relate $y$ = green biomass concentration (g/cm$^3$) to $x$ = elapsed time since snowmelt (days).
Section 16.2 Exercise Set 2
16.20 Consider a test of hypotheses about $\beta$, the population
slope in a linear regression model.
a. If you reject the null hypothesis, $\beta = 0$, what does this
mean in terms of a linear relationship between $x$ and $y$?
b. If you fail to reject the null hypothesis, $\beta = 0$, what does
this mean in terms of a linear relationship between $x$ and $y$?
16.21 Researchers studying pleasant touch sensations measured the firing frequency (impulses per second) of nerves that
were stimulated by a light brushing stroke on the forearm and
also recorded the subject's numerical rating of how pleasant
the sensation was. The accompanying data were read from a
graph in the paper "Coding of Pleasant Touch by Unmyelinated
Afferents in Humans" (Nature Neuroscience, April 12, 2009).

Firing Frequency      23   24   22   25   27   28   34   33   36   34
Pleasantness Rating  0.2  1.0  1.2  1.2  1.0  2.0  2.3  2.2  2.4  2.8
a. Estimate the mean change in pleasantness rating associated with an increase of 1 impulse per second in firing
frequency using a 95% confidence interval. Interpret the
resulting interval.
b. Carry out a hypothesis test to decide if there is convincing
evidence of a useful linear relationship between firing
frequency and pleasantness rating.
16.22 The largest commercial fishing enterprise in the
southeastern United States is the harvest of shrimp. In a
study described in the paper "Long-term Trawl Monitoring
of White Shrimp, Litopenaeus setiferus (Linnaeus), Stocks
within the ACE Basin National Estuarine Research Reserve,
South Carolina" (Journal of Coastal Research [2008]: 193-199),
Study  Control   CHI
  1      250     303
  2      360     491
  3      475     659
  4      525     683
  5      610     922
  6      740    1044
  7      880    1421
  8      920    1329
  9     1010    1481
 10     1200    1815
x   8.7   12.7  19.1  21.4  24.6  28.9  29.8  30.5
y   0.28  0.55  0.68  0.85  1.02  1.15  1.34  1.29

[Scatterplot with fitted line not reproduced: PeakPhotoVoltage versus %LightAbsorption]

Linear Fit
PeakPhotoVoltage = 0.082594 + 0.0446485*%LightAbsorption
Summary of Fit
RSquare                      0.982731
RSquare Adj                  0.980264
Root Mean Square Error       0.061117
Mean of Response             0.808889
Observations (or Sum Wgts)   9
Analysis of Variance
Parameter Estimates
Term               Estimate
Intercept          0.082594
%LightAbsorption   0.0446485

Additional Exercises
16.24 a. Explain the difference between the line y = a + bx
16.28 In anthropological studies, an important characteristic of fossils is cranial capacity. Frequently skulls
are at least partially decomposed, so it is necessary to
use other characteristics to obtain information about
capacity. One measure that has been used is the length
of the lambda-opisthion chord. The article "Vertesszollos
and the Presapiens Theory" (American Journal of Physical
Anthropology [1971]) reported the accompanying data for n

Chord length   78   75   78   81   84    86    87
Capacity      850  775  750  975  915  1015  1030
Linear Fit
y = 5.6452776 + 0.9797401*x
Summary of Fit
RSquare                      0.985289
RSquare Adj                  0.984954
Root Mean Square Error       12.48525
Mean of Response             0.791304
Observations (or Sum Wgts)   46
Lack of Fit
Analysis of Variance
Parameter Estimates
Term        Estimate
Intercept   5.6452776
x           0.9797401

Section 16.3 Checking Model Adequacy
Residual Analysis
If the deviations e1, e2, ..., en from the population line were available, they could be examined for any inconsistencies with model assumptions. For example, a normal probability
plot of these deviations would suggest whether or not the normality assumption was plausible. However, because these deviations are
$$e_1 = y_1 - (\alpha + \beta x_1)$$
$$\vdots$$
$$e_n = y_n - (\alpha + \beta x_n)$$
they can be calculated only if α and β are known. In practice, this will almost never be the
case. Instead, diagnostic checks must be based on the residuals
$$y_1 - \hat{y}_1 = y_1 - (a + b x_1)$$
$$\vdots$$
$$y_n - \hat{y}_n = y_n - (a + b x_n)$$
which are the deviations from the estimated regression line. When all model assumptions are
met, the mean value of the residuals at any particular x value is 0. Any observation that gives
a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.
Recall that a quantity is standardized by subtracting its mean value (0 in this case) and
dividing by its actual or estimated standard deviation:
$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$
$$\text{estimated standard deviation of the } i\text{th residual} = s_e\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}}$$
[Figure 16.12 not reproduced: scatterplot of Weight (kg) versus Girth (cm)]
The data, residuals, and the standardized residuals (computed using Minitab) are
given in Table 16.1. For the residual with the largest magnitude, 38.1397, the standardized
residual is 1.81294. That is, the residual is approximately 1.8 standard deviations above
its expected value of 0. This value is not particularly unusual in a sample of this size. Also
notice that for the negative residual with the largest magnitude, -38.2661, the standardized residual is -1.92313, still not unusual in a sample of this size. On the standardized
scale, no residual here is surprisingly large.
Table 16.1 Data, residuals, and standardized residuals for the elk data

Observation  Girth (cm)  Weight (kg)     ŷ       Residual   Standardized
                 x           y                                Residual
     1           96          98       136.266   -38.2661     -1.92313
     2          105         196       161.069    34.9314      1.68004
     3          108         163       169.336    -6.3361     -0.30135
     4          109         196       172.092    23.9080      1.13323
     5          110         183       174.848     8.1522      0.38517
     6          114         171       185.871   -14.8711     -0.69477
     7          121         230       205.162    24.8380      1.14452
     8          124         225       213.429    11.5705      0.53117
     9          131         211       232.720   -21.7203     -0.99323
    10          135         231       243.744   -12.7436     -0.58320
    11          137         225       249.255   -24.2553     -1.11135
    12          138         266       252.011    13.9889      0.64147
    13          140         241       257.523   -16.5228     -0.75921
    14          142         264       263.034     0.9655      0.04448
    15          157         284       304.372   -20.3720     -0.97540
    16          157         292       304.372   -12.3720     -0.59236
    17          159         300       309.884    -9.8837     -0.47699
    18          155         337       298.860    38.1397      1.81294
    19          162         339       318.151    20.8488      1.01967
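The residuals and standardized residuals in Table 16.1 can be reproduced, at least approximately, with a short Python sketch. It fits the least squares line to the girth and weight values listed in the table and then divides each residual by its estimated standard deviation. The function name `standardized_residuals` and the variable names are assumptions made for this illustration, and the sketch is not a description of how Minitab carries out the computation internally.

```python
from math import sqrt

# Girth (cm) and weight (kg) for the 19 elk, as listed in Table 16.1
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [98, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]


def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares; return (residuals, standardized residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # estimated slope
    a = y_bar - b * x_bar                # estimated intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    # Each residual is divided by s_e * sqrt(1 - 1/n - (x_i - x_bar)^2 / S_xx)
    standardized = [
        r / (s_e * sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx))
        for r, xi in zip(residuals, x)
    ]
    return residuals, standardized


residuals, standardized = standardized_residuals(girth, weight)
for i, (r, z) in enumerate(zip(residuals, standardized), start=1):
    print(f"{i:2d}  {r:9.4f}  {z:8.5f}")
```

Because the divisor depends on how far each x value is from the mean girth, two observations with equal residuals can have different standardized residuals.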
Next, consider the assumption of the normality of the e's. Figure 16.13 shows box plots of
the residuals and standardized residuals. The box plots are approximately symmetric and
there are no outliers, so the assumption of normally distributed errors seems reasonable.
[Figure 16.13 not reproduced: box plots of the residuals and of the standardized residuals]
Notice that the boxplots of the residuals and standardized residuals are nearly identical. While it is preferable to work with the standardized residuals, if you do not have access
to a computer package or calculator that will produce standardized residuals, a plot of the
unstandardized residuals should suffice.
A normal probability plot of the standardized residuals (or the residuals) is
another way to assess whether it is reasonable to assume that e1, e2,..., en all come
from the same normal distribution. An advantage of the normal probability plot, shown
in Figure 16.14, is that the value of each residual can be seen, which provides more
information about the distribution. The pattern in the normal probability plot of the
standardized residuals and the pattern in the normal probability plot of the residuals
for the elk data are reasonably straight, confirming that the assumption of normality of
the error distribution is reasonable. Also notice that the pattern in both normal probability plots is similar, so you don't need to construct both; either plot could be used.
[Figure 16.14 not reproduced: normal probability plots (normal score versus residual, and normal score versus standardized residual)]
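The pairs plotted in a normal probability plot can be sketched in Python as follows. This is a minimal illustration that assumes Blom's approximation, (i - 0.375)/(n + 0.25), for the normal scores; statistical packages may use a slightly different formula, so the values need not match any particular software exactly. The function name `normal_scores` is hypothetical, and the standardized residuals are the values listed in Table 16.1.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate expected standard normal order statistics (Blom's formula, an assumption)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


# Standardized residuals for the elk data, copied from Table 16.1
standardized = [-1.92313, 1.68004, -0.30135, 1.13323, 0.38517, -0.69477,
                1.14452, 0.53117, -0.99323, -0.58320, -1.11135, 0.64147,
                -0.75921, 0.04448, -0.97540, -0.59236, -0.47699, 1.81294,
                1.01967]

# Pair the smallest residual with the smallest normal score, and so on
pairs = list(zip(sorted(standardized), normal_scores(len(standardized))))
for value, score in pairs:
    print(f"{value:8.3f}  {score:7.3f}")
```

Plotting the second column against the first gives the normal probability plot; a roughly straight pattern is what supports the normality assumption.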
A plot of the (x, residual) pairs is called a residual plot, and a plot of the (x, standardized
residual) pairs is a standardized residual plot. Residual and standardized residual plots
typically exhibit the same general shapes. If you are using a computer package or graphing
calculator that calculates standardized residuals, the standardized residual plot is recommended. If not, it is acceptable to use the unstandardized residual plot instead.
A standardized residual plot or a residual plot is often helpful in identifying unusual
or highly influential observations and in checking for violations of model assumptions. A
desirable plot is one that exhibits no particular pattern (such as curvature or a much greater
spread in one part of the plot than in another) and that has no point that is far removed from
all the others. A point in the residual plot falling far above or far below the horizontal line
at height 0 corresponds to a large residual, which can indicate unusual behavior, such as a
recording error, a nonstandard experimental condition, or an atypical experimental subject.
A point with an x value that differs greatly from others in the data set could have exerted
excessive influence in determining the estimated regression line.
A standardized residual plot such as the one pictured in Figure 16.15(a) is desirable,
because no point lies much outside the horizontal band between -2 and 2 (so there is no
unusually large residual corresponding to an outlying observation). There is no point far to
the left or right of the others (which could indicate an observation that might greatly influence the estimated line), and there is no pattern to indicate that the model should somehow
be modified. When the plot has the appearance of Figure 16.15(b), the fitted model should
be changed to incorporate curvature (a nonlinear model).
The increasing spread from left to right in Figure 16.15(c) suggests that the
variance of y is not the same at each x value but rather increases with x. A straight-line model may still be appropriate, but the best-fit line should be obtained by using
weighted least squares rather than ordinary least squares. This involves giving more
weight to observations in the region exhibiting low variability and less weight to
observations in the region exhibiting high variability. A specialized regression analysis
textbook or a statistician should be consulted for more information on using weighted
least squares.
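To make the idea of weighting concrete, here is a minimal Python sketch of weighted least squares for a straight line. Everything in it is an illustrative assumption rather than a recommended procedure: the function name `weighted_least_squares` is hypothetical, the data are made up, and the weights 1/x² reflect an assumed situation in which the standard deviation of y is proportional to x. Choosing appropriate weights in practice requires the more specialized guidance mentioned above.

```python
def weighted_least_squares(x, y, w):
    """Return (a, b) minimizing the weighted sum of w_i * (y_i - a - b*x_i)^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w   # weighted mean of y
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data in which the spread of y grows with x
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.3, 7.6, 10.9, 11.2]

# Assumed weights: less weight where variability is believed to be larger
weights = [1 / xi ** 2 for xi in x]

a_w, b_w = weighted_least_squares(x, y, weights)
a_o, b_o = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights: ordinary least squares
print(f"weighted: a = {a_w:.3f}, b = {b_w:.3f}")
print(f"ordinary: a = {a_o:.3f}, b = {b_o:.3f}")
```

With equal weights the function reduces to ordinary least squares, which makes it easy to compare the two fitted lines.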
The standardized residual plots of Figures 16.15(d) and 16.15(e) show an outlier (a
point with a large standardized residual) and a potentially influential observation, respectively. Consider deleting the observation corresponding to such a point from the data set
and refitting a line. Substantial changes in estimates and various other quantities are a
signal that a more careful analysis should be carried out before proceeding.
FIGURE 16.15 Examples of standardized residual plots: (a) satisfactory plot; (b) plot suggesting curvature; (c) plot showing increasing spread; (d) plot showing a large residual; (e) plot showing a potentially influential observation. [Plots not reproduced.]
x (SNOW)   y (TEMP)   Standardized Residual
 13.00      -13.5          -0.11
 12.75      -15.7          -2.19
 16.70      -15.5          -0.36
 18.85      -14.7           1.23
 16.60      -16.1          -0.91
 15.35      -14.6          -0.12
 13.90      -13.4           0.34
 22.40      -18.9          -1.54
 16.20      -14.8           0.04
 16.70      -13.6           1.25
 13.65      -14.0          -0.28
 13.90      -12.0          -1.54
 14.75      -13.5           0.58
A simple linear regression analysis described in the article included r² = 0.52 and
r = -0.72, suggesting a significant linear relationship. This is confirmed by a model
utility test. The scatterplot and standardized residual plot are displayed in Figure 16.16.
There are no unusual patterns, although one standardized residual, -2.19, is a bit on the
large side. The most interesting feature is the observation (22.40, -18.9), corresponding
to a point far to the right of the others in these plots. This observation may have had a
substantial influence on the estimated regression line. The estimated slope when all 13
observations are included is b = -0.459, and s_b = 0.133. When the potentially influential observation is deleted, the estimate of β based on the remaining 12 observations is
b = -0.228. The change in slope is
$$\text{change in slope} = \text{original } b - \text{new } b = -0.459 - (-0.228) = -0.231$$
The change expressed in standard deviations is -0.231/0.133 = -1.74. Because b
has changed by substantially more than 1 standard deviation, the observation under consideration appears to be highly influential.
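The effect of deleting the potentially influential observation can be checked with a short Python sketch that refits the line with and without the point (22.40, -18.9). The data values are those listed for this example; the function name `least_squares_slope` and the variable names are assumptions made for illustration, and the value s_b = 0.133 is taken from the text rather than recomputed.

```python
# Data for Example 16.7 (x = SNOW, y = TEMP), as listed in the example
snow = [13.00, 12.75, 16.70, 18.85, 16.60, 15.35, 13.90,
        22.40, 16.20, 16.70, 13.65, 13.90, 14.75]
temp = [-13.5, -15.7, -15.5, -14.7, -16.1, -14.6, -13.4,
        -18.9, -14.8, -13.6, -14.0, -12.0, -13.5]


def least_squares_slope(x, y):
    """Return the slope b of the least squares line for the (x, y) pairs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / s_xx


b_all = least_squares_slope(snow, temp)

# Remove the observation with x = 22.40 and refit
i = snow.index(22.40)
b_without = least_squares_slope(snow[:i] + snow[i + 1:], temp[:i] + temp[i + 1:])

s_b = 0.133  # estimated standard deviation of b with all 13 observations (from the text)
change = b_all - b_without
print(f"slope with all 13 observations: {b_all:.3f}")
print(f"slope with 12 observations:     {b_without:.3f}")
print(f"change in standard deviations:  {change / s_b:.2f}")
```

Comparing the change in slope to s_b, as the example does, puts the shift on a scale that does not depend on the units of x and y.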
Figure 16.16 Plots for the data of Example 16.7: (a) Scatter plot (TEMP versus SNOW); (b) Standardized residual plot (STRESID versus SNOW), with the potentially influential observation labeled. [Plots not reproduced.]
In addition, r² based just on the 12 observations is only 0.13, and the t ratio for testing
β = 0 is not significant. Evidence for a linear relationship is much less conclusive in light
of this analysis. The investigators should seek a climatological explanation for the influential observation and collect more data, which could be used to find a more useful model.
Observation  Treadmill  Ski Time  Residual  Standardized
                                              Residual
     1          7.7       71.0      0.172       0.10
     2          8.4       71.4      2.206       1.13
     3          8.7       65.0     -3.494      -1.74
     4          9.0       68.7      0.906       0.44
     5          9.6       64.4     -1.994      -0.96
     6          9.6       69.4      3.006       1.44
     7         10.0       63.0     -2.461      -1.18
     8         10.2       64.6     -0.394      -0.19
     9         10.4       66.9      2.373       1.16
    10         11.0       62.6     -0.527      -0.27
    11         11.7       61.7      0.206       0.12

Figure 16.17 shows a normal probability plot of the standardized residuals and a standardized residual plot. The normal probability plot is quite straight, and the standardized
residual plot does not show evidence of any patterns or of increasing spread.
[Figure 16.17 not reproduced: (a) normal probability plot of the standardized residuals (standardized residual versus normal score); (b) standardized residual plot (standardized residual versus treadmill time)]
[Figure 16.18 not reproduced: (a) scatterplot of insertion depth versus height; (b) standardized residual plot (standardized residual versus height)]
Residual plots like the ones pictured in Figure 16.18(b) are desirable. No point lies
much outside the horizontal band between -2 and 2 (so there are no unusually large
residuals corresponding to outliers). There is no point far to the left or right of the others
(no observation that might be influential), and there is no pattern of curvature or differences in the variability of the residuals for different height values to indicate that the model
assumptions are not reasonable.
But consider what happens when the relationship between insertion depth and weight is
examined. A scatterplot of insertion depth and weight (kg) is shown in Figure 16.19(a), and a
standardized residual plot in Figure 16.19(b). While some curvature is evident in the original
scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection
of these plots suggests that along with curvature, the residuals may be more variable at larger
weights. When plots have this curved appearance and increasing variability in the residuals, the
linear regression model is not appropriate.
[Figure 16.19 not reproduced: (a) scatterplot of insertion depth versus weight; (b) standardized residual plot (standardized residual versus weight)]
behavior, is a cluster of males gathered in a relatively small area to exhibit courtship displays.
The female preference hypothesis asserts that females will prefer larger leks over smaller
leks, presumably because there are more males to choose from. The scatterplot and residual
plot in Figure 16.20 show the relationship between the number of females and the number
of males in observed leks of barking treefrogs. You can see that the unequal variance, which
is noticeable in the scatterplot, is even more evident in the residual plot. This indicates that
the assumptions of the linear regression model are not reasonable in this situation.
[Figure 16.20 not reproduced: (a) scatterplot of number of females versus number of males; (b) residual plot (residuals versus number of males)]
Section 16.3 Exercises
Each exercise set assesses the following chapter learning objectives: M2, M7
Exercise Set 1
16.30 The following graphs are based on data from an
experiment to assess the effects of logging on a squirrel
population in British Columbia (Effects of Logging Pattern
[Graphs not reproduced: plots with %Logged on the horizontal axis, including residual plots.]
[Plots not reproduced: ClutchSize versus Length (mm), with a normal quantile scale.]
16.32 Carbon aerosols have been identified as a contributing factor in a number of air quality problems.
In a chemical analysis of diesel engine exhaust, x =
mass (mg/cm²) and y = elemental carbon (mg/cm²) were
recorded (Comparison of Solvent Extraction and Thermal
set is ŷ = 31 + .737x. The accompanying table gives the
observed x and y values and the corresponding standardized residuals.
   x       y    St. resid.
 164.2    181      2.52
 161.8    170      1.72
 118.7    106     -1.07
 108.1    102     -0.75
  78.9     86     -0.27
 156.9    156      0.82
 230.9    193     -0.73
 248.8    204     -0.95
  89.4     91     -0.51
 387.8    310     -0.89
 109.8    115      0.27
 106.5    110      0.05
 102.4     98     -0.73
  76.4     97      0.85
 135.0    141      0.91
 111.4    132      1.64
  87.0     96      0.08
  97.6     94     -0.77
  79.7     77     -1.11
  64.2     76     -0.20
  89.4     89     -0.68
 131.7    128      0.00
 100.8     88     -1.49
  82.9     90     -0.18
 117.9    130      1.05
Product
Maximum Width
(cm)
Minimum Width
(cm)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
2.50
2.90
2.15
2.90
3.20
2.00
1.60
4.80
5.90
5.80
2.90
2.45
2.60
2.60
2.70
3.10
5.10
10.20
3.50
2.70
3.00
2.70
2.50
2.40
4.40
7.50
4.25
1.80
2.70
2.00
2.60
3.15
1.80
1.50
3.80
5.00
4.75
2.80
2.10
2.20
2.60
2.60
2.90
5.10
10.20
3.50
1.20
1.70
1.75
1.70
1.20
1.20
7.50
4.25
[Scatterplot not reproduced: Flowering date range versus Elevation]

Flowering Range versus Elevation: Tussilago Farfara

Elevation (Meters      Flowering
Above Sea Level)       Date Range
      23.3               33.4
       5.6               32.0
      55.6               31.9
     140.0               31.3
      31.1               28.1
     112.2               29.3
     106.7               28.4
      42.2               26.6
      75.6               24.9
     176.7               25.7
     126.7               24.7
     126.7               23.5
     176.7               23.2
     201.1               21.8
     133.3               22.3
      90.0               21.4
      41.1               19.7
     125.6               17.6
     477.8               17.6
Section 16.3 Exercise Set 2
16.35 In the study described in Exercise 16.31, the effect
of latitude on mean clutch size was investigated. Data
from various locations in Florida, Georgia, Alabama, and
Mississippi on y = mean clutch size and x = latitude were
measured. The scatterplot, standardized residual plot, and
several graphs of the standardized residuals are shown
below.
Does it appear that the assumptions of the simple linear
regression model are plausible? Explain your reasoning in
a few sentences.
[Plots for Exercise 16.35 not reproduced: scatterplot of mean clutch size versus latitude, standardized residual plot (standardized residual versus latitude), and normal probability plot of the standardized residuals.]
23  24  22  25  27  28  34  33  36  34
16.37 The accompanying scatterplot, based on 34 sediment samples with x = sediment depth (cm) and y = oil
and grease content (mg/kg), appeared in the article "Mined
Land Reclamation Using Polluted Urban Navigable Waterway
Sediments" (Journal of Environmental Quality [1984]: 415-422).
Standardized Residual  -1.83  0.04  1.45  0.20  -1.07  1.19  -0.24  -0.13  -0.81  1.17
Discuss the effect that the observation (20, 33,000) will have on
the estimated regression line. If this point were omitted, what
do you think will happen to the slope of the estimated regression line compared to the slope when this point is included?
[Scatterplot for Exercise 16.37 not reproduced: horizontal axis is subsample mean depth (cm).]
16.38 Investigators in northern Alaska periodically monitored radio-collared wolves in 25 wolf packs over 4 years,
keeping track of the packs' home ranges ("Population
Dynamics and Harvest Characteristics of Wolves in the Central
Brooks Range, Alaska," Wildlife Monographs [2008]: 1-25).
Additional Exercises
16.39 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical
analysis of diesel engine exhaust, x = mass (mg/cm²) and
y = elemental carbon (mg/cm²) were recorded ("Comparison
of Solvent Extraction and Thermal Optical Carbon Analysis
Methods: Application to Diesel Vehicle Exhaust Aerosol,"
Environmental Science & Technology [1984]: 231-234). The esti
[Plots not reproduced: carbon versus mass, home range versus locations per pack, standardized residual plots, and a histogram of standardized residuals.]
[Scatterplot not reproduced: Mean flowering date range versus Latitude (N)]

Flowering Range Versus Latitude: Anemone Hepatica

Latitude (N)   Flowering Date Range
    58.7             46.1
    58.2             35.9
    58.2             34.7
    59.4             32.3
    60.0             33.0
    59.4             29.7
    59.1             26.9
    59.3             26.2
    59.5             25.6
    59.5             27.6
    59.7             19.1
    59.8             24.4
    60.8             26.2
    60.9             26.8
    63.4             28.7
    63.4             19.2
    60.5             22.5
    60.7             17.9
    60.7             12.9
    61.1             11.8

[Residual plot not reproduced: Residual versus Target Angle]
 4   0.75  -0.28
15   0.55   0.24
 5   1.20   1.92
15   0.00  -2.05
 6   0.55  -0.90
19   0.35  -0.12
 9   0.60  -0.28
21   0.45   0.60
14   0.65   0.54
22   0.40   0.52

[Scatterplot not reproduced: Crown rump (cm) versus Gest Age (days)]
16.44 (C1)
Describe what distinguishes a deterministic model from a
probabilistic model.
16.45 (C2)
In the context of the simple linear regression model,
explain the difference between α and a. Between β and b.
Between σe and se.
16.46 (M1)
The SAT and ACT exams are often used to predict a
student's first-term college grade point average (GPA).
Different formulas are used for different colleges and
majors. Suppose that a student is applying to State U with
an intended major in civil engineering. Also suppose that
for this college and this major, the following model is used
to predict first-term GPA.
GPA = a + b(ACT)
a = 0.5
b = 0.1
a. In this context, what would be the appropriate interpretation of a?
b. In this context, what would be the appropriate interpretation of b?
16.47 (M2)
Theropods were carnivorous dinosaurs, characterized by
short forelimbs, living in the Jurassic and Cretaceous periods. (Tyrannosaurus rex is classified as a theropod.) What
scientists know about theropods is based on studying incomplete skeletal remains. In a study described in the paper
"My Theropod is Bigger than Yours...or not: Estimating Body
Size from Skull Length in Theropods" (Journal of Vertebrate
[Plots not reproduced: BodyLength versus SkullLength, a residual plot, and a normal quantile plot]

Linear Fit
BodyLength = 0.7061088 + 7.791973*SkullLength
Summary of Fit
RSquare                      0.953929
RSquare Adj                  0.951218
Root Mean Square Error       0.801042
Mean of Response             5.859474
Observations (or Sum Wgts)   19
Analysis Of Variance
Parameter Estimates
Term          Estimate    Std Error  t Ratio  Prob>|t|
Intercept     0.7061088   0.330485    2.14    0.0475*
SkullLength   7.791973    0.415318   18.76    <.0001*

16.48 (M3)
There are 4 basic assumptions necessary for making inferences about β, the slope of the population regression line.
a. What are the four assumptions?
b. Which assumptions can be checked using sample data?
c. What statistics or graphs would be used to check each of
the assumptions you listed in Part (b)?
Linear Fit
BroodSurvival = 0.9468008 - 0.0261902*StemDensity
Summary of Fit
RSquare                      0.193788
RSquare Adj                  0.155397
Root Mean Square Error       0.287538
Mean of Response             0.436043
Observations (or Sum Wgts)   23
Parameter Estimates
Term          Estimate    Std Error  t Ratio  Prob>|t|
Intercept     0.9468008   0.235108    4.03    0.0006*
StemDensity   -0.02619    0.011657   -2.25    0.0355*

a. Is there convincing evidence of a useful linear relationship between brood survival and stem density?
Explain.
b. Would you describe the relationship as strong? Why or
why not?
c. Construct a 95% confidence interval for β and interpret
it in context.
d. What margin of error is associated with the confidence
interval in part (c)?
16.50 (M7)
Researchers in Hawaii have recently documented a large
increase in the prevalence of a bird parasite known as chewing lice ("Explosive Increase in Ectoparasites in Hawaiian
Forest Birds," The Journal of Parasitology [2008]: 1009-1021).
Current data suggest that the prevalence of chewing lice
may be less for bird species with a high degree of bill
overhang. A species is said to have bill overhang when
the upper bill extends downward in front of the end of
the lower bill. The following scatterplot shows the
relationship between the prevalence of chewing lice and
bill overhang for 8 bird species in the Hawaiian Islands. A
residual plot is also shown. Use these plots to identify any
outliers or potentially influential observations. For each
point you identify, assess its influence on the estimated
slope of the regression line.
[Plots not reproduced: scatterplot of lice prevalence versus bill overhang, and residual plot (residual versus bill overhang)]
16.51 (M6)
Suppose you are given the computer output shown. You
want to test the hypothesis β = 1.0. Describe how you
would use the computer output to test this hypothesis.

Linear Fit
y = 5.6452776 + 0.9797401*x
Summary of Fit
RSquare                      0.985289
RSquare Adj                  0.984954
Root Mean Square Error       12.48525
Mean of Response             0.791304
Observations (or Sum Wgts)   46
Parameter Estimates
Term        Estimate    Std Error  t Ratio  Prob>|t|
Intercept   5.6452776   1.84302     3.06    0.0037*
x           0.9797401   0.018048   54.29    <.0001*
Technology Notes
Regression Test
TI-83/84
1. Enter the data for the independent variable into L1 (In order
to access lists press the STAT key, highlight the option called
Edit then press ENTER)
TI-Nspire
1. Enter the data into two separate data lists (In order to access
data lists select the spreadsheet option and press enter)
Note: Be sure to title the lists by selecting the top row of the
column and typing a title.
2. Press the menu key and select 4:Statistics then 4:Stat
Tests then A:Linear Reg t Test and press enter
3. In the box next to X List choose the list title where you
stored your independent data from the drop-down menu
4. In the box next to Y List choose the list title where you
stored your dependent data from the drop-down menu
5. In the box next to Alternate Hyp choose the appropriate
alternative hypothesis from the drop-down menu
6. Press OK
JMP
1. Input the data for the dependent variable into the first
column
2. Input the data for the independent variable into the second
column
3. Click Analyze and select Fit Y by X
4. Select the dependent variable (Y) from the box under Select
Columns and click on Y, Response
5. Select the independent variable (X) from the box under
Select Columns and click on X, Factor
6. Click the red arrow next to Bivariate Fit of and select
Fit Line
MINITAB
1. Input the data for the dependent variable into the first
column
2. Input the data for the independent variable into the second
column
3. Select Stat then Regression then Regression
4. Highlight the name of the column containing the dependent
variable and click Select
5. Highlight the name of the column containing the independent variable and click Select
6. Click OK
Review Questions
(A) 18.75% of the variability in service time can be explained by the linear relationship between service
call time and number of components needing repair.
(B) There is a positive correlation between service call
time and number of components needing repair.
(C) For every 1-component increase in the number of
components needing repair, the predicted service
call time increases by about 18.75 minutes.
(D) The magnitude of a typical difference between an observed service time and the service call time predicted
by the linear model is approximately 18.75 minutes.
(E) The average service call time is 18.75 minutes.
4. The value of se is 18.75. If the assumptions of the
simple linear regression model are satisfied, which of
the following is correct?
(A) The width of a 95% confidence interval for the slope
of the population regression line is 2(18.75) = 37.50.
(B) It would be unlikely that a prediction based on the
regression line will be greater than 18.75 minutes.
(C) It would be unlikely that a prediction based on the
regression line will differ from the actual value by
more than 2(18.75) = 37.50 minutes.
(D) Errors associated with predictions based on the regression line will always be less than 18.75 minutes.
(E) The value of se does not provide any information
about the anticipated magnitude of prediction errors.
5. Which of the following is a 95% confidence interval for
the change in service time associated with a 1-unit
increase in the number of components needing repair?
(A) 37.21 ± (1.96)(7.985)
(B) 37.21 ± (2.910)(7.985)
(C) 9.97 ± (1.96)(0.7218)
(D) 9.97 ± (2.10)(0.7218)
(E) 9.97 ± 2(18.7534)
6. If the basic assumptions of the simple linear regression
model are reasonable, what conclusion should be
reached regarding model utility if a significance level of
0.05 is used for the model utility test?
(A) There is convincing evidence of a negative linear
relationship between service call time and number
of components needing repair.
(B) There is convincing evidence that the model is not
useful for predicting service call time.
(C) There is convincing evidence that the model is useful for predicting service call time.
(D) There is not convincing evidence that the model is
useful for predicting service call time.
(E) A conclusion cannot be reached based on the given
information.
AP* and the Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.
(A) I only
(B) II only
(C) III only
(D) I and III only
(E) II and III only
[Scatterplot with labeled points not reproduced.]
9. Which of the labeled points would have the largest residual when a linear model is fit to the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D
10. Which of the labeled points corresponds to a potentially
influential observation if a linear model is to be fit to
the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D
[Standardized residual plots not reproduced.]

Mean Square   F Ratio   Prob > F
  1.31599     17.7834   0.0007*
  0.07400
Parameter Estimates
Term         Estimate    Std Error  t Ratio  Prob>|t|
Intercept    1.8928955   0.167575   11.30    <.0001*
Length(cm)   -0.010428   0.002473   -4.22    0.0007*
12. For this data set, the model utility test is based on how
many degrees of freedom?
(A) 15
(B) 16
(C) 17
(D) 18
(E) 19
13. What is the P-value associated with the model utility
test?
(A) 0.0001
(B) 0.0007
(C) 0.07400
(D) 0.167575
(E) 0.526395