Spring 2011
Chapter 14
Simple Linear Regression
[Figure: possible regression lines. Three panels plot E(y) against x: one where the slope β1 is positive (the regression line rises), one where the slope β1 is negative (the regression line falls), and one where the slope β1 is 0 (a horizontal line, no relationship). Each line crosses the vertical axis at the intercept β0.]
ŷ = b0 + b1x
Estimation Process

Regression Model
y = β0 + β1x + ε

Regression Equation
E(y) = β0 + β1x

Unknown Parameters
β0, β1

Sample Data:
x      y
x1     y1
.      .
.      .
xn     yn

Estimated Regression Equation
ŷ = b0 + b1x

Sample Statistics
b0, b1

b0 and b1 provide estimates of β0 and β1.
Least Squares Criterion

min Σ(yi − ŷi)²

where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable
n = total number of observations
b0 = ȳ − b1x̄

(x̄, ȳ, and n are defined as above)
ŷi = b0 + b1xi

where:
ŷi = estimated value of quarterly sales ($1000s) for the ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant
Σyi = 100, so ȳ = 100/5 = 20

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

b0 = ȳ − b1x̄ = 20 − 5(2) = 10

Estimated regression equation: ŷ = 10 + 5x
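As a check on the arithmetic above, the least squares formulas can be evaluated directly. The (x, y) pairs below are the Reed Auto Sales data (TV ads, cars sold) consistent with the summary values on these slides (n = 5, x̄ = 2, ȳ = 20); treat the raw pairs as an illustrative assumption.

```python
# Reed Auto Sales data: x = number of TV ads, y = number of cars sold.
# These pairs reproduce the totals used on the slides (x-bar = 2, y-bar = 20).
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(b0, b1)  # 10.0 5.0
```

This reproduces the estimated regression equation ŷ = 10 + 5x.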
[Figure: scatter diagram and estimated regression line for Reed Auto Sales. Cars Sold (y, from 0 to 25) is plotted against TV Ads (x); the fitted line is y = 5x + 10.]
Coefficient of Determination
Relationship Among SST, SSR, SSE

SST = SSR + SSE
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
r2 = SSR/SST = 100/114 = .8772
The regression relationship is very strong; 88%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold.
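The sums of squares and r² above can be verified numerically. The (x, y) pairs below are the Reed Auto Sales data consistent with the slide's summary values; treat the raw pairs as an illustrative assumption.

```python
x = [1, 3, 2, 1, 3]        # TV ads (Reed Auto data implied by the slides)
y = [14, 24, 18, 17, 27]   # cars sold
y_bar = sum(y) / len(y)
y_hat = [10 + 5 * xi for xi in x]   # estimated regression equation y = 10 + 5x

SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squares due to error
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)           # sum of squares due to regression
SST = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

assert SST == SSR + SSE    # SST = SSR + SSE
r_sq = SSR / SST
print(SST, SSR, SSE, round(r_sq, 4))  # 114.0 100.0 14 0.8772
```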
Coefficient of Determination

Calculation of SSE for Armand's Pizza Parlors

Coefficient of Determination

Calculation of SST for Armand's Pizza Parlors

Coefficient of Determination

Deviations about the Estimated Regression Line and the Line ȳ for Armand's Pizza Parlors

Coefficient of Determination

For the Armand's Pizza Parlors example, the value of the coefficient of determination is

r² = SSR/SST = 14,200/15,730 = .9027
When we express the coefficient of determination as a
percentage, r2 can be interpreted as the percentage of
the SST that can be explained by using the SSR.
90.27% of the variability in sales can be explained by
the linear relationship between the size of the student
population and sales.
rxy = (sign of b1)√r²

where:
b1 = the slope of the estimated regression equation ŷ = b0 + b1x
F Test

Both the t test and the F test require an estimate of σ², the variance of ε in the regression model.

SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²

s² = MSE = SSE / (n − 2)

s = √MSE = √(SSE / (n − 2))
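For the Reed Auto example the slides give SSE = 14 and n = 5, so the estimate of σ and σ² can be computed in one step:

```python
import math

SSE, n = 14.0, 5            # from the Reed Auto example (n - 2 = 3 d.f.)
mse = SSE / (n - 2)         # s^2 = MSE = SSE / (n - 2)
s = math.sqrt(mse)          # s, the standard error of the estimate
print(round(mse, 3), round(s, 5))  # 4.667 2.16025
```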
Sampling Distribution of b1

Expected Value
E(b1) = β1

Standard Deviation
σb1 = σ / √Σ(xi − x̄)²

Distribution Form: Normal

Estimated Standard Deviation of b1
sb1 = s / √Σ(xi − x̄)²
Test Statistic
t = b1 / sb1

where:
sb1 = s / √Σ(xi − x̄)²

Rejection Rule
Reject H0 if p-value < α
or t < −tα/2 or t > tα/2

where:
tα/2 is based on a t distribution with n − 2 degrees of freedom
1. Determine the hypotheses:
H0: β1 = 0
Ha: β1 ≠ 0

2. Specify the level of significance: α = .05

3. Select the test statistic: t = b1 / sb1

4. State the rejection rule: Reject H0 if p-value < .05 or |t| > 3.182 (with n − 2 = 3 degrees of freedom)

5. Compute the value of the test statistic:
t = b1 / sb1 = 5 / 1.08 = 4.63

6. Determine whether to reject H0:
t = 4.541 provides an area of .01 in the upper tail. Hence, the p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.
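The test statistic in steps 3 through 6 can be reproduced from quantities already on these slides (s = 2.16025, Σ(xi − x̄)² = 4, b1 = 5, t.025 = 3.182 with 3 d.f.):

```python
import math

# Reed Auto t test for H0: beta1 = 0 at alpha = .05 (3 degrees of freedom)
s = 2.16025                          # standard error of the estimate
ssx = 4.0                            # sum((xi - x_bar)^2)
s_b1 = s / math.sqrt(ssx)            # estimated standard deviation of b1
t = 5.0 / s_b1                       # t = b1 / s_b1
print(round(s_b1, 4), round(t, 2))   # 1.0801 4.63
assert t > 3.182                     # exceeds the critical value: reject H0
```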
Confidence Interval for β1

b1 ± tα/2 sb1

where tα/2 sb1 is the margin of error and b1 is the point estimator.

H0 is rejected if the hypothesized value of β1 is not included in the confidence interval for β1.

95% Confidence Interval for β1
b1 ± tα/2 sb1 = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44
H0: β1 = 0
Ha: β1 ≠ 0

Test Statistic
F = MSR/MSE

Rejection Rule
Reject H0 if
p-value < α
or F > Fα

where:
Fα is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator

H0: β1 = 0
Ha: β1 ≠ 0

F = MSR/MSE

Reject H0 if p-value < .05
or F > 10.13 (with 1 d.f.
in numerator and
3 d.f. in denominator)
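With SSR = 100 and SSE = 14 from this example, the F statistic follows directly:

```python
MSR = 100.0 / 1        # SSR / regression degrees of freedom
MSE = 14.0 / 3         # SSE / (n - 2)
F = MSR / MSE
print(round(F, 2))     # 21.43
assert F > 10.13       # exceeds the critical value: reject H0
```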
ŷp ± tα/2 sind

where:
the confidence coefficient is 1 − α and
tα/2 is based on a t distribution
with n − 2 degrees of freedom
Point Estimation
If 3 TV ads are run prior to a sale, we expect
the mean number of cars sold to be:
ŷ = 10 + 5(3) = 25 cars
Estimate of the standard deviation of ŷp:

sŷp = s √(1/n + (xp − x̄)² / Σ(xi − x̄)²)
    = 2.16025 √(1/5 + (3 − 2)²/4)
    = 1.4491

Confidence interval for E(yp):

ŷp ± tα/2 sŷp
25 ± 3.1824(1.4491)
25 ± 4.61, or 20.39 to 29.61 cars
Estimate of the standard deviation of an individual value of yp:

sind = s √(1 + 1/n + (xp − x̄)² / Σ(xi − x̄)²)
     = 2.16025 √(1 + 1/5 + 1/4)
     = 2.16025(1.20416) = 2.6013
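Both intervals at xp = 3 TV ads can be computed from the quantities on these slides (s = 2.16025, n = 5, x̄ = 2, Σ(xi − x̄)² = 4, t.025 = 3.1824 with 3 d.f.):

```python
import math

s, n, x_bar, ssx = 2.16025, 5, 2.0, 4.0
x_p = 3                              # 3 TV ads
y_p = 10 + 5 * x_p                   # point estimate: 25 cars
t_crit = 3.1824                      # t value, .025 in each tail, 3 d.f.

# standard error of the mean estimate at x_p (confidence interval for E(yp))
s_yp = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ssx)
ci = (y_p - t_crit * s_yp, y_p + t_crit * s_yp)

# standard error of an individual prediction at x_p (prediction interval)
s_ind = s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ssx)
pi = (y_p - t_crit * s_ind, y_p + t_crit * s_ind)

print([round(v, 2) for v in ci])  # [20.39, 29.61]
print([round(v, 2) for v in pi])  # [16.72, 33.28]
```

These match the 95% C.I. and 95% P.I. reported in the Minitab output for this example.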
Computer Solution
Performing the regression analysis computations without the help of a computer can be quite time-consuming.
On the next slide we show Minitab output for the
Reed Auto Sales example.
Recall that the independent variable was named
Ads and the dependent variable was named Cars
in the example.
Computer Solution

The regression equation is
Cars = 10.0 + 5.00 Ads

Predictor    Coef     SE Coef   T      P
Constant     10.000   2.366     4.23   0.024
Ads          5.0000   1.0801    4.63   0.019

S = 2.16025   R-sq = 87.7%   R-sq(adj) = 83.6%

Analysis of Variance

SOURCE           DF   SS    MS      F       P
Regression       1    100   100     21.43   0.019
Residual Error   3    14    4.667
Total            4    114

New Obs   Fit     SE Fit   95% C.I.         95% P.I.
1         25.00   2.60     (20.39, 29.61)   (16.72, 33.28)
Residual Analysis
If the assumptions about the error term appear
questionable, the hypothesis tests about the
significance of the regression relationship and the
interval estimation results may not be valid.
The residuals provide the best information about ε.

Residual for Observation i
yi − ŷi
Much of the residual analysis is based on an
examination of graphical plots.
Residual Analysis
Assumptions about the error term ε:
1. E(ε) = 0.
2. The variance of ε, denoted by σ², is the same for all values of x.
3. The values of ε are independent.
4. The error term ε has a normal distribution.
[Figure: residual plots against ŷ. In a good pattern, the residuals form a horizontal band around zero; under nonconstant variance, the spread of the residuals changes with ŷ.]
[Figure: histogram of the residuals (response is m1), with frequencies up to 70 for residual values from −0.2 to 0.4.]
[Figure: normal probability plot of the standardized residuals (SRES1); Average = −0.0021487, StDev = 1.00620, N = 144.]
Residuals

Observation   Predicted ŷ   Residual
1             15            −1
2             25            −1
3             20            −2
4             15             2
5             25             2

[Figure: residual plot against TV Ads; the residuals range from −2 to 2.]
Residual Analysis: Outliers and Influential Observations

An outlier is a data point (observation) that does not fit the trend shown by the remaining data.

Outliers represent observations that are suspect and warrant careful examination. They may represent erroneous data.
Residual Analysis: Outliers and Influential Observations
An example of an influential observation in simple
linear regression
The estimated regression line has a negative slope. If
the influential observation were dropped from the data
set, the slope would change from negative to positive
and the y-intercept would be smaller.
Residual Analysis: Outliers and Influential Observations
Observations with extreme values for the independent variables are called high leverage points.

The leverage of an observation is determined by how far the values of the independent variables are from their mean values.
Standardized Residuals

Standardized residual for observation i: (yi − ŷi) / s(yi − ŷi)

where:
s(yi − ŷi) = s √(1 − hi)

hi = 1/n + (xi − x̄)² / Σ(xi − x̄)²
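The leverage and standardized-residual formulas above can be applied to the Reed Auto example. The raw (x, y) pairs are the data consistent with this chapter's summary values and are an illustrative assumption; s = 2.16025 comes from the slides.

```python
import math

x = [1, 3, 2, 1, 3]          # TV ads (Reed Auto data implied by the slides)
y = [14, 24, 18, 17, 27]     # cars sold
n = len(x)
x_bar = sum(x) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
s = 2.16025                  # standard error of the estimate (from the slides)

std_res = []
for xi, yi in zip(x, y):
    h = 1 / n + (xi - x_bar) ** 2 / ssx           # leverage h_i
    s_res = s * math.sqrt(1 - h)                  # std. deviation of residual i
    std_res.append((yi - (10 + 5 * xi)) / s_res)  # standardized residual

print([round(r, 2) for r in std_res])  # [-0.62, -0.62, -1.04, 1.25, 1.25]
```

All five standardized residuals lie well inside the ±2 band, so none of the observations would be flagged as an outlier by the rule discussed below.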
Standardized Residuals

[Figure: plot of the standardized residuals for the Reed Auto example (spreadsheet RESIDUAL OUTPUT, observations 1 to 5, predicted ŷ values 15, 25, 20, 15, 25); all standardized residuals fall between −1.5 and +1.5.]
Detecting Outliers
An outlier is an observation that is unusual in
comparison with the other data.
Minitab classifies an observation as an outlier if
its standardized residual value is < -2 or > +2.
This standardized residual rule sometimes fails
to identify an unusually large observation as
being an outlier.
This rule's shortcoming can be circumvented
by using studentized deleted residuals.
The |ith studentized deleted residual| will be
larger than the |ith standardized residual|.
End of Chapter 14