

Spring 2011

Chapter 14
Simple Linear Regression

Simple Linear Regression Model


Least Squares Method
Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation
for Estimation and Prediction
Computer Solution
Residual Analysis: Validating Model Assumptions

Simple Linear Regression Model


Regression terminology
dependent variable (y): the variable being predicted
independent variables (x): the variable or variables
being used to predict the value of the dependent
variable.
Example
In a study of the effect of advertising expenditures on sales, a
marketing manager's desire to predict sales would suggest making
sales the dependent variable. Advertising expenditure would be the
independent variable used to help predict sales.

Simple Linear Regression Model


Simple linear regression involves one independent variable and
one dependent variable. The relationship between the two
variables is approximated by a straight line.
Multiple regression analysis: regression analysis involving
two or more independent variables.

Simple Linear Regression Model


The equation that describes how y is related to x and
an error term is called the regression model.
The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.

Simple Linear Regression Equation


The simple linear regression equation is:

E(y) = β0 + β1x

The graph of the regression equation is a straight line.
β0 is the y-intercept of the regression line.
β1 is the slope of the regression line.
E(y) is the expected value of y for a given x value.

Simple Linear Regression Equation

Positive Linear Relationship


[Graph: regression line with intercept β0 and positive slope β1; E(y) plotted against x]

Simple Linear Regression Equation

Negative Linear Relationship


[Graph: regression line with intercept β0 and negative slope β1; E(y) plotted against x]

Simple Linear Regression Equation

No Relationship
[Graph: horizontal regression line with intercept β0 and slope β1 = 0; E(y) plotted against x]

Estimated Simple Linear Regression Equation


The estimated simple linear regression equation is:

ŷ = b0 + b1x

The graph is called the estimated regression line.
b0 is the y-intercept of the line.
b1 is the slope of the line.
ŷ is the estimated value of y for a given x value.

Estimation Process
Regression Model
y = β0 + β1x + ε
Regression Equation
E(y) = β0 + β1x
Unknown Parameters
β0, β1

Sample Data:
x    y
x1   y1
.    .
.    .
xn   yn

Estimated Regression Equation
ŷ = b0 + b1x
Sample Statistics
b0, b1

b0 and b1 provide estimates of β0 and β1

Least Squares Method


Least Squares Criterion

min Σ(yi - ŷi)²

where:
yi = observed value of the dependent variable
for the ith observation
ŷi = estimated value of the dependent variable
for the ith observation

Least Squares Method


Slope for the Estimated Regression Equation

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable
n = total number of observations

Least Squares Method

y-Intercept for the Estimated Regression Equation

b0 = ȳ - b1x̄

where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable
n = total number of observations
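As an illustration, here is a minimal Python sketch of these two formulas; the function name least_squares is ours, not from the slides.

```python
# Least squares estimates b0 and b1 for simple linear regression.
# A minimal sketch of the formulas above, not a library implementation.
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n                     # mean of the independent variable
    y_bar = sum(y) / n                     # mean of the dependent variable
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx                         # slope
    b0 = y_bar - b1 * x_bar                # y-intercept
    return b0, b1
```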

Simple Linear Regression Equation


Example: Suppose data were collected from a sample of
10 Armand's Pizza Parlors restaurants located near
college campuses.
Data

Simple Linear Regression Equation


Scatter plot: for these data, the relationship between the
size of the student population and quarterly sales
appears to be approximated by a straight line.

Simple Linear Regression Equation


For the ith restaurant, the estimated regression
equation provides

ŷi = b0 + b1xi

where
ŷi = estimated value of quarterly sales ($1000s) for the ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant

Simple Linear Regression Equation


Calculations for the least squares estimated
regression equation for Armand's Pizza Parlors

Simple Linear Regression Equation


The estimated regression equation is
ŷ = 60 + 5x
Graph

Simple Linear Regression

Example: Reed Auto Sales


Reed Auto periodically has
a special week-long sale.
As part of the advertising
campaign Reed runs one or
more television commercials
during the weekend preceding the sale. Data from a
sample of 5 previous sales are shown on the next slide.

Simple Linear Regression

Example: Reed Auto Sales


Number of TV Ads (x)   Number of Cars Sold (y)
        1                       14
        3                       24
        2                       18
        1                       17
        3                       27
     Σx = 10                 Σy = 100
     x̄ = 2                   ȳ = 20

Estimated Regression Equation


Slope for the Estimated Regression Equation

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 20/4 = 5

y-Intercept for the Estimated Regression Equation

b0 = ȳ - b1x̄ = 20 - 5(2) = 10

Estimated Regression Equation

ŷ = 10 + 5x
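Plugging the Reed Auto data into the least_squares sketch above reproduces these values; np.polyfit is shown only as an independent check.

```python
import numpy as np

x = [1, 3, 2, 1, 3]       # number of TV ads
y = [14, 24, 18, 17, 27]  # number of cars sold

b0, b1 = least_squares(x, y)  # from the earlier sketch
print(b0, b1)                 # 10.0 5.0, i.e. y-hat = 10 + 5x

# Independent check: polyfit returns [slope, intercept] for degree 1.
print(np.polyfit(x, y, 1))    # [ 5. 10.]
```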

Scatter Diagram and Trend Line


[Scatter diagram: cars sold plotted against TV ads, with trend line y = 5x + 10]

Coefficient of Determination
Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Coefficient of Determination

The coefficient of determination is:


r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares

Coefficient of Determination
r² = SSR/SST = 100/114 = .8772
The regression relationship is very strong; about 88%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold.
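Continuing the sketch, the sums of squares and r² for the Reed Auto data, along with the sample correlation coefficient covered on the following slides:

```python
import math

y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 14
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # 100
sst = sum((yi - y_bar) ** 2 for yi in y)               # 114 = SSR + SSE

r2 = ssr / sst                                         # 0.8772
r_xy = math.copysign(math.sqrt(r2), b1)                # +0.9366, sign of b1
```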

Coefficient of Determination
Calculation of SSE for Armand's Pizza Parlors

Coefficient of Determination
Calculation of SST for Armand's Pizza Parlors

Coefficient of Determination
Deviations about the Estimated Regression Line and
the Line y = ȳ for Armand's Pizza Parlors

Coefficient of Determination
For the Armand's Pizza Parlors example, the value of
the coefficient of determination is
r² = SSR/SST = 14,200/15,730 = .9027
When we express the coefficient of determination as a
percentage, r² can be interpreted as the percentage of
the total sum of squares that can be explained by the
estimated regression equation.
90.27% of the variability in sales can be explained by
the linear relationship between the size of the student
population and sales.

Sample Correlation Coefficient


rxy = (sign of b1)√(Coefficient of Determination)

rxy = (sign of b1)√r²

where:
b1 = the slope of the estimated regression
equation ŷ = b0 + b1x

Sample Correlation Coefficient


rxy = (sign of b1)√r²
The sign of b1 in the equation ŷ = 10 + 5x is +.
rxy = +√.8772
rxy = +.9366

Assumptions About the Error Term


1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for
all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.

Assumptions About the Error Term


Implications
1. β0 and β1 are constants, and E(y) = β0 + β1x.
2. The variance of y about the regression line equals
σ² and is the same for all values of x.
3. The value of ε for a particular value of x is
not related to the value of ε for any other
value of x; thus, the value of y for a particular value
of x is not related to the value of y for any other value
of x.
4. Because y is a linear function of ε, y is also a normally
distributed random variable.

Assumptions About the Error Term


Assumptions for the Regression Model

Testing for Significance


To test for a significant regression relationship, we
must conduct a hypothesis test to determine whether
the value of β1 is zero.
Two tests are commonly used:
t Test and F Test
Both the t test and F test require an estimate of σ²,
the variance of ε in the regression model.

Testing for Significance


An Estimate of σ²
The mean square error (MSE) provides the estimate
of σ², and the notation s² is also used.

s² = MSE = SSE/(n - 2)

where:

SSE = Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)²

Testing for Significance


An Estimate of σ
To estimate σ we take the square root of s².
The resulting s is called the standard error of
the estimate.

s = √MSE = √(SSE/(n - 2))
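A two-line continuation of the sketch computes s² and s for the Reed Auto data:

```python
n = len(x)
mse = sse / (n - 2)   # s^2 = 14/3 = 4.667 (mean square error)
s = mse ** 0.5        # s = 2.16025 (standard error of the estimate)
```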

Sampling Distribution of b1
Expected Value

E(b1) = β1

Standard Deviation

σb1 = σ / √Σ(xi - x̄)²

Distribution Form: Normal

Estimated Standard Deviation of b1

sb1 = s / √Σ(xi - x̄)²

Testing for Significance: t Test


Hypotheses
H0: β1 = 0
Ha: β1 ≠ 0

Test Statistic

t = b1 / sb1

where

sb1 = s / √Σ(xi - x̄)²

Testing for Significance: t Test

Rejection Rule
Reject H0 if p-value < α
or t < -tα/2 or t > tα/2
where:
tα/2 is based on a t distribution
with n - 2 degrees of freedom

Testing for Significance: t Test


1. Determine the hypotheses.
H0: β1 = 0
Ha: β1 ≠ 0

2. Specify the level of significance.
α = .05

3. Select the test statistic.
t = b1 / sb1

4. State the rejection rule.
Reject H0 if p-value < .05
or |t| > 3.182 (with 3 degrees of freedom)

Testing for Significance: t Test


5. Compute the value of the test statistic.

t = b1 / sb1 = 5 / 1.08 = 4.63

6. Determine whether to reject H0.
t = 4.541 provides an area of .01 in the upper
tail. Hence, the p-value is less than .02. (Also,
t = 4.63 > 3.182.) We can reject H0.
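The same t test, sketched in Python; scipy.stats is used only to read the p-value off the t distribution.

```python
from scipy import stats

sxx = sum((xi - sum(x) / n) ** 2 for xi in x)   # 4
s_b1 = s / sxx ** 0.5                           # 2.16025/2 = 1.08
t_stat = b1 / s_b1                              # 4.63
# Two-tailed p-value with n - 2 = 3 degrees of freedom (about .019).
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```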

Testing for Significance: t Test


Summary

Confidence Interval for 1


We can use a 95% confidence interval for β1 to test
the hypotheses just used in the t test.
H0 is rejected if the hypothesized value of β1 is not
included in the confidence interval for β1.

Confidence Interval for 1


The form of a confidence interval for β1 is:

b1 ± tα/2 sb1

where b1 is the point estimator and tα/2 sb1 is the
margin of error;
tα/2 is the t value providing an area of α/2 in the
upper tail of a t distribution with n - 2 degrees
of freedom.

Confidence Interval for 1


Rejection Rule
Reject H0 if 0 is not included in
the confidence interval for β1.

95% Confidence Interval for β1

b1 ± tα/2 sb1 = 5 ± 3.182(1.08) = 5 ± 3.44

or 1.56 to 8.44

Conclusion
0 is not included in the confidence interval.
Reject H0.
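The interval can be computed directly from the quantities in the sketch above:

```python
from scipy import stats

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # 3.182 for 3 d.f.
margin = t_crit * s_b1                         # 3.182 * 1.08 = 3.44
ci_low, ci_high = b1 - margin, b1 + margin     # (1.56, 8.44)
# 0 is outside the interval, so H0: beta_1 = 0 is rejected.
```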

Testing for Significance: F Test


Hypotheses

H0: β1 = 0
Ha: β1 ≠ 0

Test Statistic
F = MSR/MSE

Testing for Significance: F Test

Rejection Rule
Reject H0 if
p-value < α
or F > Fα
where:
Fα is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator

Testing for Significance: F Test


1. Determine the hypotheses.
H0: β1 = 0
Ha: β1 ≠ 0

2. Specify the level of significance.
α = .05

3. Select the test statistic.
F = MSR/MSE

4. State the rejection rule.
Reject H0 if p-value < .05
or F > 10.13 (with 1 d.f. in numerator and
3 d.f. in denominator)

Testing for Significance: F Test


5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
F = 17.44 provides an area of .025 in the upper
tail. Thus, the p-value corresponding to F = 21.43
is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude
that we have a significant relationship between the
number of TV ads aired and the number of cars sold.
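The F statistic follows from the sums of squares already computed; scipy supplies the upper-tail area.

```python
from scipy import stats

msr = ssr / 1                              # 100 (1 d.f. in the numerator)
f_stat = msr / mse                         # 100/4.667 = 21.43
# Upper-tail p-value with (1, n - 2) degrees of freedom (about .019).
p_value = stats.f.sf(f_stat, dfn=1, dfd=n - 2)
```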

Testing for Significance: F Test


F test Summary

Testing for Significance: F Test


General Form of the ANOVA Table for Simple Linear
Regression

Some Cautions about the Interpretation of Significance Tests

Rejecting H0: β1 = 0 and concluding that the
relationship between x and y is significant does
not enable us to conclude that a cause-and-effect
relationship is present between x and y.
Just because we are able to reject H0: β1 = 0 and
demonstrate statistical significance does not enable
us to conclude that there is a linear relationship
between x and y.

Some Cautions about the Interpretation of Significance Tests

Example of Linear Approximation of a Nonlinear
Relationship

Using the Estimated Regression Equation
for Estimation and Prediction

Confidence Interval Estimate of E(yp)

ŷp ± tα/2 sŷp

Prediction Interval Estimate of yp

ŷp ± tα/2 sind

where:
the confidence coefficient is 1 - α and
tα/2 is based on a t distribution
with n - 2 degrees of freedom

Point Estimation
If 3 TV ads are run prior to a sale, we expect
the mean number of cars sold to be:
ŷ = 10 + 5(3) = 25 cars

Confidence Interval for E(yp)


Estimate of the Standard Deviation of ŷp

sŷp = s √(1/n + (xp - x̄)² / Σ(xi - x̄)²)

sŷp = 2.16025 √(1/5 + (3 - 2)² / ((1-2)² + (3-2)² + (2-2)² + (1-2)² + (3-2)²))

sŷp = 2.16025 √(1/5 + 1/4) = 1.4491

Confidence Interval for E(yp)


The 95% confidence interval estimate of the mean
number of cars sold when 3 TV ads are run is:

ŷp ± tα/2 sŷp
25 ± 3.1824(1.4491)
25 ± 4.61 = 20.39 to 29.61 cars

Prediction Interval for yp

Estimate of the Standard Deviation of an
Individual Value of yp

sind = s √(1 + 1/n + (xp - x̄)² / Σ(xi - x̄)²)

sind = 2.16025 √(1 + 1/5 + 1/4)
sind = 2.16025(1.20416) = 2.6013

Prediction Interval for yp


The 95% prediction interval estimate of the
number of cars sold in one particular week when
3 TV ads are run is:

ŷp ± tα/2 sind
25 ± 3.1824(2.6013)
25 ± 8.28 = 16.72 to 33.28 cars
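Both intervals at xp = 3, continuing the sketch (variable names are ours):

```python
x_p = 3
x_bar = sum(x) / n
y_p = b0 + b1 * x_p                       # 25

h = 1 / n + (x_p - x_bar) ** 2 / sxx      # 1/5 + 1/4
s_yp = s * h ** 0.5                       # 1.4491: std. dev. of the mean response
s_ind = s * (1 + h) ** 0.5                # 2.6013: std. dev. of an individual value

ci = (y_p - t_crit * s_yp, y_p + t_crit * s_yp)    # (20.39, 29.61)
pi = (y_p - t_crit * s_ind, y_p + t_crit * s_ind)  # (16.72, 33.28)
```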

Computer Solution
Performing the regression analysis computations
without the help of a computer can be quite
time-consuming.
On the next slide we show Minitab output for the
Reed Auto Sales example.
Recall that the independent variable was named
Ads and the dependent variable was named Cars
in the example.

Computer Solution
The regression equation is
Cars = 10 + 5.00 Ads

Predictor   Coef     SE Coef   T      P
Constant    10.000   2.366     4.23   0.024
Ads         5.0000   1.0801    4.63   0.019

S = 2.2   R-sq = 87.7%   R-sq(adj) = 83.6%

Analysis of Variance

SOURCE           DF   SS    MS      F       P
Regression        1   100   100     21.43   0.019
Residual Error    3    14   4.667
Total             4   114

Predicted Values for New Observations

New Obs   Fit     SE Fit   95% C.I.         95% P.I.
1         25.00   2.60     (20.39, 29.61)   (16.72, 33.28)
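For comparison, a sketch of the same fit with Python's statsmodels package, which prints an output table similar in spirit to Minitab's (assuming statsmodels is installed):

```python
import statsmodels.api as sm

ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]

X = sm.add_constant(ads)       # adds the intercept column
model = sm.OLS(cars, X).fit()
print(model.summary())         # coefficients, t and F tests, R-sq, ANOVA
```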

Residual Analysis
If the assumptions about the error term ε appear
questionable, the hypothesis tests about the
significance of the regression relationship and the
interval estimation results may not be valid.
The residuals provide the best information about ε.
Residual for Observation i

yi - ŷi

Much of the residual analysis is based on an
examination of graphical plots.

Residual Analysis
Assumptions about the error term ε
1. E(ε) = 0.
2. The variance of ε, denoted by σ², is
the same for all values of x.
3. The values of ε are independent.
4. The error term ε has a normal distribution.

Residual Plot Against x

If the assumption that the variance of ε is the same for
all values of x is valid, and the assumed regression
model is an adequate representation of the relationship
between the variables, then the residual plot should
give an overall impression of a horizontal band of points.
Residual Plot Against x

[Residual plot (y - ŷ against x): Good Pattern, residuals form a horizontal band]

Residual Plot Against x

[Residual plot (y - ŷ against x): Nonconstant Variance]

Residual Plot Against x

[Residual plot (y - ŷ against x): Model Form Not Adequate]

Histogram
Histogram of the Residuals (response is m1)
[Histogram: frequency of residuals ranging from -0.2 to 0.4]

Normal Probability Plot


Normal Probability Plot

[Normal probability plot: expected probability against standardized residuals (SRES1)]
Average: -0.0021487   StDev: 1.00620   N: 144

W-test for Normality
R: 0.8938
P-Value (approx): < 0.0100

The null hypothesis is that the residuals are normal.

Residual Plot Against x

Residuals

Observation   Predicted Cars Sold   Residuals
1             15                    -1
2             25                    -1
3             20                    -2
4             15                     2
5             25                     2

Residual Plot Against x


TV Ads Residual Plot
[Plot: residuals against number of TV ads, forming a horizontal band]

Residual Analysis :
Outliers and Influential Observations
Outlier: a data point (observation) that does not fit the
trend shown by the remaining data.
Outliers represent observations that are suspect and
warrant careful examination. They may represent
erroneous data; if so, the data should be corrected.

Residual Analysis :
Outliers and Influential Observations
An example of an influential observation in simple
linear regression
The estimated regression line has a negative slope. If
the influential observation were dropped from the data
set, the slope would change from negative to positive
and the y-intercept would be smaller.

Residual Analysis :
Outliers and Influential Observations
Observations with extreme values for the independent
variables are called high leverage points.
The leverage of an observation is determined by how far
the values of the independent variables are from their
mean values.

Standardized Residuals

Standardized Residual for Observation i

(yi - ŷi) / s(yi - ŷi)

where:

s(yi - ŷi) = s √(1 - hi)

hi = 1/n + (xi - x̄)² / Σ(xi - x̄)²
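A sketch of these two formulas for the Reed Auto data; note the values follow the leverage-adjusted formula above, which scales differently from the simpler "standard residuals" in the spreadsheet output shown on a later slide.

```python
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_resid = [(yi - yh) / (s * (1 - hi) ** 0.5)
             for yi, yh, hi in zip(y, y_hat, leverage)]
# e.g. observation 1: h1 = 0.45, residual = -1,
# standardized residual = -1 / (2.16025 * sqrt(0.55)), about -0.62
```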

Standardized Residual Plot

The standardized residual plot can provide insight
about the assumption that the error term ε has a
normal distribution.
If this assumption is satisfied, the distribution of the
standardized residuals should appear to come from a
standard normal probability distribution.

Standardized Residual Plot

RESIDUAL OUTPUT

Observation   Predicted Y   Residuals   Standard Residuals
1             15            -1          -0.534522
2             25            -1          -0.534522
3             20            -2          -1.069045
4             15             2           1.069045
5             25             2           1.069045

[Plot: standard residuals (-1.5 to 1.5) against predicted cars sold]

Standardized Residual Plot

All of the standardized residuals are between
-1.5 and +1.5, indicating that there is no reason
to question the assumption that ε has a normal
distribution.

Outliers and Influential Observations

Detecting Outliers
An outlier is an observation that is unusual in
comparison with the other data.
Minitab classifies an observation as an outlier if
its standardized residual value is < -2 or > +2.
This standardized residual rule sometimes fails
to identify an unusually large observation as
being an outlier.
This rule's shortcoming can be circumvented
by using studentized deleted residuals.
The |ith studentized deleted residual| will be
larger than the |ith standardized residual|.

End of Chapter 14
