

Spring 2011

Chapter 14
Simple Linear Regression

Simple Linear Regression Model


Least Squares Method
Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation
for Estimation and Prediction
Computer Solution
Residual Analysis: Validating Model Assumptions

Simple Linear Regression Model


Regression terminology
dependent variable (y): the variable being predicted
independent variables (x): the variable or variables
being used to predict the value of the dependent
variable.
Example
In a study of the effect of advertising expenditures on sales, a
marketing manager's desire to predict sales would suggest making
sales the dependent variable. Advertising expenditure would be the
independent variable used to help predict sales.

Simple Linear Regression Model


Simple linear regression involves one independent variable and
one dependent variable. The relationship between the two
variables is approximated by a straight line.
Multiple regression analysis: regression analysis involving
two or more independent variables.

Simple Linear Regression Model


The equation that describes how y is related to x and
an error term is called the regression model.
The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.

Simple Linear Regression Equation


The simple linear regression equation is:

E(y) = β0 + β1x

The graph of the regression equation is a straight line.
β0 is the y-intercept of the regression line.
β1 is the slope of the regression line.
E(y) is the expected value of y for a given x value.

Simple Linear Regression Equation

Positive Linear Relationship


[Graph: regression line with intercept β0 and positive slope β1; E(y) plotted against x]

Simple Linear Regression Equation

Negative Linear Relationship


[Graph: regression line with intercept β0 and negative slope β1; E(y) plotted against x]

Simple Linear Regression Equation

No Relationship
[Graph: horizontal regression line with intercept β0 and slope β1 = 0; E(y) plotted against x]

Estimated Simple Linear Regression Equation


The estimated simple linear regression equation is:

ŷ = b0 + b1x

The graph is called the estimated regression line.
b0 is the y-intercept of the line.
b1 is the slope of the line.
ŷ is the estimated value of y for a given x value.

Estimation Process
Regression Model
y = β0 + β1x + ε
Regression Equation
E(y) = β0 + β1x
Unknown Parameters
β0, β1

Sample Data:
x    y
x1   y1
.    .
.    .
xn   yn

Estimated Regression Equation
ŷ = b0 + b1x
Sample Statistics
b0, b1

b0 and b1 provide estimates of β0 and β1

Least Squares Method


Least Squares Criterion

min Σ(yi - ŷi)²

where:
yi = observed value of the dependent variable
for the ith observation
ŷi = estimated value of the dependent variable
for the ith observation

Least Squares Method


Slope for the Estimated Regression Equation

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable
n = total number of observations

Least Squares Method

y-Intercept for the Estimated Regression Equation

b0 = ȳ - b1x̄

where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable
n = total number of observations
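As an illustration, here is a minimal Python sketch of these two formulas; the function name least_squares is ours, not from the slides.

```python
# Least squares estimates b0 and b1 for simple linear regression.
# A minimal sketch of the formulas above, not a library implementation.
def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n                     # mean of the independent variable
    y_bar = sum(y) / n                     # mean of the dependent variable
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx                         # slope
    b0 = y_bar - b1 * x_bar                # y-intercept
    return b0, b1
```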

Simple Linear Regression Equation


Example: Suppose data were collected from a sample of
10 Armand's Pizza Parlors restaurants located near
college campuses.
Data

Simple Linear Regression Equation


Scatter plot: for these data, the relationship between the
size of the student population and quarterly sales
appears to be approximated by a straight line.

Simple Linear Regression Equation


For the ith restaurant, the estimated regression
equation provides

ŷi = b0 + b1xi

where
ŷi = estimated value of quarterly sales ($1000s) for the ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant

Simple Linear Regression Equation


Calculations for the least squares estimated
regression equation for Armand's Pizza Parlors

Simple Linear Regression Equation


The estimated regression equation is
ŷ = 60 + 5x
Graph

Simple Linear Regression

Example: Reed Auto Sales


Reed Auto periodically has
a special week-long sale.
As part of the advertising
campaign Reed runs one or
more television commercials
during the weekend preceding the sale. Data from a
sample of 5 previous sales are shown on the next slide.

Simple Linear Regression

Example: Reed Auto Sales


Number of TV Ads (x)   Number of Cars Sold (y)
        1                       14
        3                       24
        2                       18
        1                       17
        3                       27
     Σx = 10                 Σy = 100
     x̄ = 2                   ȳ = 20

Estimated Regression Equation


Slope for the Estimated Regression Equation

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 20/4 = 5

y-Intercept for the Estimated Regression Equation

b0 = ȳ - b1x̄ = 20 - 5(2) = 10

Estimated Regression Equation

ŷ = 10 + 5x
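Plugging the Reed Auto data into the least_squares sketch above reproduces these values; np.polyfit is shown only as an independent check.

```python
import numpy as np

x = [1, 3, 2, 1, 3]       # number of TV ads
y = [14, 24, 18, 17, 27]  # number of cars sold

b0, b1 = least_squares(x, y)  # from the earlier sketch
print(b0, b1)                 # 10.0 5.0, i.e. y-hat = 10 + 5x

# Independent check: polyfit returns [slope, intercept] for degree 1.
print(np.polyfit(x, y, 1))    # [ 5. 10.]
```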

Scatter Diagram and Trend Line


[Scatter diagram: cars sold plotted against TV ads, with trend line y = 5x + 10]

Coefficient of Determination
Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Coefficient of Determination

The coefficient of determination is:


r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares

Coefficient of Determination
r² = SSR/SST = 100/114 = .8772
The regression relationship is very strong; about 88%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold.
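Continuing the sketch, the sums of squares and r² for the Reed Auto data, along with the sample correlation coefficient covered on the following slides:

```python
import math

y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 14
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # 100
sst = sum((yi - y_bar) ** 2 for yi in y)               # 114 = SSR + SSE

r2 = ssr / sst                                         # 0.8772
r_xy = math.copysign(math.sqrt(r2), b1)                # +0.9366, sign of b1
```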

Coefficient of Determination
Calculation of SSE for Armand's Pizza Parlors

Coefficient of Determination
Calculation of SST for Armand's Pizza Parlors

Coefficient of Determination
Deviations about the Estimated Regression Line and
the Line y = ȳ for Armand's Pizza Parlors

Coefficient of Determination
For the Armand's Pizza Parlors example, the value of
the coefficient of determination is
r² = SSR/SST = 14,200/15,730 = .9027
When we express the coefficient of determination as a
percentage, r² can be interpreted as the percentage of
the total sum of squares that can be explained by the
estimated regression equation.
90.27% of the variability in sales can be explained by
the linear relationship between the size of the student
population and sales.

Sample Correlation Coefficient


rxy = (sign of b1)√(Coefficient of Determination)

rxy = (sign of b1)√r²

where:
b1 = the slope of the estimated regression
equation ŷ = b0 + b1x

Sample Correlation Coefficient


rxy = (sign of b1)√r²
The sign of b1 in the equation ŷ = 10 + 5x is +.
rxy = +√.8772
rxy = +.9366

Assumptions About the Error Term


1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for
all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.

Assumptions About the Error Term


Implications
1. β0 and β1 are constants, and E(y) = β0 + β1x.
2. The variance of y about the regression line equals
σ² and is the same for all values of x.
3. The value of ε for a particular value of x is
not related to the value of ε for any other
value of x; thus, the value of y for a particular value
of x is not related to the value of y for any other value
of x.
4. Because y is a linear function of ε, y is also a normally
distributed random variable.

Assumptions About the Error Term


Assumptions for the Regression Model

Testing for Significance


To test for a significant regression relationship, we
must conduct a hypothesis test to determine whether
the value of β1 is zero.
Two tests are commonly used:
t Test and F Test
Both the t test and F test require an estimate of σ²,
the variance of ε in the regression model.

Testing for Significance


An Estimate of σ²
The mean square error (MSE) provides the estimate
of σ², and the notation s² is also used.

s² = MSE = SSE/(n - 2)

where:

SSE = Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)²

Testing for Significance


An Estimate of σ
To estimate σ we take the square root of s².
The resulting s is called the standard error of
the estimate.

s = √MSE = √(SSE/(n - 2))
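A two-line continuation of the sketch computes s² and s for the Reed Auto data:

```python
n = len(x)
mse = sse / (n - 2)   # s^2 = 14/3 = 4.667 (mean square error)
s = mse ** 0.5        # s = 2.16025 (standard error of the estimate)
```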

Sampling Distribution of b1
Expected Value

E(b1) = β1

Standard Deviation

σb1 = σ / √Σ(xi - x̄)²

Distribution Form: Normal

Estimated Standard Deviation of b1

sb1 = s / √Σ(xi - x̄)²

Testing for Significance: t Test


Hypotheses
H0: β1 = 0
Ha: β1 ≠ 0

Test Statistic

t = b1 / sb1

where

sb1 = s / √Σ(xi - x̄)²

Testing for Significance: t Test

Rejection Rule
Reject H0 if p-value < α
or t < -tα/2 or t > tα/2
where:
tα/2 is based on a t distribution
with n - 2 degrees of freedom

Testing for Significance: t Test


1. Determine the hypotheses.
H0: β1 = 0
Ha: β1 ≠ 0

2. Specify the level of significance.
α = .05

3. Select the test statistic.
t = b1 / sb1

4. State the rejection rule.
Reject H0 if p-value < .05
or |t| > 3.182 (with 3 degrees of freedom)

Testing for Significance: t Test


5. Compute the value of the test statistic.

t = b1 / sb1 = 5 / 1.08 = 4.63

6. Determine whether to reject H0.
t = 4.541 provides an area of .01 in the upper
tail. Hence, the p-value is less than .02. (Also,
t = 4.63 > 3.182.) We can reject H0.
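The same t test, sketched in Python; scipy.stats is used only to read the p-value off the t distribution.

```python
from scipy import stats

sxx = sum((xi - sum(x) / n) ** 2 for xi in x)   # 4
s_b1 = s / sxx ** 0.5                           # 2.16025/2 = 1.08
t_stat = b1 / s_b1                              # 4.63
# Two-tailed p-value with n - 2 = 3 degrees of freedom (about .019).
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```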

Testing for Significance: t Test


Summary

Confidence Interval for 1


We can use a 95% confidence interval for β1 to test
the hypotheses just used in the t test.
H0 is rejected if the hypothesized value of β1 is not
included in the confidence interval for β1.

Confidence Interval for 1


The form of a confidence interval for β1 is:

b1 ± tα/2 sb1

where b1 is the point estimator and tα/2 sb1 is the
margin of error;
tα/2 is the t value providing an area of α/2 in the
upper tail of a t distribution with n - 2 degrees
of freedom.

Confidence Interval for 1


Rejection Rule
Reject H0 if 0 is not included in
the confidence interval for β1.

95% Confidence Interval for β1

b1 ± tα/2 sb1 = 5 ± 3.182(1.08) = 5 ± 3.44

or 1.56 to 8.44

Conclusion
0 is not included in the confidence interval.
Reject H0.
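The interval can be computed directly from the quantities in the sketch above:

```python
from scipy import stats

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # 3.182 for 3 d.f.
margin = t_crit * s_b1                         # 3.182 * 1.08 = 3.44
ci_low, ci_high = b1 - margin, b1 + margin     # (1.56, 8.44)
# 0 is outside the interval, so H0: beta_1 = 0 is rejected.
```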

Testing for Significance: F Test


Hypotheses

H0: β1 = 0
Ha: β1 ≠ 0

Test Statistic
F = MSR/MSE

Testing for Significance: F Test

Rejection Rule
Reject H0 if
p-value < α
or F > Fα
where:
Fα is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator

Testing for Significance: F Test


1. Determine the hypotheses.
H0: β1 = 0
Ha: β1 ≠ 0

2. Specify the level of significance.
α = .05

3. Select the test statistic.
F = MSR/MSE

4. State the rejection rule.
Reject H0 if p-value < .05
or F > 10.13 (with 1 d.f. in numerator and
3 d.f. in denominator)

Testing for Significance: F Test


5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
F = 17.44 provides an area of .025 in the upper
tail. Thus, the p-value corresponding to F = 21.43
is less than .025. Hence, we reject H0.
The statistical evidence is sufficient to conclude
that we have a significant relationship between the
number of TV ads aired and the number of cars sold.
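The F statistic follows from the sums of squares already computed; scipy supplies the upper-tail area.

```python
from scipy import stats

msr = ssr / 1                              # 100 (1 d.f. in the numerator)
f_stat = msr / mse                         # 100/4.667 = 21.43
# Upper-tail p-value with (1, n - 2) degrees of freedom (about .019).
p_value = stats.f.sf(f_stat, dfn=1, dfd=n - 2)
```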

Testing for Significance: F Test


F test Summary

Testing for Significance: F Test


General Form of the ANOVA Table for Simple Linear
Regression

Some Cautions about the Interpretation of Significance Tests

Rejecting H0: β1 = 0 and concluding that the
relationship between x and y is significant does
not enable us to conclude that a cause-and-effect
relationship is present between x and y.
Just because we are able to reject H0: β1 = 0 and
demonstrate statistical significance does not enable
us to conclude that there is a linear relationship
between x and y.

Some Cautions about the Interpretation of Significance Tests

Example of Linear Approximation of a Nonlinear
Relationship

Using the Estimated Regression Equation
for Estimation and Prediction

Confidence Interval Estimate of E(yp)

ŷp ± tα/2 sŷp

Prediction Interval Estimate of yp

ŷp ± tα/2 sind

where:
the confidence coefficient is 1 - α and
tα/2 is based on a t distribution
with n - 2 degrees of freedom

Point Estimation
If 3 TV ads are run prior to a sale, we expect
the mean number of cars sold to be:
ŷ = 10 + 5(3) = 25 cars

Confidence Interval for E(yp)


Estimate of the Standard Deviation of ŷp

sŷp = s √(1/n + (xp - x̄)² / Σ(xi - x̄)²)

sŷp = 2.16025 √(1/5 + (3 - 2)² / ((1-2)² + (3-2)² + (2-2)² + (1-2)² + (3-2)²))

sŷp = 2.16025 √(1/5 + 1/4) = 1.4491

Confidence Interval for E(yp)


The 95% confidence interval estimate of the mean
number of cars sold when 3 TV ads are run is:

ŷp ± tα/2 sŷp
25 ± 3.1824(1.4491)
25 ± 4.61 = 20.39 to 29.61 cars

Prediction Interval for yp

Estimate of the Standard Deviation of an
Individual Value of yp

sind = s √(1 + 1/n + (xp - x̄)² / Σ(xi - x̄)²)

sind = 2.16025 √(1 + 1/5 + 1/4)
sind = 2.16025(1.20416) = 2.6013

Prediction Interval for yp


The 95% prediction interval estimate of the
number of cars sold in one particular week when
3 TV ads are run is:

ŷp ± tα/2 sind
25 ± 3.1824(2.6013)
25 ± 8.28 = 16.72 to 33.28 cars
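Both intervals at xp = 3, continuing the sketch (variable names are ours):

```python
x_p = 3
x_bar = sum(x) / n
y_p = b0 + b1 * x_p                       # 25

h = 1 / n + (x_p - x_bar) ** 2 / sxx      # 1/5 + 1/4
s_yp = s * h ** 0.5                       # 1.4491: std. dev. of the mean response
s_ind = s * (1 + h) ** 0.5                # 2.6013: std. dev. of an individual value

ci = (y_p - t_crit * s_yp, y_p + t_crit * s_yp)    # (20.39, 29.61)
pi = (y_p - t_crit * s_ind, y_p + t_crit * s_ind)  # (16.72, 33.28)
```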

Computer Solution
Performing the regression analysis computations
without the help of a computer can be quite
time-consuming.
On the next slide we show Minitab output for the
Reed Auto Sales example.
Recall that the independent variable was named
Ads and the dependent variable was named Cars
in the example.

Computer Solution
The regression equation is
Cars = 10 + 5.00 Ads

Predictor   Coef     SE Coef   T      P
Constant    10.000   2.366     4.23   0.024
Ads         5.0000   1.0801    4.63   0.019

S = 2.2   R-sq = 87.7%   R-sq(adj) = 83.6%

Analysis of Variance

SOURCE           DF   SS    MS      F       P
Regression        1   100   100     21.43   0.019
Residual Error    3    14   4.667
Total             4   114

Predicted Values for New Observations

New Obs   Fit     SE Fit   95% C.I.         95% P.I.
1         25.00   2.60     (20.39, 29.61)   (16.72, 33.28)
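For comparison, a sketch of the same fit with Python's statsmodels package, which prints an output table similar in spirit to Minitab's (assuming statsmodels is installed):

```python
import statsmodels.api as sm

ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]

X = sm.add_constant(ads)       # adds the intercept column
model = sm.OLS(cars, X).fit()
print(model.summary())         # coefficients, t and F tests, R-sq, ANOVA
```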

Residual Analysis
If the assumptions about the error term ε appear
questionable, the hypothesis tests about the
significance of the regression relationship and the
interval estimation results may not be valid.
The residuals provide the best information about ε.
Residual for Observation i

yi - ŷi

Much of the residual analysis is based on an
examination of graphical plots.

Residual Analysis
Assumptions about the error term ε
1. E(ε) = 0.
2. The variance of ε, denoted by σ², is
the same for all values of x.
3. The values of ε are independent.
4. The error term ε has a normal distribution.

Residual Plot Against x

If the assumption that the variance of ε is the same for
all values of x is valid, and the assumed regression
model is an adequate representation of the relationship
between the variables, then the residual plot should
give an overall impression of a horizontal band of points.
Residual Plot Against x

[Residual plot (y - ŷ against x): Good Pattern, residuals form a horizontal band]

Residual Plot Against x

[Residual plot (y - ŷ against x): Nonconstant Variance]

Residual Plot Against x

[Residual plot (y - ŷ against x): Model Form Not Adequate]

Histogram
Histogram of the Residuals (response is m1)
[Histogram: frequency of residuals ranging from -0.2 to 0.4]

Normal Probability Plot


Normal Probability Plot

[Normal probability plot: expected probability against standardized residuals (SRES1)]
Average: -0.0021487   StDev: 1.00620   N: 144

W-test for Normality
R: 0.8938
P-Value (approx): < 0.0100

The null hypothesis is that the residuals are normal.

Residual Plot Against x

Residuals

Observation   Predicted Cars Sold   Residuals
1             15                    -1
2             25                    -1
3             20                    -2
4             15                     2
5             25                     2

Residual Plot Against x


TV Ads Residual Plot
[Plot: residuals against number of TV ads, forming a horizontal band]

Residual Analysis :
Outliers and Influential Observations
Outlier: a data point (observation) that does not fit the
trend shown by the remaining data.
Outliers represent observations that are suspect and
warrant careful examination. They may represent
erroneous data; if so, the data should be corrected.

Residual Analysis :
Outliers and Influential Observations
An example of an influential observation in simple
linear regression
The estimated regression line has a negative slope. If
the influential observation were dropped from the data
set, the slope would change from negative to positive
and the y-intercept would be smaller.

Residual Analysis :
Outliers and Influential Observations
Observations with extreme values for the independent
variables are called high leverage points.
The leverage of an observation is determined by how far
the values of the independent variables are from their
mean values.

Standardized Residuals

Standardized Residual for Observation i

(yi - ŷi) / s(yi - ŷi)

where:

s(yi - ŷi) = s √(1 - hi)

hi = 1/n + (xi - x̄)² / Σ(xi - x̄)²
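A sketch of these two formulas for the Reed Auto data; note the values follow the leverage-adjusted formula above, which scales differently from the simpler "standard residuals" in the spreadsheet output shown on a later slide.

```python
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_resid = [(yi - yh) / (s * (1 - hi) ** 0.5)
             for yi, yh, hi in zip(y, y_hat, leverage)]
# e.g. observation 1: h1 = 0.45, residual = -1,
# standardized residual = -1 / (2.16025 * sqrt(0.55)), about -0.62
```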

Standardized Residual Plot

The standardized residual plot can provide insight
about the assumption that the error term ε has a
normal distribution.
If this assumption is satisfied, the distribution of the
standardized residuals should appear to come from a
standard normal probability distribution.

Standardized Residual Plot

RESIDUAL OUTPUT

Observation   Predicted Y   Residuals   Standard Residuals
1             15            -1          -0.534522
2             25            -1          -0.534522
3             20            -2          -1.069045
4             15             2           1.069045
5             25             2           1.069045

[Plot: standard residuals (-1.5 to 1.5) against predicted cars sold]

Standardized Residual Plot

All of the standardized residuals are between
-1.5 and +1.5, indicating that there is no reason
to question the assumption that ε has a normal
distribution.

Outliers and Influential Observations

Detecting Outliers
An outlier is an observation that is unusual in
comparison with the other data.
Minitab classifies an observation as an outlier if
its standardized residual value is < -2 or > +2.
This standardized residual rule sometimes fails
to identify an unusually large observation as
being an outlier.
This rule's shortcoming can be circumvented
by using studentized deleted residuals.
The |ith studentized deleted residual| will be
larger than the |ith standardized residual|.

End of Chapter 14
