
Week 6

Simple Linear Regression

Models

Representation of some phenomenon


Mathematical model is a mathematical
expression of some phenomenon
Often describe relationships between
variables
Types

Deterministic models
Probabilistic models

Deterministic Models

Hypothesize exact relationships


Suitable when prediction error is negligible
Example: force is exactly mass times
acceleration

F = ma

1984-1994 T/Maker Co.

Probabilistic Models

Hypothesize two components


Deterministic
Random error

Example: sales volume (y) is 10 times advertising spending (x) plus random error
\(y = 10x + \varepsilon\)
Random error may be due to factors
other than advertising

General Form of Probabilistic Models
y = Deterministic component + Random error
where y is the variable of interest.
We always assume that the mean value of the random error equals 0, so that

E(y) = Deterministic component
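
A small simulation can make the zero-mean assumption concrete. The sketch below reuses the sales example, y = 10x plus random error. The normal error distribution, its standard deviation of 1, the fixed x = 3, the sample size, and the function name `simulate_sales` are all assumptions chosen only for illustration.

```python
import random


def simulate_sales(x, n, sigma=1.0, seed=0):
    """Draw n values of y = 10x + error, with error ~ Normal(0, sigma).

    The normal error distribution and sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [10 * x + rng.gauss(0.0, sigma) for _ in range(n)]


x = 3
ys = simulate_sales(x, n=100_000)
mean_y = sum(ys) / len(ys)

# Because the errors average out to about 0, the sample mean of y
# sits close to the deterministic component 10 * x.
print(f"deterministic component: {10 * x}")
print(f"sample mean of y:        {mean_y:.3f}")
```

Any single y can miss 10x by a noticeable amount, but the average of many y values at the same x settles near the deterministic component.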

A First-Order (Straight Line) Probabilistic Model
\(y = \beta_0 + \beta_1 x + \varepsilon\)
where
y = Dependent or response variable (variable to be modeled)
x = Independent or predictor variable (variable used as a predictor of y)
\(E(y) = \beta_0 + \beta_1 x\) = Deterministic component
\(\varepsilon\) (epsilon) = Random error component

A First-Order (Straight Line) Probabilistic Model
\(y = \beta_0 + \beta_1 x + \varepsilon\)
\(\beta_0\) (beta zero) = y-intercept of the line, that is, the point at which the line intercepts or cuts through the y-axis
\(\beta_1\) (beta one) = slope of the line, that is, the change (amount of increase or decrease) in the deterministic component of y for every 1-unit increase in x

A First-Order (Straight Line) Probabilistic Model
A positive slope implies that E(y) increases by the amount \(\beta_1\) for each unit increase in x.
A negative slope implies that E(y) decreases by the amount \(|\beta_1|\) for each unit increase in x.

Five-Step Procedure
Step 1: Hypothesize the deterministic component of the model that relates the mean, E(y), to the independent variable x.
Step 2: Use the sample data to estimate unknown parameters in the model.
Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.
Step 4: Statistically evaluate the usefulness of the model.
Step 5: When satisfied that the model is useful, use it for prediction, estimation, and other purposes.

Scattergram
1. Plot of all (xi, yi) pairs
2. Suggests how well model will fit

[Figure: scattergram of y against x; the axis ticks run from 0 to 60]

Thinking Challenge
How would you draw a line through the points?
How do you determine which line fits best?

[Figure: scattergram of y against x; the axis ticks run from 0 to 60]

Least Squares Line
The least squares line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) is one that has the following two properties:
1. The sum of the errors equals 0, i.e., mean error = 0.
2. The sum of squared errors (SSE) is smaller than for any other straight-line model, i.e., the error variance is minimum.
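
For reference, the estimates that satisfy these two properties have a standard closed form. The formulas below are the usual textbook expressions rather than something recovered from the slide, and the symbols \(SS_{xy}\) and \(SS_{xx}\) are notation introduced here.

```latex
\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\qquad \text{where} \quad
SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
```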

Interpreting the Estimates of \(\beta_0\) and \(\beta_1\) in Simple Linear Regression
y-intercept: \(\hat{\beta}_0\) represents the predicted value of y when x = 0 (Caution: This value will not be meaningful if the value x = 0 is nonsensical or outside the range of the sample data.)
slope: \(\hat{\beta}_1\) represents the increase (or decrease) in y for every 1-unit increase in x (Caution: This interpretation is valid only for x-values within the range of the sample data.)

[Figure: Minitab screenshot labeling the dependent and independent variables]

Least Squares Example
You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)   Yield (lb.)
       4               3.0
       6               5.5
      10               6.5
      12               9.0

Find the least squares line relating crop yield and fertilizer.


Scattergram
Crop Yield vs. Fertilizer*
Stat -> Regression -> Fitted Line Plot

Coefficient Interpretation Solution
\(\hat{y} = .8 + .65x\)

1. Slope (\(\hat{\beta}_1\))
Crop Yield (y) is expected to increase by .65 lb. for each 1 lb. increase in Fertilizer (x)

2. y-Intercept (\(\hat{\beta}_0\))
Since 0 is outside of the range of the sampled values of x, the y-intercept has no meaningful interpretation.
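
The fitted line for the crop yield data can be checked by hand or with a few lines of code. The following is a minimal sketch in plain Python, not the Minitab procedure used in the slides; the helper name `least_squares_line` is hypothetical. It also confirms the two least squares properties: the errors sum to 0, and SSE is the quantity being minimized.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


fertilizer = [4, 6, 10, 12]         # lb., from the example data
crop_yield = [3.0, 5.5, 6.5, 9.0]   # lb., from the example data

b0, b1 = least_squares_line(fertilizer, crop_yield)

# Errors (residuals) are observed y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(fertilizer, crop_yield)]
sse = sum(e * e for e in residuals)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"sum of errors = {sum(residuals):.6f}")
print(f"SSE = {sse:.3f}")
```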

Basic Assumptions of the Probability Distribution

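A Test of Model Utility: Simple Linear Regression
One-Tailed Test
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 < 0\) (or Ha: \(\beta_1 > 0\))
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}\) and \(SS_{xx} = \sum (x_i - \bar{x})^2\)
Rejection region: \(t < -t_\alpha\) (or \(t > t_\alpha\) when Ha: \(\beta_1 > 0\))
where \(t_\alpha\) is based on \((n - 2)\) degrees of freedom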

A Test of Model Utility: Simple Linear Regression
Two-Tailed Test
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \neq 0\)
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\)
Rejection region: \(|t| > t_{\alpha/2}\)
where \(t_{\alpha/2}\) is based on \((n - 2)\) degrees of freedom

Interpreting p-Values for Coefficients in Regression
Almost all statistical computer software packages report a two-tailed p-value for each of the parameters in the regression model. For example, in simple linear regression, the p-value for the two-tailed test H0: \(\beta_1 = 0\) versus Ha: \(\beta_1 \neq 0\) is given on the printout. If you want to conduct a one-tailed test of hypothesis, you will need to adjust the p-value accordingly: halve it when the sign of the estimate agrees with Ha, and use 1 minus that half otherwise.
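
The adjustment can be written as a small helper. This is a sketch of the usual rule for a symmetric t statistic; the function name `one_tailed_p` and its argument names are hypothetical. The inputs are the t and P values that appear on the Hasbro Toys printout later in this deck.

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a reported two-tailed p-value to a one-tailed p-value.

    alternative is "greater" (Ha: beta1 > 0) or "less" (Ha: beta1 < 0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the sign of the observed t statistic point the same way as Ha?
    agrees = (t_stat > 0) if alternative == "greater" else (t_stat < 0)
    half = two_tailed_p / 2
    return half if agrees else 1 - half


# t and two-tailed P for the slope on the Hasbro Toys printout.
p_greater = one_tailed_p(0.0353528, 3.65563, "greater")
p_less = one_tailed_p(0.0353528, 3.65563, "less")
print(f"Ha: beta1 > 0 -> p = {p_greater:.7f}")
print(f"Ha: beta1 < 0 -> p = {p_less:.7f}")
```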

Test of Slope Coefficient Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.

Ad Expenditure (100$)   Sales (Units)
          1                   1
          2                   1
          3                   2
          4                   2
          5                   4

Is the relationship significant at the .05 level of significance?

Test of Slope Coefficient Solution
H0: \(\beta_1 = 0\)
If the slope (\(\beta_1\)) is zero, then there is no linear relationship.
Ha: \(\beta_1 \neq 0\)
This is the claim: there is a relationship because the slope is not zero.

Test of Slope Coefficient Solution
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): -3.182 and 3.182 (reject H0 in either tail; area .025 in each tail)

Calc -> Probability Distributions -> t

Inverse Cumulative Distribution Function
Student's t distribution with 3 DF

P( X <= x )         x
      0.025  -3.18245
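
The Minitab lookup can be reproduced without a statistics package because the t distribution with exactly 3 degrees of freedom has a closed-form CDF. The sketch below inverts that CDF by bisection. The function names are hypothetical, and the closed form is valid only for 3 degrees of freedom, so this is a check on this one example rather than a general routine.

```python
import math


def t3_cdf(t):
    """CDF of Student's t distribution with exactly 3 degrees of freedom.

    This closed form is specific to df = 3; other df need a different formula.
    """
    u = t / math.sqrt(3)
    return 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi


def t3_inverse_cdf(p, lo=-100.0, hi=100.0, tol=1e-10):
    """Invert t3_cdf by bisection (the CDF is strictly increasing)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t3_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = 0.05
lower_critical = t3_inverse_cdf(alpha / 2)       # area .025 to the left
upper_critical = t3_inverse_cdf(1 - alpha / 2)   # area .025 to the right
print(f"critical values: {lower_critical:.5f}, {upper_critical:.5f}")
```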

Test of Slope Coefficient Computer Output

Stat -> Regression -> General Regression

General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)

Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

Coefficients
Term                     Coef   SE Coef         T      P
Constant                 -0.1  0.635085  -0.15746  0.885
Ad Expenditure (100$)     0.7  0.191485   3.65563  0.035

Summary of Model
S = 0.605530     R-Sq = 81.67%        R-Sq(adj) = 75.56%
PRESS = 4.43367  R-Sq(pred) = 26.11%

Analysis of Variance
Source                   DF  Seq SS  Adj SS   Adj MS        F          P
Regression                1     4.9     4.9  4.90000  13.3636  0.0353528
  Ad Expenditure (100$)   1     4.9     4.9  4.90000  13.3636  0.0353528

The P column reports the p-value.

Test of Slope Coefficient Solution
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): -3.182 and 3.182 (reject H0 in either tail; area .025 in each tail)

Test Statistic: \(t \approx 3.656\)
P-Value = 0.035

Decision: Reject H0 at \(\alpha = .05\) because t > 3.182, or equivalently because the p-value is smaller than \(\alpha\).

Conclusion: There is evidence of a relationship
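
The test statistic on the printout can be rebuilt from the five data points. The sketch below is plain Python rather than Minitab, and the variable names are hypothetical. It takes the critical value 3.182 from the slides instead of computing it.

```python
import math

ad_spend = [1, 2, 3, 4, 5]   # Ad Expenditure (100$), from the example
sales = [1, 1, 2, 2, 4]      # Sales (Units), from the example

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))

# Least squares estimates of slope and intercept.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# s estimates the standard deviation of the random error (n - 2 df).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta1 = 0.
se_b1 = s / math.sqrt(ss_xx)
t_stat = b1 / se_b1

T_CRITICAL = 3.182  # from the slides: alpha = .05, two-tailed, 3 df
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"s = {s:.6f}, SE(b1) = {se_b1:.6f}, t = {t_stat:.5f}")
print("reject H0" if reject_h0 else "do not reject H0")
```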

Correlation Models

Answers "How strong is the linear relationship between two variables?"
Coefficient of correlation

Sample correlation coefficient denoted r

Population correlation coefficient denoted \(\rho\) (rho)

Values range from -1 to +1

Coefficient of Correlation
Example
You're a marketing analyst for Hasbro Toys.

Ad Expenditure (100$)   Sales (Units)
          1                   1
          2                   1
          3                   2
          4                   2
          5                   4

Calculate the coefficient of correlation.

Coefficient of Correlation
Solution
Stat -> Regression -> Fitted Line Plot

\(r = \sqrt{r^2} = \sqrt{.8167} \approx 0.904\) (positive because the fitted slope is positive)

\(r \approx 0.904\): strong positive relation between x and y

Coefficient of Correlation
Example
You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)   Yield (lb.)
       4               3.0
       6               5.5
      10               6.5
      12               9.0

Find the coefficient of correlation.


Coefficient of Correlation
Solution
Stat -> Regression -> Fitted Line Plot

\(r \approx 0.956\)
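
Both correlation results can be checked directly from the data. The sketch below uses the standard definition of the sample correlation, \(r = SS_{xy} / \sqrt{SS_{xx}\,SS_{yy}}\), which is not recovered from the slides; the function name `correlation` is hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample coefficient of correlation r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hasbro Toys example: ad expenditure (100$) versus sales (units).
r_toys = correlation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])

# County cooperative example: fertilizer (lb.) versus yield (lb.).
r_crop = correlation([4, 6, 10, 12], [3.0, 5.5, 6.5, 9.0])

print(f"r (toys) = {r_toys:.3f}")
print(f"r (crop) = {r_crop:.3f}")
```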

Coefficient of Determination
It represents the proportion of the total sample variability around \(\bar{y}\) that is explained by the linear relationship between y and x.

\(r^2\) = (coefficient of correlation)\(^2\)

\(0 \le r^2 \le 1\)

Coefficient of Determination Example
You're a marketing analyst for Hasbro Toys.
You know r = .904.

Ad Expenditure (100$)   Sales (Units)
          1                   1
          2                   1
          3                   2
          4                   2
          5                   4

Calculate and interpret the coefficient of determination.

Coefficient of Determination Solution
\(r^2\) = (coefficient of correlation)\(^2\)
\(r^2 = (.904)^2\)
\(r^2 = .817\)
Interpretation: About 81.7% of the sample variation in Sales (y) can be explained by using Ad $ (x) to predict Sales (y) in the linear model. The remaining 18.3% is due to other factors.
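
The "proportion of variation explained" reading can be made explicit by computing the same number from sums of squares. This sketch assumes the standard identity \(r^2 = 1 - SSE / SS_{yy}\), which is not recovered from the slides; the variable names are hypothetical.

```python
ad_spend = [1, 2, 3, 4, 5]   # Ad Expenditure (100$), from the example
sales = [1, 1, 2, 2, 4]      # Sales (Units), from the example

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Total variation of y around its mean.
ss_yy = sum((y - y_bar) ** 2 for y in sales)
# Variation left unexplained by the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))

r_squared = 1 - sse / ss_yy
print(f"SSyy = {ss_yy:.1f}, SSE = {sse:.1f}, r^2 = {r_squared:.4f}")
```

The two sums of squares correspond to the Total and Error rows of the Analysis of Variance table on the Minitab printout.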

Regression Modeling Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term; estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation

Prediction With Regression Models

Types of predictions
Point estimates
Interval estimates

What is predicted
Population mean value of y, E(y), for given x (confidence interval)
Individual response (\(y_i\)) for given x (prediction interval)

Confidence Interval Estimate Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.

Ad Expenditure (100$)   Sales (Units)
          1                   1
          2                   1
          3                   2
          4                   2
          5                   4

Find a 95% confidence interval for the mean sales when advertising is $400 (x = 4).

Confidence Interval Solution

Interval Estimate Computer Output
General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)

Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

Coefficients
Term                     Coef   SE Coef         T      P
Constant                 -0.1  0.635085  -0.15746  0.885
Ad Expenditure (100$)     0.7  0.191485   3.65563  0.035

Summary of Model
S = 0.605530     R-Sq = 81.67%        R-Sq(adj) = 75.56%
PRESS = 4.43367  R-Sq(pred) = 26.11%

Analysis of Variance
Source                   DF  Seq SS  Adj SS   Adj MS        F          P
Regression                1     4.9     4.9  4.90000  13.3636  0.0353528
  Ad Expenditure (100$)   1     4.9     4.9  4.90000  13.3636  0.0353528
Error                     3     1.1     1.1  0.36667
Total                     4     6.0

Fits and Diagnostics for Unusual Observations
No unusual observations

Predicted Values for New Observations
New Obs  Fit    SE Fit              95% CI               95% PI
      1  2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations
New Obs  Ad Expenditure (100$)
      1                      4

Interval Estimate Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations

Predicted Values for New Observations
New Obs  Fit    SE Fit              95% CI               95% PI
      1  2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations
New Obs  Ad Expenditure (100$)
      1                      4

Callouts: Fit is the predicted y when x = 4; SE Fit is \(s_{\hat{y}}\); 95% CI is the confidence interval.
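
The Fit, SE Fit, and 95% CI entries can be reproduced from the raw data. The sketch below uses the usual formula for the standard error of the estimated mean response, which is not recovered from the slides, and takes the t value 3.18245 from the inverse CDF output shown earlier. The function name `mean_response_ci` is hypothetical.

```python
import math

ad_spend = [1, 2, 3, 4, 5]   # Ad Expenditure (100$), from the example
sales = [1, 1, 2, 2, 4]      # Sales (Units), from the example

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad_spend)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad_spend, sales))
s = math.sqrt(sse / (n - 2))

T_CRITICAL = 3.18245  # t with 3 df and area .025 in the upper tail


def mean_response_ci(x_new):
    """Return (fit, se_fit, (low, high)) for the mean of y at x_new."""
    fit = b0 + b1 * x_new
    se_fit = s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / ss_xx)
    margin = T_CRITICAL * se_fit
    return fit, se_fit, (fit - margin, fit + margin)


fit, se_fit, (ci_low, ci_high) = mean_response_ci(4)
print(f"Fit = {fit:.1f}, SE Fit = {se_fit:.6f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```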

Prediction Interval Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.

Ad Expenditure (100$)   Sales (Units)
          1                   1
          2                   1
          3                   2
          4                   2
          5                   4

Predict the sales when advertising is $400 (x = 4). Use a 95% prediction interval.

Prediction Interval Solution

Interval Estimate Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations

Predicted Values for New Observations
New Obs  Fit    SE Fit              95% CI               95% PI
      1  2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)

Values of Predictors for New Observations
New Obs  Ad Expenditure (100$)
      1                      4

Callouts: Fit is the predicted y when x = 4; SE Fit is \(s_{\hat{y}}\); 95% PI is the prediction interval.
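
The prediction interval can be recovered from three numbers already on the printout: Fit, SE Fit, and S. The sketch below assumes the standard relation that the variance of a single predicted response is the error variance plus the variance of the fitted mean. That relation is not stated on the slides, and the variable names are hypothetical.

```python
import math

# Values read from the Minitab printout.
fit = 2.7           # predicted y when x = 4
se_fit = 0.331662   # standard error of the fitted mean (SE Fit)
s = 0.605530        # estimated standard deviation of the random error (S)

T_CRITICAL = 3.18245  # t with 3 df and area .025 in the upper tail

# A single response varies around the mean with variance s^2, on top of
# the uncertainty se_fit^2 in the estimated mean itself.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)

margin = T_CRITICAL * se_pred
pi_low, pi_high = fit - margin, fit + margin
print(f"95% PI = ({pi_low:.4f}, {pi_high:.4f})")
```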

Confidence Intervals v.
Prediction Intervals
The prediction interval is always wider than
the corresponding confidence interval
Added uncertainty involved in predicting a single
response versus the mean response
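
Written side by side, the two interval formulas differ only by an extra 1 under the square root, which is the added uncertainty of a single response. These are the usual textbook formulas rather than expressions recovered from the slides; \(x_p\) (the x value at which the interval is wanted) and \(SS_{xx}\) are notation introduced here.

```latex
\text{CI for } E(y): \quad
\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
\qquad\qquad
\text{PI for } y: \quad
\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
```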

Confidence Intervals v. Prediction Intervals
[Figure: plot of y against x]

Example
Suppose a fire insurance company wants to relate
the amount of fire damage in major residential
fires to the distance between the burning house
and the nearest fire station. The study is to be
conducted in a large suburb of a major city; a
sample of 15 recent fires in this suburb is
selected. The amount of damage, y, and the
distance between the fire and the nearest fire
station, x, are recorded for each fire.

Example
Step 1: First, we hypothesize a model to relate fire damage, y, to the distance from the nearest fire station, x. We hypothesize a straight-line probabilistic model:
\(y = \beta_0 + \beta_1 x + \varepsilon\)

Example
Step 2: Use a statistical software package to estimate the unknown parameters in the deterministic component of the hypothesized model. The least squares estimates of the slope \(\beta_1\) and intercept \(\beta_0\), highlighted on the printout, are
\(\hat{\beta}_1 = 4.91933\) and \(\hat{\beta}_0 = 10.2779\)

Example
General Regression Analysis: DAMAGE versus DISTANCE

Regression Equation
DAMAGE = 10.2779 + 4.91933 DISTANCE     (least squares equation)

Coefficients
Term         Coef  SE Coef        T      P
Constant  10.2779  1.42028   7.2366  0.000
DISTANCE   4.9193  0.39275  12.5254  0.000

Summary of Model
S = 2.31635      R-Sq = 92.35%        R-Sq(adj) = 91.76%
PRESS = 93.2117  R-Sq(pred) = 89.77%

Analysis of Variance
Source      DF   Seq SS   Adj SS   Adj MS        F          P
Regression   1  841.766  841.766  841.766  156.886  0.0000000
  DISTANCE   1  841.766  841.766  841.766  156.886  0.0000000
Error       13   69.751   69.751    5.365
Total       14  911.517

Fits and Diagnostics for Unusual Observations
No unusual observations

Example
This prediction equation is graphed in the
Minitab Fitted Line Plot.

Example
The least squares estimate of the slope, \(\hat{\beta}_1 = 4.919\), implies that the estimated mean damage increases by $4,919 for each additional mile from the fire station. This interpretation is valid over the range of x, or from .7 to 6.1 miles from the station. The estimated y-intercept, \(\hat{\beta}_0 = 10.278\), has no practical interpretation because x = 0 is outside the sampled range.

Example
Step 3: Specify the probability distribution of the random error component \(\varepsilon\). The estimate of \(\sigma\), the standard deviation of \(\varepsilon\), is

s = 2.31635

This implies that about 95% of the observed fire damage (y) values will fall within approximately 2s = 4.63 thousand dollars of their respective predicted values when using the least squares line.
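
The value of s is tied to the Error row of the Analysis of Variance table on the printout. The sketch below assumes the standard relation \(s = \sqrt{SSE / (n - 2)}\) and reads SSE and its degrees of freedom from that row; the variable names are hypothetical.

```python
import math

sse = 69.751     # Error Seq SS from the fire damage printout
df_error = 13    # Error DF, i.e. n - 2 with n = 15 fires

# Estimated standard deviation of the random error, in thousands of dollars.
s = math.sqrt(sse / df_error)
two_s = 2 * s

print(f"s = {s:.5f} thousand dollars")
print(f"2s = {two_s:.2f} thousand dollars")
```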

Example
Step 4: First, test the null hypothesis that the slope \(\beta_1\) is 0, that is, that there is no linear relationship between fire damage and the distance from the nearest fire station, against the alternative hypothesis that fire damage increases as the distance increases. We test
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 > 0\)
The two-tailed observed significance level for testing is approximately 0. Dividing by 2, the one-tailed p-value is also approximately 0. (p-value < \(\alpha\), so reject H0.)

Example
The 95% confidence interval for the slope \(\beta_1\) is (4.070, 5.768). We estimate (with 95% confidence) that the interval from $4,070 to $5,768 encloses the mean increase (\(\beta_1\)) in fire damage per additional mile distance from the fire station.
The coefficient of determination is \(r^2 = .9235\), which implies that about 92% of the sample variation in fire damage (y) is explained by the distance (x) between the fire and the fire station.
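
The slope interval follows from the Coef and SE Coef entries on the printout. The sketch below assumes the usual form, estimate plus or minus t times standard error, and uses 2.160, the standard t-table value for area .025 in the upper tail with 13 degrees of freedom, which does not appear on the slides. Rounding of the inputs can move the last digit of each endpoint.

```python
b1 = 4.9193       # Coef for DISTANCE from the printout
se_b1 = 0.39275   # SE Coef for DISTANCE from the printout

# Assumption: standard t-table value for 13 df and area .025 in the upper tail.
T_CRITICAL = 2.160

margin = T_CRITICAL * se_b1
lower, upper = b1 - margin, b1 + margin

# Damage is recorded in thousands of dollars, so scale for reporting.
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f})")
print(f"in dollars per mile: ${lower * 1000:,.0f} to ${upper * 1000:,.0f}")
```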

Example
The coefficient of correlation, r, which measures the strength of the linear relationship between y and x, must be calculated:
\(r = +\sqrt{r^2} = \sqrt{.9235} \approx .96\)
The high correlation confirms our conclusion that \(\beta_1\) is greater than 0; it appears that fire damage and distance from the fire station are positively correlated. All signs point to a strong linear relationship between y and x.

Example
Step 5: We are now prepared to use the least
squares model. Suppose the insurance company
wants to predict the fire damage if a major
residential fire were to occur 3.5 miles from the
nearest fire station. A 95% confidence interval for
E(y) and prediction interval for y when x = 3.5 are
shown on the Minitab printout on the next slide.

Example
The predicted value (highlighted on the printout) is \(\hat{y} = 27.4956\), while the 95% prediction interval (also highlighted) is (22.3239, 32.6672). Therefore, with 95% confidence we predict fire damage in a major residential fire 3.5 miles from the nearest station to be between $22,324 and $32,667.
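
The point prediction itself is just the fitted line evaluated at x = 3.5. The sketch below uses the coefficients from the regression equation on the printout and, as an illustrative design choice, refuses to predict outside the sampled range of .7 to 6.1 miles mentioned earlier. The function name `predict_damage` is hypothetical.

```python
B0 = 10.2779    # intercept from the fire damage printout
B1 = 4.91933    # slope from the fire damage printout
X_MIN, X_MAX = 0.7, 6.1   # sampled range of distances, in miles


def predict_damage(distance_miles):
    """Predicted fire damage in thousands of dollars at a given distance."""
    if not X_MIN <= distance_miles <= X_MAX:
        raise ValueError("distance is outside the sampled range")
    return B0 + B1 * distance_miles


y_hat = predict_damage(3.5)

# The 95% prediction interval from the slides is centered on the prediction.
pi_low, pi_high = 22.3239, 32.6672
midpoint = (pi_low + pi_high) / 2

print(f"predicted damage: {y_hat:.4f} thousand dollars")
print(f"midpoint of 95% PI: {midpoint:.4f} thousand dollars")
```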

Key Ideas
Simple Linear Regression Variables
y = Dependent variable (quantitative)
x = Independent variable (quantitative)

Method of Least Squares Properties


1. average error of prediction = 0
2. sum of squared errors is minimum

Key Ideas
Practical Interpretation of y-intercept
predicted y value when x = 0
(no practical interpretation if x = 0 is either nonsensical
or outside range of sample data)

Practical Interpretation of Slope


Increase or decrease in y for every 1-unit increase in x

Key Ideas
First-Order (Straight Line) Model
\(E(y) = \beta_0 + \beta_1 x\)
where E(y) = mean of y
\(\beta_0\) = y-intercept of line (point where line intercepts the y-axis)
\(\beta_1\) = slope of line (change in y for every 1-unit change in x)

Key Ideas
Coefficient of Correlation, r
1. Ranges between -1 and +1
2. Measures strength of linear relationship between y and x

Coefficient of Determination, \(r^2\)
1. Ranges between 0 and 1
2. Measures proportion of sample variation in y explained by the model

Key Ideas
Practical Interpretation of Model Standard Deviation, s
About 95% of y-values fall within 2s of their respective predicted values
Width of confidence interval for E(y) will always be narrower than width of prediction interval for y
