Week 6 in Class Lecture

Week 6
Simple Linear Regression
Models
Representation of some phenomenon

A mathematical model is a mathematical
expression of some phenomenon
Often describes relationships between
variables
Types
Deterministic models
Probabilistic models
Deterministic Models
Hypothesize exact relationships

Suitable when prediction error is negligible
Example: force is exactly mass times
acceleration
F = ma
Probabilistic Models
Hypothesize two components

Deterministic
Random error
Example: sales volume (y) is 10 times
advertising spending (x) + random error
\(y = 10x + \varepsilon\)
Random error may be due to factors
other than advertising
General Form of Probabilistic Models
y = Deterministic component + Random error
where y is the variable of interest.
We always assume that the mean value of the
random error equals 0, so the mean value of y
equals the deterministic component:
E(y) = Deterministic component
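A small simulation can make the zero-mean assumption concrete. The sketch below reuses the sales example (deterministic component 10x) and assumes, purely for illustration, a normally distributed error with standard deviation 5; neither the distribution nor the value of sigma comes from the slides.

```python
import random

def simulate_sales(x, n, sigma=5.0, seed=0):
    """Draw n sales values from y = 10x + error, with error ~ Normal(0, sigma).

    sigma and the normal shape are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [10 * x + rng.gauss(0.0, sigma) for _ in range(n)]

samples = simulate_sales(x=5, n=10_000)
mean_y = sum(samples) / len(samples)
print(f"deterministic component: {10 * 5}, simulated mean of y: {mean_y:.2f}")
```

Individual values scatter around 50, but their average stays close to the deterministic component, which is what E(y) = Deterministic component expresses.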
A First-Order (Straight Line) Probabilistic Model
\(y = \beta_0 + \beta_1 x + \varepsilon\)
where
y = Dependent or response variable
(variable to be modeled)
x = Independent or predictor variable
(variable used as a predictor of y)
\(E(y) = \beta_0 + \beta_1 x\) = Deterministic component
\(\varepsilon\) (epsilon) = Random error component
\(\beta_0\) (beta zero) = y-intercept of the line, that is, the
point at which the line intercepts
or cuts through the y-axis
\(\beta_1\) (beta one) = slope of the line, that is, the
change (amount of increase or
decrease) in the deterministic
component of y for every 1-unit
increase in x
A positive slope implies that E(y) increases by the
amount \(\beta_1\) for each unit increase in x.
A negative slope implies that E(y) decreases by
the amount \(|\beta_1|\) for each unit increase in x.
Five-Step Procedure
Step 1: Hypothesize the deterministic component of the
model that relates the mean, E(y), to the
independent variable x.
Step 2: Use the sample data to estimate unknown
parameters in the model.
Step 3: Specify the probability distribution of the
random error and estimate the standard
deviation of this distribution.
Step 4: Statistically evaluate the usefulness of the
model.
Step 5: When satisfied that the model is useful, use it for
prediction, estimation, and other purposes.
Scattergram
1. Plot of all \((x_i, y_i)\) pairs
2. Suggests how well the model will fit
[Figure: scattergram of y versus x]
Thinking Challenge
How would you draw a line through the points?
How do you determine which line fits best?
[Figure: scattergram of y versus x]
Least Squares Line
The least squares line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)
is one that has
the following two properties:
1. The sum of the errors equals 0,
i.e., mean error = 0.
2. The sum of squared errors (SSE) is smaller than
for any other straight-line model, i.e., the error
variance is minimum.
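The two properties above determine a unique line. As a sketch, the standard closed-form estimates can be written as follows; the shorthand \(SS_{xy}\) and \(SS_{xx}\) is notation assumed here and does not appear in the slides.

```latex
\[
\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
\[
SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
```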
Interpreting the Estimates of \(\beta_0\) and \(\beta_1\) in Simple Linear Regression
y-intercept:
\(\hat{\beta}_0\) represents the predicted value of y
when x = 0 (Caution: This value will not
be meaningful if the value x = 0 is
nonsensical or outside the range of the
sample data.)
slope:
\(\hat{\beta}_1\) represents the increase (or decrease) in y
for every 1-unit increase in x (Caution:
This interpretation is valid only for
x-values within the range of the sample
data.)
Minitab
[Figure: Minitab screenshot with the dependent and independent variables labeled]
Least Squares Example
You're an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)   Yield (lb.)
4                  3.0
6                  5.5
10                 6.5
12                 9.0
Find the least squares line relating
crop yield and fertilizer.
Scattergram
Crop Yield vs. Fertilizer*
Stat -> Regression -> Fitted Line Plot
Coefficient Interpretation
Solution
\(\hat{y} = 0.8 + 0.65x\)
1. Slope (\(\hat{\beta}_1\))
Crop Yield (y) is expected to increase by .65 lb. for
each 1 lb. increase
in Fertilizer (x)
2. y-Intercept (\(\hat{\beta}_0\))
Since x = 0 is outside of the range of the sampled
values of x, the y-intercept has no meaningful
interpretation.
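The fitted line for the fertilizer data can be checked by hand with a few lines of Python. This is a minimal sketch using the usual sums-of-squares formulas rather than Minitab; the function name `least_squares` is made up for the example.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

fertilizer = [4, 6, 10, 12]          # lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]    # lb.

b0, b1 = least_squares(fertilizer, crop_yield)
residuals = [y - (b0 + b1 * x) for x, y in zip(fertilizer, crop_yield)]
print(f"y-hat = {b0:.2f} + {b1:.2f} x")   # y-hat = 0.80 + 0.65 x
print(f"sum of errors is zero up to rounding: {abs(sum(residuals)) < 1e-9}")
```

The residuals sum to zero (up to floating-point rounding), which is the first property of the least squares line.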
Basic Assumptions of the Probability Distribution
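A Test of Model Utility: Simple Linear Regression
One-Tailed Test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 < 0\) (or \(H_a: \beta_1 > 0\))
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1}\) is the estimated standard error of \(\hat{\beta}_1\)
Rejection region: \(t < -t_\alpha\) (or \(t > t_\alpha\) when \(H_a: \beta_1 > 0\))
where \(t_\alpha\) is based on (n - 2) degrees of freedom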
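A Test of Model Utility: Simple Linear Regression
Two-Tailed Test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1}\) is the estimated standard error of \(\hat{\beta}_1\)
Rejection region: \(|t| > t_{\alpha/2}\)
where \(t_{\alpha/2}\) is based on (n - 2) degrees of freedom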
Interpreting p-Values for Coefficients in Regression
Almost all statistical computer software packages
report a two-tailed p-value for each of the
parameters in the regression model. For example,
in simple linear regression, the p-value for the two-tailed
test \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\) is given on
the printout. If you want to conduct a one-tailed
test of hypothesis, you will need to adjust the
p-value accordingly.
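One common way to make that adjustment is to halve the two-tailed p-value when the sign of the test statistic agrees with the alternative, and to use one minus that half otherwise. The helper below is a hypothetical sketch of this rule; the function name and the sample inputs (p = 0.035, t = 3.656) are chosen only for illustration.

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a two-tailed p-value for H0: beta1 = 0 into a one-tailed one.

    alternative: "greater" for Ha: beta1 > 0, "less" for Ha: beta1 < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    agrees = t_stat > 0 if alternative == "greater" else t_stat < 0
    half = two_tailed_p / 2
    return half if agrees else 1 - half

print(one_tailed_p(0.035, 3.656, "greater"))  # sign agrees with Ha: half of 0.035
print(one_tailed_p(0.035, 3.656, "less"))     # sign disagrees with Ha: 1 - half
```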
Test of Slope Coefficient Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Is the relationship significant
at the .05 level of significance?

Solution
\(H_0: \beta_1 = 0\)
If the slope (\(\beta_1\)) is zero, then there is no linear relationship.
\(H_a: \beta_1 \neq 0\)
This is the claim: there is a relationship because the slope is not zero.

Solution
Calc -> Probability Distributions -> t
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): ±3.182
[Figure: t distribution with rejection regions of area .025 in each tail, t < -3.182 and t > 3.182]
Inverse Cumulative Distribution Function
Student's t distribution with 3 DF
P( X <= x )   x
0.025         -3.18245
Computer Output
Stat -> Regression -> General Regression

General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term                    Coef   SE Coef    T          P
Constant                -0.1   0.635085   -0.15746   0.885
Ad Expenditure (100$)    0.7   0.191485    3.65563   0.035
Summary of Model
S = 0.605530      R-Sq = 81.67%         R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%
Analysis of Variance
Source                    DF   Seq SS   Adj SS   Adj MS    F         P-Value
Regression                1    4.9      4.9      4.90000   13.3636   0.0353528
  Ad Expenditure (100$)   1    4.9      4.9      4.90000   13.3636   0.0353528

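Solution
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): ±3.182
[Figure: t distribution with rejection regions of area .025 in each tail, t < -3.182 and t > 3.182]
Test Statistic: t = 3.656
P-Value = 0.035
Decision: Reject \(H_0\) at \(\alpha = .05\) because t = 3.656 > 3.182, or equivalently because the P-value (0.035) is smaller than \(\alpha\).
Conclusion: There is evidence of a
relationship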
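The test statistic in this solution can be reproduced from the five data points. The sketch below assumes the standard formulas \(s = \sqrt{SSE/(n-2)}\) and \(s_{\hat{\beta}_1} = s/\sqrt{SS_{xx}}\), and takes the critical value 3.182 from the slide instead of computing it.

```python
import math

ad = [1, 2, 3, 4, 5]       # ad expenditure, in $100
sales = [1, 1, 2, 2, 4]    # units

n = len(ad)
x_bar, y_bar = sum(ad) / n, sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad, sales))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad, sales))
s = math.sqrt(sse / (n - 2))     # estimated std. dev. of the random error
se_b1 = s / math.sqrt(ss_xx)     # estimated standard error of the slope
t_stat = b1 / se_b1

T_CRIT = 3.182  # t_.025 with 3 df, copied from the slide
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, s = {s:.4f}, t = {t_stat:.3f}")
print("reject H0" if abs(t_stat) > T_CRIT else "fail to reject H0")
```

The values of s, the standard error and t agree with the S, SE Coef and T entries on the Minitab printout.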
Correlation Models
Answers "How strong is the linear
relationship between two variables?"
Coefficient of correlation
Sample correlation coefficient denoted r
Population correlation coefficient denoted \(\rho\) (rho)
Values range from -1 to +1
Coefficient of Correlation Example
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Calculate the coefficient of
correlation.
Solution
\(r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \dfrac{7}{\sqrt{(10)(6)}}\)
r = 0.9037
\(r \approx 0.904\)
0.904 -- Strong Positive Relation between x and y
Example
You're an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)   Yield (lb.)
4                  3.0
6                  5.5
10                 6.5
12                 9.0
Find the coefficient of correlation.
Solution
\(r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \dfrac{26}{\sqrt{(40)(18.5)}} \approx 0.956\)
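Both correlation examples can be verified with one short function. This is a sketch based on the sums-of-squares formula for r; the function name `correlation` is invented for the example.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r_ads = correlation([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
r_crop = correlation([4, 6, 10, 12], [3.0, 5.5, 6.5, 9.0])
print(f"ad expenditure vs. sales: r = {r_ads:.3f}")    # r = 0.904
print(f"fertilizer vs. yield:     r = {r_crop:.3f}")   # r = 0.956
```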
Coefficient of Determination
It represents the proportion of the total sample
variability around \(\bar{y}\) that is explained by the linear
relationship between y and x.
\(r^2 = (\text{coefficient of correlation})^2\)
\(0 \le r^2 \le 1\)
Coefficient of Determination Example
You know r = .904.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Calculate and interpret the
coefficient of determination.
Coefficient of Determination Solution
\(r^2 = (\text{coefficient of correlation})^2\)
\(r^2 = (.904)^2\)
\(r^2 = .817\)
Interpretation: About 81.7% of the sample variation
in Sales (y) can be explained by using Ad $ (x) to
predict Sales (y) in the linear model. The remaining
18.3% is due to other factors.
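The "explained variation" reading can be checked directly. The sketch below uses the equivalent formula \(r^2 = 1 - SSE/SS_{yy}\), which is a standard identity for simple linear regression but is not stated in the slides, together with the estimates -0.1 and 0.7 from the example.

```python
ad = [1, 2, 3, 4, 5]       # ad expenditure, in $100
sales = [1, 1, 2, 2, 4]    # units
B0, B1 = -0.1, 0.7         # least squares estimates from the example

y_bar = sum(sales) / len(sales)
ss_yy = sum((y - y_bar) ** 2 for y in sales)                     # total variation
sse = sum((y - (B0 + B1 * x)) ** 2 for x, y in zip(ad, sales))   # unexplained variation
r_squared = 1 - sse / ss_yy

print(f"SS_yy = {ss_yy:.1f}, SSE = {sse:.1f}, r^2 = {r_squared:.4f}")
```

The two sums correspond to the Total and Error rows of the Seq SS column in the Minitab analysis of variance table (6.0 and 1.1).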
Regression Modeling Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term
   Estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Prediction With Regression Models
Types of predictions
  Point estimates
  Interval estimates
What is predicted
  Population mean value of y, E(y), for given x (confidence interval)
  Individual response (\(y_i\)) for given x (prediction interval)
Confidence Interval Estimate Example
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Find a 95% confidence interval for
the mean sales when advertising is $400 (x = 4).
Confidence Interval Solution
Interval Estimate
Computer Output
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term                    Coef   SE Coef    T          P
Constant                -0.1   0.635085   -0.15746   0.885
Ad Expenditure (100$)    0.7   0.191485    3.65563   0.035
Summary of Model
S = 0.605530      R-Sq = 81.67%         R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%
Analysis of Variance
Source                    DF   Seq SS   Adj SS   Adj MS    F         P
Regression                1    4.9      4.9      4.90000   13.3636   0.0353528
  Ad Expenditure (100$)   1    4.9      4.9      4.90000   13.3636   0.0353528
Error                     3    1.1      1.1      0.36667
Total                     4    6.0
Fits and Diagnostics for Unusual Observations
No unusual observations
Predicted Values for New Observations
New Obs   Fit   SE Fit     95% CI               95% PI
1         2.7   0.331662   (1.64450, 3.75550)   (0.502806, 4.89719)
Values of Predictors for New Observations
New Obs   Ad Expenditure (100$)
1         4
Interval Estimate
Computer Output
New Obs   Fit   SE Fit     95% CI               95% PI
1         2.7   0.331662   (1.64450, 3.75550)   (0.502806, 4.89719)
New Obs   Ad Expenditure (100$)
1         4
Fit = predicted y when x = 4
SE Fit = \(s_{\hat{y}}\)
95% CI = confidence interval
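The Fit, SE Fit and 95% CI entries can be reproduced by hand. This is a sketch that assumes the standard formula \(\hat{y} \pm t_{\alpha/2}\, s \sqrt{1/n + (x_p - \bar{x})^2 / SS_{xx}}\); the constants for n, the mean of x and \(SS_{xx}\) are computed from x = 1, ..., 5, and the t value is the one shown earlier in the slides.

```python
import math

N = 5              # sample size
X_BAR = 3.0        # mean of the ad expenditures 1..5
SS_XX = 10.0       # sum of squared deviations of x about its mean
S = 0.605530       # S from the printout
T_CRIT = 3.18245   # t_.025 with 3 df, from the slides
B0, B1 = -0.1, 0.7

def mean_response_ci(x_p):
    """Return (fit, se_fit, (lower, upper)) for E(y) at x = x_p."""
    fit = B0 + B1 * x_p
    se_fit = S * math.sqrt(1 / N + (x_p - X_BAR) ** 2 / SS_XX)
    margin = T_CRIT * se_fit
    return fit, se_fit, (fit - margin, fit + margin)

fit, se_fit, (lower, upper) = mean_response_ci(4)
print(f"fit = {fit:.1f}, SE fit = {se_fit:.6f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```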
Prediction Interval Example
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Predict the sales when advertising
is $400. Use a 95% prediction interval.
Prediction Interval Solution
Interval Estimate
Computer Output
New Obs   Fit   SE Fit     95% CI               95% PI
1         2.7   0.331662   (1.64450, 3.75550)   (0.502806, 4.89719)
New Obs   Ad Expenditure (100$)
1         4
Fit = predicted y when x = 4
SE Fit = \(s_{\hat{y}}\)
95% PI = prediction interval
Confidence Intervals v. Prediction Intervals
The prediction interval is always wider than
the corresponding confidence interval
because of the added uncertainty involved in predicting a single
response versus the mean response
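The width comparison can be illustrated with the advertising data. The sketch below assumes the standard interval formulas, in which the prediction interval carries an extra "1 +" under the square root for the error of a single observation; the constants come from the printout and from x = 1, ..., 5.

```python
import math

N, X_BAR, SS_XX = 5, 3.0, 10.0    # computed from x = 1, 2, 3, 4, 5
S, T_CRIT = 0.605530, 3.18245     # from the printout and the t table (3 df)
B0, B1 = -0.1, 0.7

def intervals(x_p):
    """Return (ci, pi): intervals for the mean response and a single response."""
    fit = B0 + B1 * x_p
    h = 1 / N + (x_p - X_BAR) ** 2 / SS_XX
    ci_margin = T_CRIT * S * math.sqrt(h)
    pi_margin = T_CRIT * S * math.sqrt(1 + h)   # extra 1: a single y's own error
    return (fit - ci_margin, fit + ci_margin), (fit - pi_margin, fit + pi_margin)

for x_p in [1, 2, 3, 4, 5]:
    ci, pi = intervals(x_p)
    print(f"x = {x_p}: CI width = {ci[1] - ci[0]:.3f}, PI width = {pi[1] - pi[0]:.3f}")
```

Both intervals are narrowest at the mean of x and widen as x moves away from it, but the prediction interval stays wider at every x.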
Confidence Intervals v. Prediction Intervals
[Figure: plot of y versus x]
Example
Suppose a fire insurance company wants to relate
the amount of fire damage in major residential
fires to the distance between the burning house
and the nearest fire station. The study is to be
conducted in a large suburb of a major city; a
sample of 15 recent fires in this suburb is
selected. The amount of damage, y, and the
distance between the fire and the nearest fire
station, x, are recorded for each fire.
Example
Step 1: First, we hypothesize a model to relate
fire damage, y, to the distance from the nearest
fire station, x. We hypothesize a straight-line
probabilistic model:
\(y = \beta_0 + \beta_1 x + \varepsilon\)
Example
Step 2: Use a statistical software package to
estimate the unknown parameters in the
deterministic component of the hypothesized
model. The least squares estimates of the slope
\(\beta_1\) and intercept \(\beta_0\), highlighted on the printout, are
\(\hat{\beta}_1 = 4.9193\) and \(\hat{\beta}_0 = 10.2779\)
Example
General Regression Analysis: DAMAGE versus DISTANCE
Regression Equation
DAMAGE = 10.2779 + 4.91933 DISTANCE   (least squares equation)
Coefficients
Term       Coef      SE Coef   T         P
Constant   10.2779   1.42028   7.2366    0.000
DISTANCE   4.9193    0.39275   12.5254   0.000
Summary of Model
S = 2.31635       R-Sq = 92.35%         R-Sq(adj) = 91.76%
PRESS = 93.2117   R-Sq(pred) = 89.77%
Analysis of Variance
Source       DF   Seq SS    Adj SS    Adj MS    F         P
Regression   1    841.766   841.766   841.766   156.886   0.0000000
  DISTANCE   1    841.766   841.766   841.766   156.886   0.0000000
Error        13   69.751    69.751    5.365
Total        14   911.517
Example
This prediction equation is graphed in the
Minitab Fitted Line Plot.
Example
The least squares estimate of the slope, \(\hat{\beta}_1 = 4.919\),
implies that the estimated mean damage increases
by $4,919 for each additional mile from the fire
station. This interpretation is valid over the range
of x, or from .7 to 6.1 miles from the station. The
estimated y-intercept, \(\hat{\beta}_0 = 10.278\), has no
practical interpretation because x = 0 is outside
the sampled range.
Example
Step 3: Specify the probability distribution of the
random error component \(\varepsilon\). The estimate of the
standard deviation \(\sigma\) of \(\varepsilon\) is
s = 2.31635
This implies that approximately 95% of the observed fire
damage (y) values will fall within
2s = 4.63 thousand dollars of their respective
predicted values when using the least squares
line.
Example
Step 4: First, test the null hypothesis that the
slope \(\beta_1\) is 0, that is, that there is no linear
relationship between fire damage and the
distance from the nearest fire station, against the
alternative hypothesis that fire damage increases
as the distance increases. We test
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 > 0\)
The two-tailed observed significance level for
testing is approximately 0. Dividing by 2, the one-tailed p-value
is also approximately 0. (p-value < \(\alpha\), so reject \(H_0\))
Example
The 95% confidence interval for the slope \(\beta_1\) is (4.070, 5.768).
We estimate (with 95% confidence) that the
interval from $4,070 to $5,768 encloses the mean
increase (\(\beta_1\)) in fire damage per additional mile
distance from the fire station.
The coefficient of determination is \(r^2 = .9235\),
which implies that about 92% of the sample
variation in fire damage (y) is explained by the
distance (x) between the fire and the fire station.
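The slope interval can be rebuilt from the printout as estimate ± t × standard error. The sketch below assumes the table value \(t_{.025} = 2.160\) for 13 degrees of freedom, which the slides do not show.

```python
B1_HAT = 4.9193    # slope estimate (Coef for DISTANCE) from the printout
SE_B1 = 0.39275    # SE Coef for DISTANCE from the printout
T_CRIT = 2.160     # assumed t table value: t_.025 with n - 2 = 13 df

margin = T_CRIT * SE_B1
lower, upper = B1_HAT - margin, B1_HAT + margin
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f}) thousand dollars per mile")
```

The result matches the interval quoted above up to rounding in the last digit of the lower limit.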
Example
The coefficient of correlation, r, that measures the
strength of the linear relationship between y and x
must be calculated:
\(r = +\sqrt{r^2} = +\sqrt{.9235} = .96\)
The high correlation confirms our conclusion that
\(\beta_1\) is greater than 0; it appears that fire damage and
distance from the fire station are positively
correlated. All signs point to a strong linear
relationship between y and x.
Example
Step 5: We are now prepared to use the least
squares model. Suppose the insurance company
wants to predict the fire damage if a major
residential fire were to occur 3.5 miles from the
nearest fire station. A 95% confidence interval for
E(y) and prediction interval for y when x = 3.5 are
shown on the Minitab printout on the next slide.
Example
The predicted value (highlighted on the printout) is
\(\hat{y} = 27.4956\), while the 95% prediction interval
(also highlighted) is (22.3239, 32.6672).
Therefore, with 95% confidence we predict fire
damage in a major residential fire 3.5 miles from
the nearest station to be between $22,324 and
$32,667.
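The point prediction follows from plugging x = 3.5 into the regression equation on the printout. The sketch below also checks that it sits at the midpoint of the quoted prediction interval; the function name `predict_damage` is made up for the example.

```python
B0_HAT, B1_HAT = 10.2779, 4.91933   # from the regression equation on the printout
PI_95 = (22.3239, 32.6672)          # 95% prediction interval quoted above

def predict_damage(distance_miles):
    """Predicted fire damage, in thousands of dollars."""
    return B0_HAT + B1_HAT * distance_miles

y_hat = predict_damage(3.5)
midpoint = sum(PI_95) / 2
print(f"predicted damage: {y_hat:.4f} thousand dollars")
print(f"PI midpoint:      {midpoint:.4f} thousand dollars")
```

Because the interval has the form ŷ ± margin, its midpoint recovers the point prediction.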
Key Ideas
Simple Linear Regression Variables
y = Dependent variable (quantitative)
x = Independent variable (quantitative)
Method of Least Squares Properties

1. average error of prediction = 0
2. sum of squared errors is minimum
Key Ideas
Practical Interpretation of y-intercept
predicted y value when x = 0
(no practical interpretation if x = 0 is either nonsensical
or outside range of sample data)
Practical Interpretation of Slope

Increase or decrease in y for every 1-unit increase in x
Key Ideas
First-Order (Straight Line) Model
\(E(y) = \beta_0 + \beta_1 x\)
where E(y) = mean of y
\(\beta_0\) = y-intercept of line (point where line intercepts the y-axis)
\(\beta_1\) = slope of line (change in y for every 1-unit change in x)
Key Ideas
Coefficient of Correlation, r
1. Ranges between -1 and 1
2. Measures strength of linear relationship between y
and x
Coefficient of Determination, \(r^2\)
1. Ranges between 0 and 1
2. Measures proportion of sample variation in y
explained by the model
Key Ideas
Practical Interpretation of Model
Standard Deviation, s
Approximately 95% of y-values fall within 2s of
their respective predicted values
Width of confidence interval for E(y) will
always be narrower than width of prediction
interval for y
