Models
Deterministic models
Probabilistic models
Deterministic Models
F = ma
Probabilistic Models
\(y = \beta_0 + \beta_1 x + \varepsilon\)
where
y = Dependent or response variable
(variable to be modeled)
x = Independent or predictor variable
(variable used as a predictor of y)
\(E(y) = \beta_0 + \beta_1 x\) = Deterministic component
\(\varepsilon\) (epsilon) = Random error component
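Five-Step Procedure
Step 1: Hypothesize the deterministic component of the model
Step 2: Estimate the unknown model parameters
Step 3: Specify the probability distribution of the random
error term and estimate its standard deviation
Step 4: Evaluate the model
Step 5: Use the model for prediction and estimation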
Scattergram
1. Plot of all (xi, yi) pairs
2. Suggests how well model will fit
[Figure: scattergram of the (xi, yi) pairs; both axes run from 0 to 60]
Thinking Challenge
How would you draw a line through the points?
How do you determine which line fits best?
[Figure: scattergram of y against x; both axes run from 0 to 60]
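One standard answer to the second question is the least squares criterion: among all straight lines, choose the one with the smallest sum of squared vertical deviations (SSE). The sketch below is illustrative only. It borrows the five (x, y) pairs from the advertising example that appears later in this deck, and the function names and the eyeballed comparison line are assumptions made for the illustration.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared vertical deviations of the data from the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Ad expenditure (100$) and sales (units) from the deck's advertising example.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

b0, b1 = least_squares(xs, ys)
sse_fit = sse(xs, ys, b0, b1)

# A hypothetical eyeballed line through the origin, for comparison only.
sse_guess = sse(xs, ys, 0.0, 0.6)

print(f"least squares line: y = {b0:.2f} + {b1:.2f}x, SSE = {sse_fit:.2f}")
print(f"eyeballed line:     y = 0.00 + 0.60x, SSE = {sse_guess:.2f}")
```

Any other candidate line can be passed to `sse` in the same way; none should produce a smaller SSE than the fitted line.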
Minitab
Scattergram: Crop Yield vs. Fertilizer
Stat -> Regression -> Fitted Line Plot
(Dependent: Crop Yield; Independent: Fertilizer)
Coefficient Interpretation
Solution
\(\hat{y} = .8 + .65x\)
1. Slope (\(\hat{\beta}_1\))
Crop Yield (y) is expected to increase by .65 lb. for
each 1 lb. increase in Fertilizer (x)
2. y-Intercept (\(\hat{\beta}_0\))
Since 0 is outside of the range of the sampled
values of x, the y-intercept has no meaningful
interpretation.
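The coefficients being interpreted here can be reproduced from the four fertilizer and yield pairs listed in the deck's correlation example. The following is a minimal sketch, assuming those four pairs are the full sample behind the fitted line; the variable names are illustrative.

```python
# Fertilizer (lb.) and crop yield (lb.) pairs from the deck's correlation example.
fertilizer = [4, 6, 10, 12]
crop_yield = [3.0, 5.5, 6.5, 9.0]

n = len(fertilizer)
x_bar = sum(fertilizer) / n
y_bar = sum(crop_yield) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fertilizer, crop_yield))
ss_xx = sum((x - x_bar) ** 2 for x in fertilizer)

slope = ss_xy / ss_xx              # expected change in yield per extra lb. of fertilizer
intercept = y_bar - slope * x_bar  # fitted value at x = 0

# The intercept is only interpretable if x = 0 lies inside the sampled range of x.
zero_in_sampled_range = min(fertilizer) <= 0 <= max(fertilizer)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"x = 0 inside sampled range: {zero_in_sampled_range}")
```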
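Testing the slope \(\beta_1\)
One-tailed test: H0: \(\beta_1 = 0\) against Ha: \(\beta_1 < 0\) (or Ha: \(\beta_1 > 0\))
Two-tailed test: H0: \(\beta_1 = 0\) against Ha: \(\beta_1 \neq 0\)
Test Statistic: \(t = \hat{\beta}_1 / s_{\hat{\beta}_1}\)
Rejection region: \(t < -t_\alpha\) (or \(t > t_\alpha\) when Ha: \(\beta_1 > 0\)); \(|t| > t_{\alpha/2}\) for the two-tailed test
where t is based on (n - 2) degrees of freedom
Test of \(\beta_1\) with n = 5 observations
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): t = -3.182 and t = 3.182
(Reject H0 in either tail; area .025 in each tail)
Minitab inverse cumulative distribution (Student's t, 3 df):
P( X <= x ) = 0.025 at x = -3.18245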
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term                    Coef   SE Coef    T         P
Constant                -0.1   0.635085   -0.15746  0.885
Ad Expenditure (100$)    0.7   0.191485    3.65563  0.035
Summary of Model
S = 0.605530      R-Sq = 81.67%        R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%
Analysis of Variance
Source                  DF  Seq SS  Adj SS  Adj MS   F        P-Value
Regression               1  4.9     4.9     4.90000  13.3636  0.0353528
Ad Expenditure (100$)    1  4.9     4.9     4.90000  13.3636  0.0353528
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \neq 0\)
\(\alpha = .05\)
df = 5 - 2 = 3
Critical Value(s): t = -3.182 and t = 3.182
(Reject H0 in either tail; area .025 in each tail)
Test Statistic: t = 3.656
P-Value = 0.035
Decision:
Reject H0 at \(\alpha = .05\) because t > 3.182
(equivalently, because the P-value is smaller than \(\alpha\)).
Conclusion:
There is evidence of a relationship
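The test statistic in this decision can be traced back to the raw data. The sketch below recomputes it from the five advertising and sales pairs used in the deck's examples, assuming the usual formulas s = sqrt(SSE / (n - 2)) and standard error of the slope = s / sqrt(SSxx). The critical value 3.182 is copied from the slide rather than computed, and the variable names are illustrative.

```python
import math

# Ad expenditure (100$) and sales (units) from the deck's advertising example.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Estimated standard deviation of the random error, on n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta_1 = 0.
se_b1 = s / math.sqrt(ss_xx)
t_stat = b1 / se_b1

t_crit = 3.182  # two-tailed critical value for alpha = .05, 3 df, as quoted on the slide
reject_h0 = abs(t_stat) > t_crit

print(f"s = {s:.6f}, SE(b1) = {se_b1:.6f}, t = {t_stat:.5f}, reject H0: {reject_h0}")
```

These values line up with the S, SE Coef and T entries on the Minitab printout to the precision shown there.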
Correlation Models
Coefficient of Correlation
Example
You're a marketing analyst for Hasbro Toys.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Calculate the coefficient of correlation.
Coefficient of Correlation
Solution
Stat -> Regression -> Fitted Line Plot
\(r = 0.9038805 \approx 0.904\)
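For readers without Minitab, the same coefficient can be computed directly from the five data pairs. This is a minimal sketch assuming the standard sums-of-squares formula r = SSxy / sqrt(SSxx * SSyy); the variable names are illustrative. The result agrees with the slide to three decimals. It differs in the fourth decimal from the longer figure shown on the slide, possibly because that figure was obtained from a rounded r-squared, although the source does not say.

```python
import math

# Ad expenditure (100$) and sales (units) from the example.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)

# Pearson coefficient of correlation.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"r = {r:.4f}")
```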
Coefficient of Correlation
Example
You're an economist for the county cooperative.
You gather the following data:
Fertilizer (lb.)   Yield (lb.)
4                  3.0
6                  5.5
10                 6.5
12                 9.0
Find the coefficient of correlation.
Coefficient of Correlation
Solution
Stat -> Regression -> Fitted Line Plot
\(r = 0.956\)
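As a check on this value, the usual sums-of-squares formula for r can be worked by hand on the four fertilizer and yield pairs. The symbols SS_xy, SS_xx and SS_yy are assumed notation here; they do not survive in the extracted slides.

```latex
\begin{align*}
\bar{x} &= \tfrac{4 + 6 + 10 + 12}{4} = 8, \qquad
\bar{y} = \tfrac{3.0 + 5.5 + 6.5 + 9.0}{4} = 6 \\
SS_{xy} &= \sum (x_i - \bar{x})(y_i - \bar{y}) = 12 + 1 + 1 + 12 = 26 \\
SS_{xx} &= \sum (x_i - \bar{x})^2 = 16 + 4 + 4 + 16 = 40 \\
SS_{yy} &= \sum (y_i - \bar{y})^2 = 9 + 0.25 + 0.25 + 9 = 18.5 \\
r &= \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}
   = \frac{26}{\sqrt{40 \times 18.5}} \approx 0.956
\end{align*}
```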
Coefficient of Determination
It represents the proportion of the total sample
variability around the sample mean \(\bar{y}\) that is
explained by the linear relationship between y and x.
\(r^2 = (\text{coefficient of correlation})^2\), with \(0 \le r^2 \le 1\)
Coefficient of
Determination Example
You're a marketing analyst for Hasbro Toys.
You know r = .904.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Calculate and interpret the
coefficient of determination.
Coefficient of
Determination Solution
\(r^2 = (\text{coefficient of correlation})^2\)
\(r^2 = (.904)^2\)
\(r^2 = .817\)
Interpretation: About 81.7% of the sample variation
in Sales (y) can be explained by using Ad Expenditure (x)
to predict Sales (y) in the linear model. The remaining
18.3% is due to other factors.
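The "explained variation" reading can be made concrete by splitting the total variation in y into the part left over after fitting the line (SSE) and the rest. The sketch below assumes the standard identity r-squared = 1 - SSE / SSyy and uses the five data pairs from this example; the variable names are illustrative.

```python
# Ad expenditure (100$) and sales (units) from the example.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Total variation of y around its mean, and variation left unexplained by the line.
ss_yy = sum((y - y_bar) ** 2 for y in ys)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / ss_yy

print(f"SSyy = {ss_yy:.1f}, SSE = {sse:.1f}, r^2 = {r_squared:.4f}")
```

The two sums of squares correspond to the Total and Error rows of the Analysis of Variance table on the Minitab printout.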
Regression Modeling
Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term
and estimate its standard deviation
4. Evaluate model
5. Use model for prediction and estimation
Types of predictions
Point estimates
Interval estimates
What is predicted
Confidence Interval
Estimate Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Find a 95% confidence interval for
the mean sales when advertising is $400 (x = 4).
Interval Estimate
Computer Output
General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)
Regression Equation
Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)
Coefficients
Term                    Coef   SE Coef    T         P
Constant                -0.1   0.635085   -0.15746  0.885
Ad Expenditure (100$)    0.7   0.191485    3.65563  0.035
Summary of Model
S = 0.605530      R-Sq = 81.67%        R-Sq(adj) = 75.56%
PRESS = 4.43367   R-Sq(pred) = 26.11%
Analysis of Variance
Source                  DF  Seq SS  Adj SS  Adj MS   F        P
Regression               1  4.9     4.9     4.90000  13.3636  0.0353528
Ad Expenditure (100$)    1  4.9     4.9     4.90000  13.3636  0.0353528
Error                    3  1.1     1.1     0.36667
Total                    4  6.0
No unusual observations
New Obs  Ad Expenditure (100$)  Fit  SE Fit    95% CI              95% PI
1        4                      2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)
Interval Estimate
Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations
New Obs  Ad Expenditure (100$)  Fit  SE Fit    95% CI              95% PI
1        4                      2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)
Fit = Predicted y when x = 4
SE Fit = \(s_{\hat{y}}\)
95% CI = Confidence Interval
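The confidence interval on this printout can be rebuilt by hand. The sketch below assumes the standard formula for the standard error of the estimated mean at a given x, namely s * sqrt(1/n + (x - x_bar)^2 / SSxx), and takes the t value 3.18245 (3 df, upper-tail area .025) from the inverse cumulative distribution output shown earlier in the deck. Variable names are illustrative.

```python
import math

# Ad expenditure (100$) and sales (units) from the example.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

x_new = 4          # advertising of $400, in units of 100$
t_crit = 3.18245   # t value for 3 df and tail area .025, as shown in the deck

fit = b0 + b1 * x_new
# Standard error of the estimated MEAN of y at x_new (the "SE Fit" column).
se_fit = s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / ss_xx)

ci_low = fit - t_crit * se_fit
ci_high = fit + t_crit * se_fit

print(f"fit = {fit:.1f}, SE fit = {se_fit:.6f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```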
Prediction Interval
Example
You're a marketing analyst for Hasbro Toys.
You find \(\hat{\beta}_0 = -.1\), \(\hat{\beta}_1 = .7\) and s = .6055.
Ad Expenditure (100$)   Sales (Units)
1                       1
2                       1
3                       2
4                       2
5                       4
Predict the sales when advertising
is $400 (x = 4). Use a 95% prediction interval.
Interval Estimate
Computer Output
Fits and Diagnostics for Unusual Observations
No unusual observations
New Obs  Ad Expenditure (100$)  Fit  SE Fit    95% CI              95% PI
1        4                      2.7  0.331662  (1.64450, 3.75550)  (0.502806, 4.89719)
Fit = Predicted y when x = 4
SE Fit = \(s_{\hat{y}}\)
95% PI = Prediction Interval
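The prediction interval differs from the confidence interval only in its standard error, which carries an extra "1 +" term for the variability of a single new observation. The sketch below assumes the standard formula s * sqrt(1 + 1/n + (x - x_bar)^2 / SSxx), reuses s = 0.605530 and the t value 3.18245 from the deck's printouts, and compares the two half-widths. Variable names are illustrative.

```python
import math

# Summary quantities for the advertising example (n = 5, x = 1..5).
n = 5
x_bar = 3.0
ss_xx = 10.0
b0, b1 = -0.1, 0.7   # least squares estimates from the printout
s = 0.605530         # estimated standard deviation of the error, from the printout
t_crit = 3.18245     # t value for 3 df and tail area .025, as shown in the deck

x_new = 4
fit = b0 + b1 * x_new

leverage = 1 / n + (x_new - x_bar) ** 2 / ss_xx
se_mean = s * math.sqrt(leverage)        # for the mean of y at x_new
se_pred = s * math.sqrt(1 + leverage)    # for a single new y at x_new

pi_low = fit - t_crit * se_pred
pi_high = fit + t_crit * se_pred

print(f"95% PI = ({pi_low:.4f}, {pi_high:.4f})")
print(f"half-width, CI: {t_crit * se_mean:.4f}  PI: {t_crit * se_pred:.4f}")
```

Because of the extra term under the square root, the prediction half-width exceeds the confidence half-width at every value of x.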
Confidence Intervals v.
Prediction Intervals
The prediction interval is always wider than
the corresponding confidence interval
Added uncertainty involved in predicting a single
response versus the mean response
[Figure: confidence intervals v. prediction intervals, plotted as y against x]
Example
Suppose a fire insurance company wants to relate
the amount of fire damage in major residential
fires to the distance between the burning house
and the nearest fire station. The study is to be
conducted in a large suburb of a major city; a
sample of 15 recent fires in this suburb is
selected. The amount of damage, y, and the
distance between the fire and the nearest fire
station, x, are recorded for each fire.
Example
Step 1: First, we hypothesize a model to relate
fire damage, y, to the distance from the nearest
fire station, x. We hypothesize a straight-line
probabilistic model:
\(y = \beta_0 + \beta_1 x + \varepsilon\)
Example
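Step 2: Use a statistical software package to
estimate the unknown parameters in the
deterministic component of the hypothesized
model. The least squares estimates of the slope
\(\beta_1\) and intercept \(\beta_0\), highlighted on the printout, are
\(\hat{\beta}_1 = 4.9193\) and \(\hat{\beta}_0 = 10.2779\),
so the least squares equation is \(\hat{y} = 10.2779 + 4.9193x\).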
Example
General Regression Analysis: DAMAGE versus DISTANCE
Regression Equation
DAMAGE = 10.2779 + 4.9193 DISTANCE
Coefficients
Term      Coef     SE Coef  T        P
Constant  10.2779  1.42028   7.2366  0.000
DISTANCE   4.9193  0.39275  12.5254  0.000
Summary of Model
S = 2.31635       R-Sq = 92.35%        R-Sq(adj) = 91.76%
PRESS = 93.2117   R-Sq(pred) = 89.77%
Analysis of Variance
Source      DF  Seq SS   Adj SS   Adj MS   F        P
Regression   1  841.766  841.766  841.766  156.886  0.0000000
DISTANCE     1  841.766  841.766  841.766  156.886  0.0000000
Error       13   69.751   69.751    5.365
Total       14  911.517
No unusual observations
Example
This prediction equation is graphed in the
Minitab Fitted Line Plot.
Example
The least squares estimate of the slope, \(\hat{\beta}_1 = 4.9193\),
implies that the estimated mean damage increases
by $4,919 for each additional mile from the fire
station. This interpretation is valid over the range
of x, or from .7 to 6.1 miles from the station. The
estimated y-intercept, \(\hat{\beta}_0 = 10.2779\), has no
practical interpretation because x = 0 is outside
the sampled range.
Example
Step 3: Specify the probability distribution of the
random error component \(\varepsilon\). The estimate of the
standard deviation \(\sigma\) of \(\varepsilon\) is
s = 2.31635
This implies that about 95% of the observed fire
damage (y) values will fall within approximately
2s = 4.63 thousand dollars of their respective
predicted values when using the least squares
line.
Example
Step 4: First, test the null hypothesis that the
slope \(\beta_1\) is 0, that is, that there is no linear
relationship between fire damage and the
distance from the nearest fire station, against the
alternative hypothesis that fire damage increases
as the distance increases. We test
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 > 0\)
The two-tailed observed significance level for
testing is approximately 0. Dividing by 2, the one-tailed
p-value is also approximately 0. (p-value < \(\alpha\), so reject H0)
Example
The 95% confidence interval for the slope \(\beta_1\) is (4.070, 5.768).
We estimate (with 95% confidence) that the
interval from $4,070 to $5,768 encloses the mean
increase (\(\beta_1\)) in fire damage per additional mile
distance from the fire station.
The coefficient of determination is \(r^2 = .9235\),
which implies that about 92% of the sample
variation in fire damage (y) is explained by the
distance (x) between the fire and the fire station.
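The interval for the slope can be approximated from the Coef and SE Coef entries for DISTANCE on the printout. This sketch assumes the usual form estimate plus or minus t times standard error, with t = 2.160 for 13 degrees of freedom and tail area .025. That t value is taken from a standard t table and does not appear in the slides. The result matches the quoted interval to within rounding of the inputs.

```python
b1_hat = 4.9193   # Coef for DISTANCE, from the printout
se_b1 = 0.39275   # SE Coef for DISTANCE, from the printout
t_crit = 2.160    # assumption: t table value for 13 df, tail area .025 (not in the slides)

half_width = t_crit * se_b1
lower = b1_hat - half_width
upper = b1_hat + half_width

# Damage is recorded in thousands of dollars, so multiply by 1000 for dollars per mile.
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f}) thousand dollars per mile")
print(f"in dollars: ${lower * 1000:,.0f} to ${upper * 1000:,.0f}")
```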
Example
The coefficient of correlation, r, that measures the
strength of the linear relationship between y and x
must be calculated:
\(r = +\sqrt{r^2} = \sqrt{.9235} = .96\)
The high correlation confirms our conclusion that
\(\beta_1\) is greater than 0; it appears that fire damage and
distance from the fire station are positively
correlated. All signs point to a strong linear
relationship between y and x.
Example
Step 5: We are now prepared to use the least
squares model. Suppose the insurance company
wants to predict the fire damage if a major
residential fire were to occur 3.5 miles from the
nearest fire station. A 95% confidence interval for
E(y) and prediction interval for y when x = 3.5 are
shown on the Minitab printout on the next slide.
Example
The predicted value (highlighted on the printout) is
\(\hat{y} \approx 27.50\) thousand dollars, while the 95% prediction interval
(also highlighted) is (22.3239, 32.6672).
Therefore, with 95% confidence we predict fire
damage in a major residential fire 3.5 miles from
the nearest station to be between $22,324 and
$32,667.
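The point prediction behind this interval follows directly from the fitted coefficients on the DAMAGE versus DISTANCE printout. The sketch below plugs x = 3.5 into the least squares line and checks that the result sits at the centre of the quoted prediction interval, as it should for a symmetric interval. Variable names are illustrative.

```python
b0_hat = 10.2779   # Constant, from the printout (thousands of dollars)
b1_hat = 4.9193    # DISTANCE coefficient, from the printout
distance = 3.5     # miles from the nearest fire station

# Point prediction from the least squares line.
y_hat = b0_hat + b1_hat * distance

# 95% prediction interval quoted on the slide.
pi_low, pi_high = 22.3239, 32.6672
midpoint = (pi_low + pi_high) / 2
half_width = (pi_high - pi_low) / 2

print(f"predicted damage = {y_hat:.3f} thousand dollars")
print(f"PI midpoint = {midpoint:.3f}, half-width = {half_width:.3f}")
```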
Key Ideas
Simple Linear Regression Variables
y = Dependent variable (quantitative)
x = Independent variable (quantitative)
Key Ideas
Practical Interpretation of y-intercept
predicted y value when x = 0
(no practical interpretation if x = 0 is either nonsensical
or outside range of sample data)
Key Ideas
First-Order (Straight Line) Model
\(E(y) = \beta_0 + \beta_1 x\)
Key Ideas
Coefficient of Correlation, r
1. Ranges between -1 and +1
2. Measures strength of linear relationship between y
and x
Coefficient of Determination, \(r^2\)
1. Ranges between 0 and 1
2. Measures proportion of sample variation in y
explained by the model
Key Ideas
Practical Interpretation of Model
Standard Deviation, s
Approximately 95% of y-values fall within 2s of
their respective predicted values
Width of confidence interval for E(y) will
always be narrower than width of prediction
interval for y