
Lecture #4

REGRESSION
ONE-FACTOR EXPERIMENTAL DESIGN

"Common sense is the collection of prejudices acquired by age eighteen."
---- Albert Einstein (1879 – 1955)
Learning Objectives
2

1. Understand how covariance and correlation are used to assess the strength of the linear relationship between two variables.

2. Understand the difference between deterministic and stochastic (statistical) relationships.

3. Understand how regression analysis can be used to develop an equation that estimates the stochastic (statistical) relationship between two variables.

4. Know how to fit a simple (first order) linear regression model to a sample data set using the least squares method.
Learning Objectives
3

5. Know how to interpret the coefficients in the regression equation.

6. Know how to calculate point estimates of the conditional mean response and point predictions of the response.

7. Be able to determine the goodness of fit of the estimated regression model.

8. Know how to read computer output from the regression analysis.
Terminology
4

Simple Linear Regression
 Covariance, Correlation
 Predictor (Independent) Variable
 Response (Dependent) Variable
 Conditional Mean, Residual
 Slope, Intercept
 Sum of Squares Total, Sum of Squares Error, Sum of Squares Regression
 Goodness of Fit, Coefficient of Determination (R²), Standard Error of the Estimate
Why Study Regression?

Choosing a tool by data type:

Y, Response variable:
  Continuous (output has a mean and variance)  |  Discrete (output is a proportion, i.e., 15 out of 50 or 30%)

X, Input(s):

  Continuous:   - Correlation                       - Logistic Regression
                - Regression*

  Discrete:     - T-test:                           - Proportions:
                  - Paired                            - One Proportion
                  - One Sample                        - Two Proportions
                  - Two Sample (equal var. or
                    unequal var.)
                - ANOVA (an X with 2 or more        - Chi-Sq (more than 2 proportions)
                  means, or 2 or more X's being
                  investigated)

Continuous: data that can be subdivided into smaller and smaller increments, i.e., Time: weeks, hours, minutes, seconds, etc.
Discrete: data that falls into distinct categories, i.e., Gender (Male or Female), Day of Week (Mon, Tues, etc.), University (OSU, MSU, etc.)

* Regression is a tool that uses a one-sample t hypothesis test, constant = 0, to test for significance.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Describing Relationships
7

TWO QUANTITATIVE
VARIABLES
Scatter Plot
8

 Scatterplot

[Scatterplot of Coffee vs Temperature: coffees sold (0 to 70,000) plotted against temperature (-10 to 70), with reference lines at x̄ = 35.08 and ȳ = 29,913.]

 Correlation equation:

r = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

"r" ranges from -1 to +1. r is a sample statistic that estimates the population parameter, ρ.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.
9
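For concreteness, the formula above is easy to implement; here is a minimal Python/NumPy sketch. The paired arrays are illustrative placeholders, not the coffee-sales data behind the plot.

```python
import numpy as np

# Illustrative paired observations (placeholders, not the coffee data)
x = np.array([10.0, 18.0, 25.0, 33.0, 41.0, 55.0])
y = np.array([52.0, 47.0, 40.0, 33.0, 28.0, 15.0])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # standardized deviations of x
zy = (y - y.mean()) / y.std(ddof=1)   # standardized deviations of y

r = np.sum(zx * zy) / (n - 1)         # the correlation formula on this slide
print(f"r = {r:.3f}")                 # agrees with np.corrcoef(x, y)[0, 1]
```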
Application
10

A major newspaper printing facility is interested in understanding those factors (inputs) that affect the quality of the printing produced by their equipment (output).

Important inputs?
Regression Analysis
11

Regression Analysis is a technique used to build an equation that can be used to estimate or predict the value of one variable by using its relationship with one or more other variables, i.e., to describe the "tendency" or the "FIT" with an equation:

 A response (dependent) variable is predicted or estimated: Y

 One or more predictor or explanatory (independent) variables: X1, X2, . . ., Xp

 y = f(x1, x2, . . ., xp)

 E.g., y = a + bx1 + cx2
Simple Linear Regression
12

 Use 1 predictor (input)

 Relationship is first order

 Straight line: y = a + bx
Deterministic Relationship
13

Suppose for every unit of a particular product sold, profit is exactly $5: Y = $5X.

[Plot: Profit (0 to 60) versus Units Sold (0 to 11); the points fall exactly on a straight line through the origin.]
The First Order Deterministic Equation
14

The value of Y is perfectly determined by the value of X:

y = a + bx

Slope = b = Δy/Δx
Y-intercept = a
Stochastic Relationship
15

Uncertainty in inputs (labor, raw materials, time, etc.) creates uncertainty in profit:

[Plot: Profit (0 to 60) versus Units Sold (0 to 12); the points scatter around an increasing trend rather than falling exactly on a line.]
The First Order Stochastic Equation
16

There is a recognizable relationship between X and Y (a pattern), f(y|x), but it is not perfect:

y_i = a + b x_i + e_i

[Figure: the conditional distributions f(y|x) at x1 and x2, each centered on the fitted line ŷ = a + bx.]
Exercise
17

Which of the following statements concerning the coefficient of simple linear correlation, r, is true?
a. r = 0.00 represents the absence of a relationship
b. The relationship between the two variables must be nonlinear
c. r = 0.76 has the same predictive power as r = -0.76
d. If r is negative, as one variable increases, the other will increase
e. Both a. and c. are true
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Application
19

SIMPLE LINEAR REGRESSION

Mini-Case: Industrial Printing Equipment
20

A Japanese manufacturing firm interested in expanding its production of heavy-duty printing equipment conducted research to help identify the determinants of increased market share. A detailed study of costs and benefits was conducted. Reverse engineering, benchmarking, and customer surveys were used to establish the database.

Definitions:
 Market Share: Percentage of worldwide market share as determined by an independent research firm.

 Cost: Combined depreciation, operating, and maintenance costs.

 Benefit: Product of 3 factors:
Expected profit from the press × production speed × expected lifespan of press.

21
The company hopes to better understand market share by considering how it relates to the costs and benefits of the additional press:

 The response variable, also called the dependent or criterion variable, is market share. This is Y.

 The predictor variable, also called the explanatory or independent variable, is the ratio benefit/cost. This is (an) X.

22
The Data
23

Company              1     2     3     4     5     6
Market Share (%)     11    16    31    5     13    24
Benefit/Cost Ratio   1.8   2.3   2.9   1.6   2.0   2.5

Data Source: Shigeru Mizuno, Company-Wide Quality Control, APO, 1988.
The Scatter Plot
24

Conclusions:

[Scatter Plot for the Printing Industry: Y = Market Share (%), 0 to 35, against X = Benefit/Cost Ratio, 0 to 3; the six points, e.g., (x5 = 2.0, y5 = 13), rise steadily with the ratio.]
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Key Components
26

SIMPLE LINEAR REGRESSION

The Population Regression Model
27

The first order stochastic model that will be used to represent the relationship between the dependent variable Y and the independent variable X:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

Data = Fit + Scatter, where the Fit is the conditional mean μY/x = β0 + β1Xi.

Yi = The observed (true) value of the dependent variable for observation i.
The Population Regression Model Parameters
28

β0 = The Y-axis intercept parameter of the regression line.
Some texts use the notation α for the intercept term; use them interchangeably.

β1 = The slope parameter of the regression line.
Some texts use the notation β without the subscript for the special case of simple regression.

εi = The error or deviation term of the actual Yi value from the population regression model.
The Least Squares Regression Equation
30

\hat{Y}_i = b_0 + b_1 X_i

Ŷi = The estimated mean value and the predicted value of the dependent variable. Do not confuse this value with Yi.

b0 = The statistical estimator for β0.

b1 = The statistical estimator for β1.

e_i = Y_i - \hat{Y}_i = The difference between the predicted value and the true value. It is called the residual or "error".
Hypothesis Test
31

Whenever a statistic is used to estimate a parameter, E{bi} = βi, we want to test to determine if the value is significant:

H0: βi = 0
Ha: βi ≠ 0
The Residuals
32

[Scatterplot of y vs x (x from 1.50 to 3.00) with the fitted line: the vertical distance between an observed point yi and the fitted value ŷi on the line is the residual.]

e_i = y_i - \hat{y}_i
The Least Squares Estimates
33

To determine the values for b0 and b1, we minimize the sum of squared errors:

\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} e_i^2 = \min (e_1^2 + e_2^2 + \cdots + e_n^2)

Since \hat{Y}_i = b_0 + b_1 X_i:

\min \sum_{i=1}^{n} (Y_i - (b_0 + b_1 X_i))^2 = \min \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2

This equation is called the Least Squares Criterion.

The derivation of the equations for b0 and b1 is left as a homework assignment for the students.

The formulas for b0 and b1 are guaranteed to minimize the sum of squared deviations of the predicted values from the observed values of Yi. This is a logical criterion.

Consider the value of the criterion if all predictions were exactly correct: it would equal zero.

34
Formulas for the sample slope and intercept
35

b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{Cov(x, y)}{s_X^2} = r\,\frac{s_y}{s_x}  and  b_0 = \bar{y} - b_1 \bar{x}

where

Cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}, \quad s_X^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, \quad s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}

Computational equations:

b_1 = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
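The hand computations on the next slide can be checked with a minimal Python sketch (assuming only NumPy), using the six-company data given earlier:

```python
import numpy as np

# Mini-case data from the slides (6 companies)
x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])   # Benefit/Cost ratio
y = np.array([11, 16, 31, 5, 13, 24], float)   # Market share (%)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

cov_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)   # sample covariance
s2_x = np.sum((x - x_bar) ** 2) / (n - 1)              # sample variance of x
s2_y = np.sum((y - y_bar) ** 2) / (n - 1)              # sample variance of y

b1 = cov_xy / s2_x                    # slope:       ~19.303
b0 = y_bar - b1 * x_bar               # intercept:   ~-25.479
r = cov_xy / np.sqrt(s2_x * s2_y)     # correlation: ~0.985

print(f"b1 = {b1:.5f}, b0 = {b0:.5f}, r = {r:.3f}")
```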
Computations for Our Case
36

i     xi     yi      (xi - x̄)   (yi - ȳ)    (xi - x̄)(yi - ȳ)   (xi - x̄)²   (yi - ȳ)²
1     1.8    11.0    -0.3833    -5.6667      2.1720             0.1469       32.1111
2     2.3    16.0     0.1167    -0.6667     -0.0778             0.0136        0.4444
3     2.9    31.0     0.7167    14.3333     10.2725             0.5136      205.4444
4     1.6     5.0    -0.5833   -11.6667      6.8052             0.3403      136.1111
5     2.0    13.0    -0.1833    -3.6667      0.6721             0.0336       13.4444
6     2.5    24.0     0.3167     7.3333      2.3225             0.1003       53.7778
Sum   13.1   100.0    0.0000     0.0000     22.1665             1.14833     441.3332

x̄ = 13.1/6 = 2.1833        ȳ = 100/6 = 16.6667
s_xy = 22.1665/5 = 4.4333   s_x² = 1.14833/5 = 0.22967   s_y² = 441.3332/5 = 88.2666

r = \frac{4.4333}{\sqrt{0.22967 \times 88.2666}} = 0.985
Exercise
37

Correlation Table (Data Set #1):
              Ben/Cost   MktShr
Ben/Cost      1.000      0.985
MktShr        0.985      1.000

Covariance Table (Data Set #1):
              Ben/Cost   MktShr
Ben/Cost      0.2297     4.4333
MktShr        4.4333     88.2667

One Variable Summary (Data Set #1):
              Ben/Cost   MktShr
Mean          2.1833     16.667
Variance      0.2297     88.267
Std. Dev.     0.4792     9.395
Minimum       1.6000     5.000
Maximum       2.9000     31.000
Count         6          6

b1 =          b0 =
Help From Technology
38

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531

The regression equation:
Estimating the Conditional Mean of Y
39

For benefit/cost = 1.92, the estimate of the mean market share for all companies with that benefit/cost ratio, μ̂Y/x=1.92, is:

ŷ = − 25.47896 + 19.30334 (1.92) = 11.58%
Interpreting the Coefficients
40

b1: a one-unit increase in the benefit/cost ratio is expected, or predicted, to be accompanied by or associated with an increase in market share of 19.3 percentage points.

b0: a market share of -25.479% is expected, or predicted, when benefit/cost is 0.

Beware of extrapolation!!!!!
Extrapolation can occur in both directions and can have very serious consequences:

[Slides 41-42: illustrations of extrapolation beyond the observed range of x.]
Exercise
43

In the linear regression model Yi = β0 + β1Xi + εi, the terms β0 and β1 represent:

a. The Y intercept and the slope of the population regression line
b. The Y intercept and the slope of the sample regression line
c. Statistics used to estimate the parameters b0 and b1
d. Parameters used to estimate the statistics b0 and b1
e. None of the other choices is correct.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Evaluating the Goodness of Fit
45

HOW ACCURATE WILL ESTIMATES/PREDICTIONS BE?
Total Variation
46

Considering the sample values of Market Share without the information provided by Ben/Cost:

Market Shares: 11 16 31 5 13 24
ȳ = 16.6667, s_y² = 88.26667

s_Y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{\sum (y_i - 16.6667)^2}{6-1} = \frac{441.3334}{5} = \frac{SS_{Total}}{df_T}

Note that SS_Total = (n - 1) s_Y².
Partitioning the SSTotal
47

[Figure: at the point (2.50, 24%), the total deviation 24 - 16.67 = 7.33 splits into the explained part ŷ - ȳ = 22.78 - 16.67 = 6.11 and the unexplained part y - ŷ = 24 - 22.78 = 1.22.]

SSTotal (SST) = \sum (y_i - \bar{y})^2 = 441.33

SSE = Sum of Squares Error = SS Unexplained = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2 = 13.44

SSR = Sum of Squares Regression = SS Explained = \sum (\hat{y}_i - \bar{y})^2 = 427.89

SSTotal = SSR + SSE

48
Help From Technology
49

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
The Coefficient of Determination, R2
50

R² measures explained variation relative to the total:

R^2 = \frac{SSR}{SST}

What if SSE = 0.0? Then SSR = SST and R² = 1: the model explains everything.

What if SSR = 0.0? Then R² = 0: the model explains nothing.

Then, 0 ≤ R² ≤ 1, and as R² increases from 0 toward 1, the model accounts for more of the variation.

In general, R² can be interpreted as the proportion of the variation in the response variable (y) that can be accounted for (predicted, explained) by the model.

51
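Continuing the NumPy sketch from earlier, the partition and R² for the mini-case can be checked directly (b0 and b1 are the estimates computed above):

```python
import numpy as np

x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])
y = np.array([11, 16, 31, 5, 13, 24], float)

# Fitted values from the estimated equation
b0, b1 = -25.47896, 19.30334
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation:      ~441.33
sse = np.sum((y - y_hat) ** 2)         # unexplained (error):  ~13.44
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained:            ~427.89

r_squared = ssr / sst                  # ~0.9695
print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}  R^2={r_squared:.4f}")
```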
The Standard Error of the Model, s
52

We can measure the accuracy of prediction for a regression equation by examining the residuals.

Error variance is used to measure how closely the points cluster around the regression line.

Error variance in a regression refers to the variance of the probability distribution of y for each given value of x, i.e., the variance of the conditional probability distribution of y given x:

s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{df_E}

The dfE = n - 2 only in simple linear regression. In general, dfE = n - (p* + 1).
* Where p is the number of variables you're regressing with.

s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SSE}{n-2}} = \sqrt{MSE}

53
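In code, using the SSE from the partition above, this is a quick check of the 1.8332 reported in the software output:

```python
import numpy as np

sse, n = 13.4427, 6        # SSE and sample size for the mini-case
mse = sse / (n - 2)        # error variance estimate: ~3.3607
s = np.sqrt(mse)           # standard error of the model: ~1.8332
print(f"MSE = {mse:.4f}, s = {s:.4f}")
```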
Interpretation of s
 Compare it to the values of the response variable:
Range: 5 to 31
ȳ = 16.6667
s_Y² = 88.2667

 Considering the implication of the Empirical Rule:
In our example, then, the residuals (errors) for predicting market share should be no larger than 2(1.8332) ≈ 3.67 percentage points 95% of the time.

54
Help From Technology
55

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
Calculating SSE
56

Recall that ŷi = − 25.47896 + 19.30334 xi. Substituting each xi gives the fitted values; squaring and summing the residuals ei = yi − ŷi yields SSE = 13.44, matching the "Unexplained" sum of squares in the output.
Exercise
57

Suppose the line ŷ = -0.5 + 2x has been fitted to the (x, y) data points (4, 8), (2, 5), and (1, 1). What is the value of the SSE for these points if the line was used to predict y?

a. 0
b. 24.667
c. 1.5
d. 2.75
Exercise
58

If the linear relationship between two variables, X and Y, is deterministic, a least squares regression will result in which of the following?
I. Standard error of the estimate equal to 1
II. SSE = SST
III. Coefficient of Determination = 1

a. Only I is true.
b. Only II is true.
c. Only III is true.
d. II and III are both true.
e. I, II, and III are all true.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Required Data Conditions for Inference: εi iid N(0, σ)
60

SIMPLE LINEAR REGRESSION
1. Normality
61

The εi terms are Normally distributed. This has a direct implication for the probability distribution of yi. The xi are considered to be constants in the model, as are β0 and β1. Then, yi is a simple linear function of εi:

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,

so Y is also Normally distributed.
2. Homogeneity of Variance
62

The value of σε² is the same for all values of x. This implies that the variability of each distribution for y is the same regardless of the x value, as depicted earlier.
3. Independence
63

The εi terms are independent. Thus, the value of an error term in the model is not related to any other observation in the data set. This, in turn, implies that the yi values are independent.
4. Unbiasedness
64

E(εi) = 0. This indicates that the model will be unbiased. Thus, we have:

E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = E(\beta_0 + \beta_1 x) + E(\varepsilon) = \mu_{Y/x} + 0

That is, the expected value of yi is the population regression equation.
In Addition, Fit the Best Model
65

[Figure: two models fit to the same data; one fit gives R-sq = 0.835, S = 0.9589, while the other gives R-sq = 0.981, S = 0.3308 and fits better.]
Verifying the Required Conditions
66
Residual Analysis
67

The error terms (εi) cannot be observed directly. We can verify that they meet the requirements by observing the residuals.

A variety of "residual plots" can be investigated to subjectively assess whether or not the assumptions have been satisfied.

Recall, ŷi = − 25.47896 + 19.30334 xi:

xi     yi    ŷi      ei = yi − ŷi
1.8    11    9.27     1.73
2.3    16    18.92   -2.92
2.9    31    30.50    0.50
1.6    5     5.41    -0.41
2.0    13    13.13   -0.13
2.5    24    22.78    1.22

Use these for residual analysis.

68
Check the assumption of Normality:
 look at a Normal probability plot of the residuals
 conduct a chi-square goodness of fit test (Normality test) on the residuals

Check other assumptions with residual plots (a sketch of these diagnostics follows below):
 Plot the residuals against the fitted values
 Plot the residuals against the order of observation
 Plot the residuals against the observed values of x.

69
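A minimal matplotlib/SciPy sketch of those checks, using the mini-case x, y, and fitted values from the earlier sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])
y = np.array([11, 16, 31, 5, 13, 24], float)
y_hat = -25.47896 + 19.30334 * x
e = y - y_hat                                  # residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Normal probability (Q-Q) plot of the residuals
stats.probplot(e, dist="norm", plot=axes[0])
axes[0].set_title("Normal probability plot")

# Residuals vs fitted values: look for random scatter about 0
axes[1].scatter(y_hat, e)
axes[1].axhline(0, linestyle="--")
axes[1].set_xlabel("Fitted value"); axes[1].set_ylabel("Residual")

# Residuals vs observation order: look for time-related patterns
axes[2].plot(np.arange(1, len(e) + 1), e, marker="o")
axes[2].axhline(0, linestyle="--")
axes[2].set_xlabel("Observation order"); axes[2].set_ylabel("Residual")

plt.tight_layout()
plt.show()
```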
Healthy Residual Plot
70

 a random scattering of points (no patterns)
 balanced distribution of points above and below the "0" centerline
 constant (vertical) width
 Normal spread
 No outliers

[Residuals Versus the Fitted Values: standardized residuals scattered between -2 and 2 with no pattern across fitted values 14.0 to 16.0.]
“Non-healthy” Residual Conditions
71-74

[Slides 71-74: example residual plots illustrating unhealthy patterns.]
Example of Residual Analysis
Verify εi iid N(0, σ), no outliers, appropriate model
75

[Residual Plots for Y, four-in-one: normal probability plot of the residuals, residuals versus the fitted values (800 to 1600), histogram of the residuals, and residuals versus the order of the data (observations 1 to 100); standardized residuals all fall within about ±2.]
Exercise
76

What do the following residual plots for a regression of Y on X indicate?

[Residual Plots for Y, four-in-one: normal probability plot, residuals versus fits (residuals from -200 to 500 against the fitted values), histogram, and residuals versus order (about 220 observations).]

a. A linear model was fit to a curvilinear pattern
b. The mean of the error term is not 0.
c. The error term lacks homogeneity of variance
d. None of the above, the plots look good
Inference in Regression
77

Testing Model Significance
78

 The research question: Does the model being used provide significant explanation of the variability in the response variable?

Source of     Degrees of   Sum of    Mean Square   F
Variation     Freedom      Squares   (Variance)
Regression    1            SSR       MSR           MSR/MSE
Error         n - 2        SSE       MSE
Total         n - 1        SST
79

Expected value of F:


E (F )  E
MSR 

 2
  1
2
 ( xi  x ) 2
1
 MSE  
2

 2
When H0: β1 = 0 is true, E ( F )  2  1.0

F-ratio has an F sampling distribution with:
numerator df = dfR = p
denominator df = dfE = n – p – 1
Testing Significance of a Regression Model:
The F-Test
80

(1) H0: β1 = 0.0 vs. Ha: β1 ≠ 0.0

(2) Test Statistic: F-Ratio, with sampling distribution F(p, n-p-1)
    p = degrees of freedom for the numerator, SSR
    n - p - 1 = degrees of freedom for the denominator, SSE

(3) α = significance level

(4) Decision rule: if Fobs > F(α, 1, n-2), reject H0

(5) Determine the F-ratio

(6) Decision

(7) Conclusion

[Figure: the F(p, n-p-1) distribution with the upper-tail rejection region of area α = .05.]
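The same decision can be scripted; a sketch with SciPy, using the values from the ANOVA table shown earlier:

```python
from scipy import stats

msr, mse = 427.8907, 3.3607
df_num, df_den = 1, 4

f_obs = msr / mse                              # ~127.32
f_crit = stats.f.ppf(0.95, df_num, df_den)     # ~7.71 at alpha = .05
p_value = stats.f.sf(f_obs, df_num, df_den)    # ~0.0004

print(f"F = {f_obs:.2f}, critical value = {f_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if f_obs > f_crit else "Do not reject H0")
```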


Significance of the Model for Mini-Case
81

H0: β1 = 0.0
Ha: β1 ≠ 0.0

Summary          Multiple R     R-Square     Adjusted R-Square   StErr of Estimate
                 0.9847         0.9695       0.9619              1.833212381

ANOVA Table      Degrees of     Sum of       Mean of        F-Ratio   p-Value
                 Freedom        Squares      Squares
Explained        1              427.89066    427.8906628    127.323   0.0004
Unexplained      4              13.442671    3.360667634

Regression       Coefficient     Standard      t-Value   p-Value   95% CI Lower    95% CI Upper
Table                            Error
Constant         -25.47895501    3.80931297    -6.6886   0.0026    -36.05530337    -14.90260665
Benefit/Cost     19.30333817     1.71071946    11.2838   0.0004    14.5536195      24.05305685

With α = .05, the critical value is F(.05, 1, 4) = 7.71. Since Fobs = 127.32 > 7.71, reject H0. We conclude that there appears to be a significant statistical relationship between Market Share (y) and the Benefit/Cost Ratio (x) for the customer.
More General Inference about β1
82

The Sampling Distribution of b1:

 If the regression assumptions are true, b1 has a Normal distribution with

\mu_{b_1} = \beta_1  and  \sigma_{b_1} = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}}

 We estimate σb1 with

SE_{b_1} = s_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}} = \sqrt{\frac{MSE}{(n-1) s_X^2}}

 Then, \frac{b_1 - \beta_1}{s_{b_1}} \sim t(df_E).

This means that our usual Normal-based "templates" for creating a confidence interval and for conducting a hypothesis test are valid.
Inference about β1
83

(1) We can test the hypothesis H0: β1 = β, where β is any specified value. The value β = 1 has relevance in Finance applications to assess risk associated with investment opportunities, for example.

 The most common test is for significance, β = 0.0:
H0: β1 = 0
Ha: β1 ≠ 0

(2) We can create confidence interval estimates of β1.
The t-Test about β1
84

Two-tailed:   H0: β1 = β  vs.  Ha: β1 ≠ β   (reject in either tail, area α/2 each)
Upper-tailed: H0: β1 = β  vs.  Ha: β1 > β   (reject in the upper tail, area α)
Lower-tailed: H0: β1 = β  vs.  Ha: β1 < β   (reject in the lower tail, area α)

Test Statistic: t_{obs} = \frac{b_1 - \beta}{s_{b_1}}
Mini-Case: Test of Significance of Ben/Cost
85

H0: β1 = 0.0    α = 0.05
Ha: β1 ≠ 0.0

Decision Rule: Reject H0 if tobs < -2.776 or tobs > 2.776 (area 0.025 in each tail).

Test Statistic: t_{obs} = \frac{b_1 - 0}{s_{b_1}} = \frac{19.303}{\sqrt{3.36 / 1.1483}} = 11.28

Since 11.28 > 2.776, reject H0. It appears that the benefit/cost ratio is a significant variable to predict the market share.
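A sketch of the same test in SciPy, with the numbers taken from the regression output above:

```python
import numpy as np
from scipy import stats

b1 = 19.30334
mse = 3.3607
sxx = 1.14833          # sum of (x_i - x_bar)^2 for the mini-case
df_e = 4               # n - 2 = 6 - 2

se_b1 = np.sqrt(mse / sxx)                   # ~1.7107
t_obs = b1 / se_b1                           # ~11.28
t_crit = stats.t.ppf(0.975, df_e)            # ~2.776
p_value = 2 * stats.t.sf(abs(t_obs), df_e)   # ~0.0004

print(f"SE(b1)={se_b1:.4f}, t={t_obs:.2f}, critical={t_crit:.3f}, p={p_value:.4f}")
```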
Help From Technology
86

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531

Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Confidence & Prediction Intervals
88

SIMPLE LINEAR REGRESSION

(1-α)% Confidence Interval Estimate of β1
89

Point estimate ± margin:

b_1 \pm (t_{\alpha/2, df_E})\, s_{b_1}, \quad s_{b_1} = SE_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}} = \sqrt{\frac{MSE}{(n-1) s_x^2}}

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
(1 – α)% CI Estimate of μY/X
90

Point estimate ± margin:

\hat{\mu} \pm t_{\alpha/2,\, n-p-1}\, s_{\hat{\mu}} = \hat{y} \pm t_{\alpha/2,\, n-p-1} \sqrt{MSE \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}

95% CI Estimate of μY/X = 1.92
91

 Point Estimate: ŷ = -25.479 + 19.303(1.92) = 11.58

 Critical Value: t* = tα/2, dfE = 2.776

 Standard Error:

s_{\hat{\mu}} = \sqrt{MSE \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]} = \sqrt{3.36 \left[ \frac{1}{6} + \frac{(1.92 - 2.1833)^2}{(5)(0.2297)} \right]} = 0.8733

 CI: 11.58 ± (2.776)(0.8733) = 11.58 ± 2.42 = 9.16% to 14.00%
Prediction Interval (PI) for Y
92

A special type of interval is called a Prediction Interval. This type of interval attempts to predict the range of possible values for the next observed value of the dependent variable (y). The prediction depends on a specified value of the independent variable (x).

Although the process looks similar to that of a confidence interval, prediction intervals have a very different purpose. Rather than attempting to capture a parameter value as confidence intervals do, prediction intervals try to capture the range of an individual value.
(1 – α)% Prediction Interval for Y
93

\hat{y} \pm t_{\alpha/2,\, n-p-1}\, s_{\hat{y}} = \hat{y} \pm t_{\alpha/2,\, n-p-1} \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}

95% PI of Y when x = 1.92
94

 Point Estimate: ŷ = -25.479 + 19.303(1.92) = 11.58

 Critical Value: t* = tα/2, dfE = 2.776

 Standard Error:

s_{\hat{y}} = \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]} = \sqrt{s^2 + (s_{\hat{\mu}})^2} = \sqrt{3.36 + (0.8733)^2} = 2.03

 PI: 11.58 ± (2.776)(2.03) = 11.58 ± 5.64 = 5.94% to 17.22%
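Both intervals fit in a short SciPy sketch, assuming the mini-case quantities computed earlier:

```python
import numpy as np
from scipy import stats

x_star, x_bar = 1.92, 2.1833
n, df_e = 6, 4
mse, sxx = 3.3607, 1.14833             # MSE and sum of (x_i - x_bar)^2

y_hat = -25.47896 + 19.30334 * x_star  # ~11.58
t_crit = stats.t.ppf(0.975, df_e)      # ~2.776

leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = np.sqrt(mse * leverage)        # ~0.873 (CI for the conditional mean)
se_pred = np.sqrt(mse * (1 + leverage))  # ~2.03  (PI for a new observation)

print(f"95% CI: {y_hat:.2f} +/- {t_crit * se_mean:.2f}")
print(f"95% PI: {y_hat:.2f} +/- {t_crit * se_pred:.2f}")
```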
Graphical Comparison: Prediction vs. Confidence Intervals
95

[Figure: confidence and prediction bands plotted around the fitted line; the prediction band lies outside the confidence band, and both are narrowest at x̄.]
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Example
97

SIMPLE LINEAR REGRESSION

Does Amount of Time Studying Affect Your Grade?
98

A student told me that he studied over 10 hours every week on my course, more than any other course, so he deserved a grade higher than a B. Does the amount of time one studies for a class affect the grade in the class?

 What is the response variable?
 What is the predictor/regressor/factor?
 Is the response variable continuous or discrete?
Does Amount of Time Studying Affect Your Grade?
99

The following scale was used as the response variable:

E    D    D+   C-   C    C+   B-   B    B+   A-   A
0.0  1.0  1.3  1.7  2.0  2.3  2.7  3.0  3.3  3.7  4.0

While this is discrete, one can take liberties and treat it as continuous if the response variable has at least 9 levels; here we have 11 levels.
Does Amount of Time Studying Affect Your Grade?
100

A survey was conducted, and the following data were collected.

Was this a CRD or an observational study?
Does Amount of Time Studying Affect Your Grade?
101

Entering the data into Minitab (16), select:
Stat > Regression > Regression

Does Amount of Time Studying Affect Your Grade?
102

 Enter "Grade" for Response
 Enter "Study Hours" as the Predictors
 Click on the "Storage" button, select "Standardized residuals", then click on "OK"
 Click on the "Graphs" button, select the "Four in one" option, then click on "OK"
 Finally, click on "OK"
Does Amount of Time Studying Affect Your Grade?
103

In the Session window, the following output will be shown. The regression equation is provided there along with the other statistics:

Grade = 2.3015 + 0.0794 × (Hours)
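For readers working outside Minitab, a hedged Python equivalent with statsmodels is sketched below. The hours and grade arrays are illustrative placeholders only, since the survey data are not reproduced in this extract.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative placeholder values only -- NOT the survey data from the slides
hours = np.array([2, 4, 5, 6, 8, 10, 12, 3, 7, 9], float)
grade = np.array([2.0, 2.7, 2.3, 3.0, 2.7, 3.3, 3.7, 2.3, 3.0, 3.3])

X = sm.add_constant(hours)       # adds the intercept column
model = sm.OLS(grade, X).fit()   # ordinary least squares fit

print(model.summary())           # coefficients, R-sq, s, ANOVA F-test

# Standardized (internally studentized) residuals, analogous to Minitab's SRES
resid_std = model.get_influence().resid_studentized_internal
```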


Does Amount of Time Studying Affect Your Grade?
104

Testing the assumptions….

[Residual Plots for Grade, four-in-one: normal probability plot of the residuals, residuals versus fits (fitted values 2.5 to 4.0), histogram of the residuals, and residuals versus observation order (20 observations); residuals range from roughly -1.0 to 1.0.]
Does Amount of Time Studying Affect Your Grade?
105

 Since we stored the standardized residuals, we can test to see if they're normally distributed:

Stat > Basic Statistics > Normality Test
Does Amount of Time Studying Affect Your Grade?
106

Enter the standardized residuals, SRES1, as the variable in the dialog box and click on "OK".

[Probability Plot of SRES1 (Normal): Mean 0.004307, StDev 1.035, N 20, AD 0.996, P-Value 0.010.]

The null and alternative hypotheses are:
Ho: Data follow a normal distribution
Ha: Data do not follow a normal distribution

Since the p-value is < 0.05, we reject the null and determine that the residuals are not normal.

Time studying alone does not appear to be a good predictor of grade!
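The same Anderson-Darling normality check can be run in SciPy; a sketch assuming the standardized residuals are available as an array (the values below are random stand-ins, not the course data):

```python
import numpy as np
from scipy import stats

# resid_std: standardized residuals from the fitted model (stand-in values only)
resid_std = np.random.default_rng(1).normal(size=20)

result = stats.anderson(resid_std, dist="norm")
print(f"AD statistic = {result.statistic:.3f}")
for cv, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > cv else "fail to reject"
    print(f"  {verdict} normality at the {sig}% significance level")
```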


Summary of Key Learning Points
107

 The correlation coefficient measures linear correlation only
 The method of least squares minimizes the squared error between the fitted line and the observed data
 Goodness of Fit: R², s, and the F-ratio
 Residuals need to be Normal and independent, with equal variances and E{ei} = 0
 CIs are narrower than PIs, and both are narrowest at x̄