
Lecture #4

REGRESSION
ONE-FACTOR EXPERIMENTAL DESIGN

"Common sense is the collection of prejudices acquired by age eighteen."
---- Albert Einstein (1879 – 1955)
Learning Objectives
2

1. Understand how covariance and correlation are used to assess the strength of the linear relationship between two variables.

2. Understand the difference between deterministic and stochastic (statistical) relationships.

3. Understand how regression analysis can be used to develop an equation that estimates the stochastic (statistical) relationship between two variables.

4. Know how to fit a simple (first order) linear regression model to a sample data set using the least squares method.
Learning Objectives
3

5. Know how to interpret the coefficients in the regression equation.

6. Know how to calculate point estimates of the conditional mean response and point predictions of the response.

7. Be able to determine the goodness of fit of the estimated regression model.

8. Know how to read computer output from the regression analysis.
Terminology
4

Simple Linear Regression
 Covariance, Correlation
 Predictor (Independent) Variable
 Response (Dependent) Variable
 Conditional Mean, Residual
 Slope, Intercept
 Sum of Squares Total, Sum of Squares Error, Sum of Squares Regression
 Goodness of Fit, Coefficient of Determination (R²), Standard Error of the Estimate
Why Study Regression?

Choosing a tool by data type:

Y, Response variable:
  Continuous (output has a mean and variance)  |  Discrete (output is a proportion, i.e., 15 out of 50 or 30%)

X, Input(s):

  Continuous:   - Correlation                       - Logistic Regression
                - Regression*

  Discrete:     - T-test:                           - Proportions:
                  - Paired                            - One Proportion
                  - One Sample                        - Two Proportions
                  - Two Sample (equal var. or
                    unequal var.)
                - ANOVA (an X with 2 or more        - Chi-Sq (more than 2 proportions)
                  means, or 2 or more X's being
                  investigated)

Continuous: data that can be subdivided into smaller and smaller increments, i.e., Time: weeks, hours, minutes, seconds, etc.
Discrete: data that falls into distinct categories, i.e., Gender (Male or Female), Day of Week (Mon, Tues, etc.), University (OSU, MSU, etc.)

* Regression is a tool that uses a one-sample t hypothesis test, constant = 0, to test for significance.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Describing Relationships
7

TWO QUANTITATIVE
VARIABLES
Scatter Plot
8

 Scatterplot

[Scatterplot of Coffee vs Temperature: coffees sold (0 to 70,000) plotted against temperature (-10 to 70), with reference lines at x̄ = 35.08 and ȳ = 29,913.]

 Correlation equation:

r = \frac{s_{xy}}{s_x s_y} = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

"r" ranges from -1 to +1. r is a sample statistic that estimates the population parameter, ρ.

Strength: how closely the points follow a straight line.

Direction: r is positive when individuals with higher X values tend to have higher values of Y.
9
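For concreteness, the formula above is easy to implement; here is a minimal Python/NumPy sketch. The paired arrays are illustrative placeholders, not the coffee-sales data behind the plot.

```python
import numpy as np

# Illustrative paired observations (placeholders, not the coffee data)
x = np.array([10.0, 18.0, 25.0, 33.0, 41.0, 55.0])
y = np.array([52.0, 47.0, 40.0, 33.0, 28.0, 15.0])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # standardized deviations of x
zy = (y - y.mean()) / y.std(ddof=1)   # standardized deviations of y

r = np.sum(zx * zy) / (n - 1)         # the correlation formula on this slide
print(f"r = {r:.3f}")                 # agrees with np.corrcoef(x, y)[0, 1]
```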
Application
10

A major newspaper printing facility is interested in understanding those factors (inputs) that affect the quality of the printing produced by their equipment (output).

Important inputs?
Regression Analysis
11

Regression Analysis is a technique used to build an equation that can be used to estimate or predict the value of one variable by using its relationship with one or more other variables, i.e., to describe the "tendency" or the "FIT" with an equation:

 A response (dependent) variable is predicted or estimated: Y

 One or more predictor or explanatory (independent) variables: X1, X2, . . ., Xp

 y = f(x1, x2, . . ., xp)

 E.g., y = a + bx1 + cx2
Simple Linear Regression
12

 Use 1 predictor (input)

 Relationship is first order

 Straight line: y = a + bx
Deterministic Relationship
13

Suppose for every unit of a particular product sold, profit is exactly $5: Y = $5X.

[Plot: Profit (0 to 60) versus Units Sold (0 to 11); the points fall exactly on a straight line through the origin.]
The First Order Deterministic Equation
14

The value of Y is perfectly determined by the value of X:

y = a + bx

Slope = b = Δy/Δx
Y-intercept = a
Stochastic Relationship
15

Uncertainty in inputs (labor, raw materials, time, etc.) creates uncertainty in profit:

[Plot: Profit (0 to 60) versus Units Sold (0 to 12); the points scatter around an increasing trend rather than falling exactly on a line.]
The First Order Stochastic Equation
16

There is a recognizable relationship between X and Y (a pattern), f(y|x), but it is not perfect:

y_i = a + b x_i + e_i

[Figure: the conditional distributions f(y|x) at x1 and x2, each centered on the fitted line ŷ = a + bx.]
Exercise
17

Which of the following statements concerning the coefficient of simple linear correlation, r, is true?
a. r = 0.00 represents the absence of a relationship
b. The relationship between the two variables must be nonlinear
c. r = 0.76 has the same predictive power as r = -0.76
d. If r is negative, as one variable increases, the other will increase
e. Both a. and c. are true
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Application
19

SIMPLE LINEAR REGRESSION

Mini-Case: Industrial Printing Equipment
20

A Japanese manufacturing firm interested in expanding its production of heavy-duty printing equipment conducted research to help identify the determinants of increased market share. A detailed study of costs and benefits was conducted. Reverse engineering, benchmarking, and customer surveys were used to establish the database.

Definitions:
 Market Share: Percentage of worldwide market share as determined by an independent research firm.

 Cost: Combined depreciation, operating, and maintenance costs.

 Benefit: Product of 3 factors:
Expected profit from the press × production speed × expected lifespan of press.

21
The company hopes to better understand market share by considering how it relates to the costs and benefits of the additional press:

 The response variable, also called the dependent or criterion variable, is market share. This is Y.

 The predictor variable, also called the explanatory or independent variable, is the ratio benefit/cost. This is (an) X.

22
The Data
23

Company              1     2     3     4     5     6
Market Share (%)     11    16    31    5     13    24
Benefit/Cost Ratio   1.8   2.3   2.9   1.6   2.0   2.5

Data Source: Shigeru Mizuno, Company-Wide Quality Control, APO, 1988.
The Scatter Plot
24

Conclusions:

[Scatter Plot for the Printing Industry: Y = Market Share (%), 0 to 35, against X = Benefit/Cost Ratio, 0 to 3; the six points, e.g., (x5 = 2.0, y5 = 13), rise steadily with the ratio.]
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Key Components
26

SIMPLE LINEAR REGRESSION

The Population Regression Model
27

The first order stochastic model that will be used to represent the relationship between the dependent variable Y and the independent variable X:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

Data = Fit + Scatter, where the Fit is the conditional mean μY/x = β0 + β1Xi.

Yi = The observed (true) value of the dependent variable for observation i.
The Population Regression Model Parameters
28

β0 = The Y-axis intercept parameter of the regression line.
Some texts use the notation α for the intercept term; use them interchangeably.

β1 = The slope parameter of the regression line.
Some texts use the notation β without the subscript for the special case of simple regression.

εi = The error or deviation term of the actual Yi value from the population regression model.
The Least Squares Regression Equation
30

\hat{Y}_i = b_0 + b_1 X_i

Ŷi = The estimated mean value and the predicted value of the dependent variable. Do not confuse this value with Yi.

b0 = The statistical estimator for β0.

b1 = The statistical estimator for β1.

e_i = Y_i - \hat{Y}_i = The difference between the predicted value and the true value. It is called the residual or "error".
Hypothesis Test
31

Whenever a statistic is used to estimate a parameter, E{bi} = βi, we want to test to determine if the value is significant:

H0: βi = 0
Ha: βi ≠ 0
The Residuals
32

[Scatterplot of y vs x (x from 1.50 to 3.00) with the fitted line: the vertical distance between an observed point yi and the fitted value ŷi on the line is the residual.]

e_i = y_i - \hat{y}_i
The Least Squares Estimates
33

To determine the values for b0 and b1, we minimize the sum of squared errors:

\min \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \min \sum_{i=1}^{n} e_i^2 = \min (e_1^2 + e_2^2 + \cdots + e_n^2)

Since \hat{Y}_i = b_0 + b_1 X_i:

\min \sum_{i=1}^{n} (Y_i - (b_0 + b_1 X_i))^2 = \min \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2

This equation is called the Least Squares Criterion.

The derivation of the equations for b0 and b1 is left as a homework assignment for the students.

The formulas for b0 and b1 are guaranteed to minimize the sum of squared deviations of the predicted values from the observed values of Yi. This is a logical criterion.

Consider the value of the criterion if all predictions were exactly correct: it would equal zero.

34
Formulas for the sample slope and intercept
35

b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{Cov(x, y)}{s_X^2} = r\,\frac{s_y}{s_x}  and  b_0 = \bar{y} - b_1 \bar{x}

where

Cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}, \quad s_X^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, \quad s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}

Computational equations:

b_1 = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
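The hand computations on the next slide can be checked with a minimal Python sketch (assuming only NumPy), using the six-company data given earlier:

```python
import numpy as np

# Mini-case data from the slides (6 companies)
x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])   # Benefit/Cost ratio
y = np.array([11, 16, 31, 5, 13, 24], float)   # Market share (%)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

cov_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)   # sample covariance
s2_x = np.sum((x - x_bar) ** 2) / (n - 1)              # sample variance of x
s2_y = np.sum((y - y_bar) ** 2) / (n - 1)              # sample variance of y

b1 = cov_xy / s2_x                    # slope:       ~19.303
b0 = y_bar - b1 * x_bar               # intercept:   ~-25.479
r = cov_xy / np.sqrt(s2_x * s2_y)     # correlation: ~0.985

print(f"b1 = {b1:.5f}, b0 = {b0:.5f}, r = {r:.3f}")
```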
Computations for Our Case
36

i     xi     yi      (xi - x̄)   (yi - ȳ)    (xi - x̄)(yi - ȳ)   (xi - x̄)²   (yi - ȳ)²
1     1.8    11.0    -0.3833    -5.6667      2.1720             0.1469       32.1111
2     2.3    16.0     0.1167    -0.6667     -0.0778             0.0136        0.4444
3     2.9    31.0     0.7167    14.3333     10.2725             0.5136      205.4444
4     1.6     5.0    -0.5833   -11.6667      6.8052             0.3403      136.1111
5     2.0    13.0    -0.1833    -3.6667      0.6721             0.0336       13.4444
6     2.5    24.0     0.3167     7.3333      2.3225             0.1003       53.7778
Sum   13.1   100.0    0.0000     0.0000     22.1665             1.14833     441.3332

x̄ = 13.1/6 = 2.1833        ȳ = 100/6 = 16.6667
s_xy = 22.1665/5 = 4.4333   s_x² = 1.14833/5 = 0.22967   s_y² = 441.3332/5 = 88.2666

r = \frac{4.4333}{\sqrt{0.22967 \times 88.2666}} = 0.985
Exercise
37

Correlation Table (Data Set #1):
              Ben/Cost   MktShr
Ben/Cost      1.000      0.985
MktShr        0.985      1.000

Covariance Table (Data Set #1):
              Ben/Cost   MktShr
Ben/Cost      0.2297     4.4333
MktShr        4.4333     88.2667

One Variable Summary (Data Set #1):
              Ben/Cost   MktShr
Mean          2.1833     16.667
Variance      0.2297     88.267
Std. Dev.     0.4792     9.395
Minimum       1.6000     5.000
Maximum       2.9000     31.000
Count         6          6

b1 =          b0 =
Help From Technology
38

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531

The regression equation:
Estimating the Conditional Mean of Y
39

For benefit/cost = 1.92, the estimate of the mean market share for all companies with that benefit/cost ratio, μ̂Y/x=1.92, is:

ŷ = − 25.47896 + 19.30334 (1.92) = 11.58%
Interpreting the Coefficients
40

b1: a one-unit increase in the benefit/cost ratio is expected, or predicted, to be accompanied by or associated with an increase in market share of 19.3 percentage points.

b0: a market share of -25.479% is expected, or predicted, when benefit/cost is 0.

Beware of extrapolation!!!!!
Extrapolation can occur in both directions and can have very serious consequences:

[Slides 41-42: illustrations of extrapolation beyond the observed range of x.]
Exercise
43

In the linear regression model Yi = β0 + β1Xi + εi, the terms β0 and β1 represent:

a. The Y intercept and the slope of the population regression line
b. The Y intercept and the slope of the sample regression line
c. Statistics used to estimate the parameters b0 and b1
d. Parameters used to estimate the statistics b0 and b1
e. None of the other choices is correct.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Evaluating the Goodness of Fit
45

HOW ACCURATE WILL ESTIMATES/PREDICTIONS BE?
Total Variation
46

Considering the sample values of Market Share without the information provided by Ben/Cost:

Market Shares: 11 16 31 5 13 24
ȳ = 16.6667, s_y² = 88.26667

s_Y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{\sum (y_i - 16.6667)^2}{6-1} = \frac{441.3334}{5} = \frac{SS_{Total}}{df_T}

Note that SS_Total = (n - 1) s_Y².
Partitioning the SSTotal
47

[Figure: at the point (2.50, 24%), the total deviation 24 - 16.67 = 7.33 splits into the explained part ŷ - ȳ = 22.78 - 16.67 = 6.11 and the unexplained part y - ŷ = 24 - 22.78 = 1.22.]

SSTotal (SST) = \sum (y_i - \bar{y})^2 = 441.33

SSE = Sum of Squares Error = SS Unexplained = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2 = 13.44

SSR = Sum of Squares Regression = SS Explained = \sum (\hat{y}_i - \bar{y})^2 = 427.89

SSTotal = SSR + SSE

48
Help From Technology
49

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
The Coefficient of Determination, R2
50

R² measures explained variation relative to the total:

R^2 = \frac{SSR}{SST}

What if SSE = 0.0? Then SSR = SST and R² = 1: the model explains everything.

What if SSR = 0.0? Then R² = 0: the model explains nothing.

Then, 0 ≤ R² ≤ 1, and as R² increases from 0 toward 1, the model accounts for more of the variation.

In general, R² can be interpreted as the proportion of the variation in the response variable (y) that can be accounted for (predicted, explained) by the model.

51
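Continuing the NumPy sketch from earlier, the partition and R² for the mini-case can be checked directly (b0 and b1 are the estimates computed above):

```python
import numpy as np

x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])
y = np.array([11, 16, 31, 5, 13, 24], float)

# Fitted values from the estimated equation
b0, b1 = -25.47896, 19.30334
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation:      ~441.33
sse = np.sum((y - y_hat) ** 2)         # unexplained (error):  ~13.44
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained:            ~427.89

r_squared = ssr / sst                  # ~0.9695
print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}  R^2={r_squared:.4f}")
```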
The Standard Error of the Model, s
52

We can measure the accuracy of prediction for a regression equation by examining the residuals.

Error variance is used to measure how closely the points cluster around the regression line.

Error variance in a regression refers to the variance of the probability distribution of y for each given value of x, i.e., the variance of the conditional probability distribution of y given x:

s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{df_E}

The dfE = n - 2 only in simple linear regression. In general, dfE = n - (p* + 1).
* Where p is the number of variables you're regressing with.

s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SSE}{n-2}} = \sqrt{MSE}

53
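In code, using the SSE from the partition above, this is a quick check of the 1.8332 reported in the software output:

```python
import numpy as np

sse, n = 13.4427, 6        # SSE and sample size for the mini-case
mse = sse / (n - 2)        # error variance estimate: ~3.3607
s = np.sqrt(mse)           # standard error of the model: ~1.8332
print(f"MSE = {mse:.4f}, s = {s:.4f}")
```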
Interpretation of s
 Compare it to the values of the response variable:
Range: 5 to 31
ȳ = 16.6667
s_Y² = 88.2667

 Considering the implication of the Empirical Rule:
In our example, then, the residuals (errors) for predicting market share should be no larger than 2(1.8332) ≈ 3.67 percentage points 95% of the time.

54
Help From Technology
55

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
Calculating SSE
56

Recall that ŷi = − 25.47896 + 19.30334 xi. Substituting each xi gives the fitted values; squaring and summing the residuals ei = yi − ŷi yields SSE = 13.44, matching the "Unexplained" sum of squares in the output.
Exercise
57

Suppose the line ŷ = -0.5 + 2x has been fitted to the (x, y) data points (4, 8), (2, 5), and (1, 1). What is the value of the SSE for these points if the line was used to predict y?

a. 0
b. 24.667
c. 1.5
d. 2.75
Exercise
58

If the linear relationship between two variables, X and Y, is deterministic, a least squares regression will result in which of the following?
I. Standard error of the estimate equal to 1
II. SSE = SST
III. Coefficient of Determination = 1

a. Only I is true.
b. Only II is true.
c. Only III is true.
d. II and III are both true.
e. I, II, and III are all true.
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Required Data Conditions for Inference: εi iid N(0, σ)
60

SIMPLE LINEAR REGRESSION
1. Normality
61

The εi terms are Normally distributed. This has a direct implication for the probability distribution of yi. The xi are considered to be constants in the model, as are β0 and β1. Then, yi is a simple linear function of εi:

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,

so Y is also Normally distributed.
2. Homogeneity of Variance
62

The value of σε² is the same for all values of x. This implies that the variability of each distribution for y is the same regardless of the x value, as depicted earlier.
3. Independence
63

The εi terms are independent. Thus, the value of an error term in the model is not related to any other observation in the data set. This, in turn, implies that the yi values are independent.
4. Unbiasedness
64

E(εi) = 0. This indicates that the model will be unbiased. Thus, we have:

E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = E(\beta_0 + \beta_1 x) + E(\varepsilon) = \mu_{Y/x} + 0

That is, the expected value of yi is the population regression equation.
In Addition, Fit the Best Model
65

[Figure: two models fit to the same data; one fit gives R-sq = 0.835, S = 0.9589, while the other gives R-sq = 0.981, S = 0.3308 and fits better.]
Verifying the Required Conditions
66
Residual Analysis
67

The error terms (εi) cannot be observed directly. We can verify that they meet the requirements by observing the residuals.

A variety of "residual plots" can be investigated to subjectively assess whether or not the assumptions have been satisfied.

Recall, ŷi = − 25.47896 + 19.30334 xi:

xi     yi    ŷi      ei = yi − ŷi
1.8    11    9.27     1.73
2.3    16    18.92   -2.92
2.9    31    30.50    0.50
1.6    5     5.41    -0.41
2.0    13    13.13   -0.13
2.5    24    22.78    1.22

Use these for residual analysis.

68
Check the assumption of Normality:
 look at a Normal probability plot of the residuals
 conduct a chi-square goodness of fit test (Normality test) on the residuals

Check other assumptions with residual plots (a sketch of these diagnostics follows below):
 Plot the residuals against the fitted values
 Plot the residuals against the order of observation
 Plot the residuals against the observed values of x.

69
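A minimal matplotlib/SciPy sketch of those checks, using the mini-case x, y, and fitted values from the earlier sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1.8, 2.3, 2.9, 1.6, 2.0, 2.5])
y = np.array([11, 16, 31, 5, 13, 24], float)
y_hat = -25.47896 + 19.30334 * x
e = y - y_hat                                  # residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Normal probability (Q-Q) plot of the residuals
stats.probplot(e, dist="norm", plot=axes[0])
axes[0].set_title("Normal probability plot")

# Residuals vs fitted values: look for random scatter about 0
axes[1].scatter(y_hat, e)
axes[1].axhline(0, linestyle="--")
axes[1].set_xlabel("Fitted value"); axes[1].set_ylabel("Residual")

# Residuals vs observation order: look for time-related patterns
axes[2].plot(np.arange(1, len(e) + 1), e, marker="o")
axes[2].axhline(0, linestyle="--")
axes[2].set_xlabel("Observation order"); axes[2].set_ylabel("Residual")

plt.tight_layout()
plt.show()
```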
Healthy Residual Plot
70

 a random scattering of points (no patterns)
 balanced distribution of points above and below the "0" centerline
 constant (vertical) width
 Normal spread
 No outliers

[Residuals Versus the Fitted Values: standardized residuals scattered between -2 and 2 with no pattern across fitted values 14.0 to 16.0.]
“Non-healthy” Residual Conditions
71-74

[Slides 71-74: example residual plots illustrating unhealthy patterns.]
Example of Residual Analysis
Verify εi iid N(0, σ), no outliers, appropriate model
75

[Residual Plots for Y, four-in-one: normal probability plot of the residuals, residuals versus the fitted values (800 to 1600), histogram of the residuals, and residuals versus the order of the data (observations 1 to 100); standardized residuals all fall within about ±2.]
Exercise
76

What do the following residual plots for a regression of Y on X indicate?

[Residual Plots for Y, four-in-one: normal probability plot, residuals versus fits (residuals from -200 to 500 against the fitted values), histogram, and residuals versus order (about 220 observations).]

a. A linear model was fit to a curvilinear pattern
b. The mean of the error term is not 0.
c. The error term lacks homogeneity of variance
d. None of the above, the plots look good
Inference in Regression
77

Testing Model Significance
78

 The research question: Does the model being used provide significant explanation of the variability in the response variable?

Source of     Degrees of   Sum of    Mean Square   F
Variation     Freedom      Squares   (Variance)
Regression    1            SSR       MSR           MSR/MSE
Error         n - 2        SSE       MSE
Total         n - 1        SST
79

Expected value of F:


E (F )  E
MSR 

 2
  1
2
 ( xi  x ) 2
1
 MSE  
2

 2
When H0: β1 = 0 is true, E ( F )  2  1.0

F-ratio has an F sampling distribution with:
numerator df = dfR = p
denominator df = dfE = n – p – 1
Testing Significance of a Regression Model:
The F-Test
80

(1) H0: β1 = 0.0 vs. Ha: β1 ≠ 0.0

(2) Test Statistic: F-Ratio, with sampling distribution F(p, n-p-1)
    p = degrees of freedom for the numerator, SSR
    n - p - 1 = degrees of freedom for the denominator, SSE

(3) α = significance level

(4) Decision rule: if Fobs > F(α, 1, n-2), reject H0

(5) Determine the F-ratio

(6) Decision

(7) Conclusion

[Figure: the F(p, n-p-1) distribution with the upper-tail rejection region of area α = .05.]
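The same decision can be scripted; a sketch with SciPy, using the values from the ANOVA table shown earlier:

```python
from scipy import stats

msr, mse = 427.8907, 3.3607
df_num, df_den = 1, 4

f_obs = msr / mse                              # ~127.32
f_crit = stats.f.ppf(0.95, df_num, df_den)     # ~7.71 at alpha = .05
p_value = stats.f.sf(f_obs, df_num, df_den)    # ~0.0004

print(f"F = {f_obs:.2f}, critical value = {f_crit:.2f}, p = {p_value:.4f}")
print("Reject H0" if f_obs > f_crit else "Do not reject H0")
```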


Significance of the Model for Mini-Case
81

H0: β1 = 0.0
Ha: β1 ≠ 0.0

Summary          Multiple R     R-Square     Adjusted R-Square   StErr of Estimate
                 0.9847         0.9695       0.9619              1.833212381

ANOVA Table      Degrees of     Sum of       Mean of        F-Ratio   p-Value
                 Freedom        Squares      Squares
Explained        1              427.89066    427.8906628    127.323   0.0004
Unexplained      4              13.442671    3.360667634

Regression       Coefficient     Standard      t-Value   p-Value   95% CI Lower    95% CI Upper
Table                            Error
Constant         -25.47895501    3.80931297    -6.6886   0.0026    -36.05530337    -14.90260665
Benefit/Cost     19.30333817     1.71071946    11.2838   0.0004    14.5536195      24.05305685

With α = .05, the critical value is F(.05, 1, 4) = 7.71. Since Fobs = 127.32 > 7.71, reject H0. We conclude that there appears to be a significant statistical relationship between Market Share (y) and the Benefit/Cost Ratio (x) for the customer.
More General Inference about β1
82

The Sampling Distribution of b1:

 If the regression assumptions are true, b1 has a Normal distribution with

\mu_{b_1} = \beta_1  and  \sigma_{b_1} = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}}

 We estimate σb1 with

SE_{b_1} = s_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}} = \sqrt{\frac{MSE}{(n-1) s_X^2}}

 Then, \frac{b_1 - \beta_1}{s_{b_1}} \sim t(df_E).

This means that our usual Normal-based "templates" for creating a confidence interval and for conducting a hypothesis test are valid.
Inference about β1
83

(1) We can test the hypothesis H0: β1 = β, where β is any specified value. The value β = 1 has relevance in Finance applications to assess risk associated with investment opportunities, for example.

 The most common test is for significance, β = 0.0:
H0: β1 = 0
Ha: β1 ≠ 0

(2) We can create confidence interval estimates of β1.
The t-Test about β1
84

Two-tailed:   H0: β1 = β  vs.  Ha: β1 ≠ β   (reject in either tail, area α/2 each)
Upper-tailed: H0: β1 = β  vs.  Ha: β1 > β   (reject in the upper tail, area α)
Lower-tailed: H0: β1 = β  vs.  Ha: β1 < β   (reject in the lower tail, area α)

Test Statistic: t_{obs} = \frac{b_1 - \beta}{s_{b_1}}
Mini-Case: Test of Significance of Ben/Cost
85

H0: β1 = 0.0    α = 0.05
Ha: β1 ≠ 0.0

Decision Rule: Reject H0 if tobs < -2.776 or tobs > 2.776 (area 0.025 in each tail).

Test Statistic: t_{obs} = \frac{b_1 - 0}{s_{b_1}} = \frac{19.303}{\sqrt{3.36 / 1.1483}} = 11.28

Since 11.28 > 2.776, reject H0. It appears that the benefit/cost ratio is a significant variable to predict the market share.
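A sketch of the same test in SciPy, with the numbers taken from the regression output above:

```python
import numpy as np
from scipy import stats

b1 = 19.30334
mse = 3.3607
sxx = 1.14833          # sum of (x_i - x_bar)^2 for the mini-case
df_e = 4               # n - 2 = 6 - 2

se_b1 = np.sqrt(mse / sxx)                   # ~1.7107
t_obs = b1 / se_b1                           # ~11.28
t_crit = stats.t.ppf(0.975, df_e)            # ~2.776
p_value = 2 * stats.t.sf(abs(t_obs), df_e)   # ~0.0004

print(f"SE(b1)={se_b1:.4f}, t={t_obs:.2f}, critical={t_crit:.3f}, p={p_value:.4f}")
```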
Help From Technology
86

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531

Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Confidence & Prediction Intervals
88

SIMPLE LINEAR REGRESSION

(1-α)% Confidence Interval Estimate of β1
89

Point estimate ± margin:

b_1 \pm (t_{\alpha/2, df_E})\, s_{b_1}, \quad s_{b_1} = SE_{b_1} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}} = \sqrt{\frac{MSE}{(n-1) s_x^2}}

Summary          Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                 0.9847        0.9695      0.9619               1.8332

ANOVA Table      Degrees of    Sum of      Mean of     F-Ratio    p-Value
                 Freedom       Squares     Squares
Explained        1             427.8907    427.8907    127.3231   0.0004
Unexplained      4             13.4427     3.3607

Regression       Coefficient   Standard    t-Value    p-Value    95% CI      95% CI
Table                          Error                             Lower       Upper
Constant         -25.47896     3.8093      -6.6886    0.0026     -36.0553    -14.9026
Benefit/Cost     19.30334      1.7107      11.2838    0.0004     14.5536     24.0531
(1 – α)% CI Estimate of μY/X
90

Point estimate ± margin:

\hat{\mu} \pm t_{\alpha/2,\, n-p-1}\, s_{\hat{\mu}} = \hat{y} \pm t_{\alpha/2,\, n-p-1} \sqrt{MSE \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}

95% CI Estimate of μY/X = 1.92
91

 Point Estimate: ŷ = -25.479 + 19.303(1.92) = 11.58

 Critical Value: t* = tα/2, dfE = 2.776

 Standard Error:

s_{\hat{\mu}} = \sqrt{MSE \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]} = \sqrt{3.36 \left[ \frac{1}{6} + \frac{(1.92 - 2.1833)^2}{(5)(0.2297)} \right]} = 0.8733

 CI: 11.58 ± (2.776)(0.8733) = 11.58 ± 2.42 = 9.16% to 14.00%
Prediction Interval (PI) for Y
92

A special type of interval is called a Prediction Interval. This type of interval attempts to predict the range of possible values for the next observed value of the dependent variable (y). The prediction depends on a specified value of the independent variable (x).

Although the process looks similar to that of a confidence interval, prediction intervals have a very different purpose. Rather than attempting to capture a parameter value as confidence intervals do, prediction intervals try to capture the range of an individual value.
(1 – α)% Prediction Interval for Y
93

\hat{y} \pm t_{\alpha/2,\, n-p-1}\, s_{\hat{y}} = \hat{y} \pm t_{\alpha/2,\, n-p-1} \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]}

95% PI of Y when x = 1.92
94

 Point Estimate: ŷ = -25.479 + 19.303(1.92) = 11.58

 Critical Value: t* = tα/2, dfE = 2.776

 Standard Error:

s_{\hat{y}} = \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right]} = \sqrt{s^2 + (s_{\hat{\mu}})^2} = \sqrt{3.36 + (0.8733)^2} = 2.03

 PI: 11.58 ± (2.776)(2.03) = 11.58 ± 5.64 = 5.94% to 17.22%
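Both intervals fit in a short SciPy sketch, assuming the mini-case quantities computed earlier:

```python
import numpy as np
from scipy import stats

x_star, x_bar = 1.92, 2.1833
n, df_e = 6, 4
mse, sxx = 3.3607, 1.14833             # MSE and sum of (x_i - x_bar)^2

y_hat = -25.47896 + 19.30334 * x_star  # ~11.58
t_crit = stats.t.ppf(0.975, df_e)      # ~2.776

leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = np.sqrt(mse * leverage)        # ~0.873 (CI for the conditional mean)
se_pred = np.sqrt(mse * (1 + leverage))  # ~2.03  (PI for a new observation)

print(f"95% CI: {y_hat:.2f} +/- {t_crit * se_mean:.2f}")
print(f"95% PI: {y_hat:.2f} +/- {t_crit * se_pred:.2f}")
```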
Graphical Comparison: Prediction vs. Confidence Intervals
95

[Figure: confidence and prediction bands plotted around the fitted line; the prediction band lies outside the confidence band, and both are narrowest at x̄.]
Overview

 Correlation Coefficient
 Simple Linear Regression (SLR) Example
 Components of SLR
 Goodness of Fit
 Regression Requirements
 Confidence & Prediction Intervals
 Example
Example
97

SIMPLE LINEAR REGRESSION

Does Amount of Time Studying Affect Your Grade?
98

A student told me that he studied over 10 hours every week on my course, more than any other course, so he deserved a grade higher than a B. Does the amount of time one studies for a class affect the grade in the class?

 What is the response variable?
 What is the predictor/regressor/factor?
 Is the response variable continuous or discrete?
Does Amount of Time Studying Affect Your Grade?
99

The following scale was used as the response variable:

E    D    D+   C-   C    C+   B-   B    B+   A-   A
0.0  1.0  1.3  1.7  2.0  2.3  2.7  3.0  3.3  3.7  4.0

While this is discrete, one can take liberties and treat it as continuous if the response variable has at least 9 levels; here we have 11 levels.
Does Amount of Time Studying Affect Your Grade?
100

A survey was conducted, and the following data were collected.

Was this a CRD or an observational study?
Does Amount of Time Studying Affect Your Grade?
101

Entering the data into Minitab (16), select:
Stat > Regression > Regression

Does Amount of Time Studying Affect Your Grade?
102

 Enter "Grade" for Response
 Enter "Study Hours" as the Predictors
 Click on the "Storage" button, select "Standardized residuals", then click on "OK"
 Click on the "Graphs" button, select the "Four in one" option, then click on "OK"
 Finally, click on "OK"
Does Amount of Time Studying Affect Your Grade?
103

In the Session window, the following output will be shown. The regression equation is provided there along with the other statistics:

Grade = 2.3015 + 0.0794 × (Hours)
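For readers working outside Minitab, a hedged Python equivalent with statsmodels is sketched below. The hours and grade arrays are illustrative placeholders only, since the survey data are not reproduced in this extract.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative placeholder values only -- NOT the survey data from the slides
hours = np.array([2, 4, 5, 6, 8, 10, 12, 3, 7, 9], float)
grade = np.array([2.0, 2.7, 2.3, 3.0, 2.7, 3.3, 3.7, 2.3, 3.0, 3.3])

X = sm.add_constant(hours)       # adds the intercept column
model = sm.OLS(grade, X).fit()   # ordinary least squares fit

print(model.summary())           # coefficients, R-sq, s, ANOVA F-test

# Standardized (internally studentized) residuals, analogous to Minitab's SRES
resid_std = model.get_influence().resid_studentized_internal
```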


Does Amount of Time Studying Affect Your Grade?
104

Testing the assumptions….

[Residual Plots for Grade, four-in-one: normal probability plot of the residuals, residuals versus fits (fitted values 2.5 to 4.0), histogram of the residuals, and residuals versus observation order (20 observations); residuals range from roughly -1.0 to 1.0.]
Does Amount of Time Studying Affect Your Grade?
105

 Since we stored the standardized residuals, we can test to see if they're normally distributed:

Stat > Basic Statistics > Normality Test
Does Amount of Time Studying Affect Your Grade?
106

Enter the standardized residuals, SRES1, as the variable in the dialog box and click on "OK".

[Probability Plot of SRES1 (Normal): Mean 0.004307, StDev 1.035, N 20, AD 0.996, P-Value 0.010.]

The null and alternative hypotheses are:
Ho: Data follow a normal distribution
Ha: Data do not follow a normal distribution

Since the p-value is < 0.05, we reject the null and determine that the residuals are not normal.

Time studying alone does not appear to be a good predictor of grade!
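The same Anderson-Darling normality check can be run in SciPy; a sketch assuming the standardized residuals are available as an array (the values below are random stand-ins, not the course data):

```python
import numpy as np
from scipy import stats

# resid_std: standardized residuals from the fitted model (stand-in values only)
resid_std = np.random.default_rng(1).normal(size=20)

result = stats.anderson(resid_std, dist="norm")
print(f"AD statistic = {result.statistic:.3f}")
for cv, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > cv else "fail to reject"
    print(f"  {verdict} normality at the {sig}% significance level")
```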


Summary of Key Learning Points
107

 The correlation coefficient measures linear correlation only
 The method of least squares minimizes the squared error between the fitted line and the observed data
 Goodness of Fit: R², s, and the F-ratio
 Residuals need to be Normal and independent, with equal variances and E{ei} = 0
 CIs are narrower than PIs, and both are narrowest at x̄