
Chapter 17

Simple Linear Regression and Correlation
17.1 Introduction
• In this chapter we employ regression analysis to examine the relationship between quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable, y) based on the values of other variables (the independent variables x1, x2, …, xk).


17.2 The Model
• The first order linear model:

  y = β0 + β1x + ε

  y = dependent variable
  x = independent variable
  β0 = y-intercept
  β1 = slope of the line (= Rise/Run)
  ε = error variable

• β0 and β1 are unknown and are therefore estimated from the data.
17.3 Estimating the Coefficients
• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics,
  – producing a straight line that cuts into the data.
• The question is: which straight line fits best?
• The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

• Let us compare two lines through the four sample points (1,2), (2,4), (3,1.5), (4,3.2); the second line is horizontal at y = 2.5.

  First line (ŷ = x): sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
  Second line (ŷ = 2.5): sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

• The smaller the sum of squared differences, the better the fit of the line to the data.
• To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

  b1 = cov(X, Y) / s_x²
  b0 = ȳ − b1·x̄

• The regression equation that estimates the equation of the first order linear model is:

  ŷ = b0 + b1x
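As an illustrative sketch (not part of the original slides), the two formulas above can be applied directly to the four points from the line-comparison example; the function and variable names here are our own.

```python
# Least squares estimates from the formulas above:
#   b1 = cov(X, Y) / s_x^2,   b0 = ybar - b1 * xbar
# Both cov(X, Y) and s_x^2 use the sample (n - 1) divisor, which cancels in b1.
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x
    b0 = ybar - b1 * xbar
    return b0, b1

# The four illustrative points used in the line-comparison example:
b0, b1 = least_squares([1, 2, 3, 4], [2, 4, 1.5, 3.2])
```

For these four points the least squares line works out to ŷ = 2.4 + 0.11x.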
• Example 17.1: Relationship between odometer reading and a used car's selling price.
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded. (The slide shows a partial table of Car, Odometer, and Price values.)
  – Find the regression line.
  Independent variable: x = odometer reading
  Dependent variable: y = selling price

• Solution
  – Solving by hand
    • To calculate b0 and b1 we need to calculate several statistics first:

      x̄ = 36,009.45;  s_x² = Σ(xᵢ − x̄)² / (n − 1) = 43,528,688
      ȳ = 5,411.41;  cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = −1,356,256

      where n = 100.

      b1 = cov(X, Y) / s_x² = −1,356,256 / 43,528,688 = −.0312
      b0 = ȳ − b1x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533

      ŷ = b0 + b1x = 6,533 − .0312x

  – Using the computer (see file Xm17-01.xls): Tools > Data Analysis > Regression > [select the y range and the x range] > OK
  (The Excel regression output reports Intercept = 6,533 and Odometer coefficient = −.0312.)

  ŷ = 6,533 − .0312x

  The intercept is b0 = 6,533; the slope is b1 = −.0312.
  For each additional mile on the odometer, the price decreases by an average of $0.0312.
  Do not interpret the intercept as the "price of cars that have not been driven".
17.4 Error Variable: Required Conditions
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σ_ε for all values of x.
  – The errors associated with different values of y are all independent.
• From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σ_ε.
  (The figure shows the conditional means E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, and E(y|x3) = β0 + β1x3: the mean value changes with x while the standard deviation remains constant.)
17.5 Assessing the Model
• The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model:
  – Testing and/or estimating the coefficients.
  – Using descriptive measurements.

• Sum of squares for errors (SSE)
  – This is the sum of squared differences between the points and the regression line.
  – It can serve as a measure of how well the line fits the data:

    SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

  – A shortcut formula:

    SSE = (n − 1)[s_y² − cov(X, Y)² / s_x²]

  – This statistic plays a role in every statistical technique we employ to assess the model.

• Standard error of estimate
  – The mean error is equal to zero.
  – If σ_ε is small the errors tend to be close to zero (close to the mean error); then the model fits the data well.
  – Therefore, we can use s_ε as a measure of the suitability of using a linear model.
  – An unbiased estimator of σ_ε² is given by s_ε²:

    Standard error of estimate: s_ε = √(SSE / (n − 2))
• Example 17.2
  – Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.
• Solution

    s_y² = Σ(yᵢ − ȳ)² / (n − 1) = 6,434,890 / 99 = 64,999   (calculated before)

    SSE = (n − 1)[s_y² − cov(X, Y)² / s_x²] = 99[64,999 − (−1,356,256)² / 43,528,688] = 2,251,363

    s_ε = √(SSE / (n − 2)) = √(2,251,363 / 98) = 151.6

  It is hard to assess the model based on s_ε = 151.6 even when compared with the mean value of y, ȳ = 5,411.41.
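The shortcut computation above can be checked numerically. This is a sketch using only the summary statistics quoted in the text; small rounding differences from the slide's figures are expected.

```python
import math

# SSE = (n - 1) * [s_y^2 - cov(X, Y)^2 / s_x^2] and the standard error of
# estimate s_eps = sqrt(SSE / (n - 2)), with Example 17.1's summary statistics.
n = 100
var_y = 64_999          # s_y^2
cov_xy = -1_356_256     # cov(X, Y)
var_x = 43_528_688      # s_x^2

sse = (n - 1) * (var_y - cov_xy ** 2 / var_x)
s_eps = math.sqrt(sse / (n - 2))
```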
• Testing the slope
  – When no linear relationship exists between two variables, the regression line should be horizontal.
  (Two scatter diagrams on the slide contrast the cases:)
  – Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
  – No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.

• We can draw inferences about β1 from b1 by testing
  H0: β1 = 0
  H1: β1 ≠ 0 (or < 0, or > 0)
  – The test statistic is

    t = (b1 − β1) / s_b1, where s_b1 = s_ε / √((n − 1)s_x²)

    s_b1 is the standard error of b1.
  – If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
• Solution
  – Solving by hand: to compute t we need the values of b1 and s_b1.

    b1 = −.0312
    s_b1 = s_ε / √((n − 1)s_x²) = 151.6 / √((99)(43,528,688)) = .00231
    t = (b1 − β1) / s_b1 = (−.0312 − 0) / .00231 = −13.49

  There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
  – Using the computer: the Excel output lists the coefficients with their standard errors, t statistics, and p-values.
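A minimal sketch of the hand calculation above; the rounded inputs quoted in the text reproduce the slide's t value up to rounding.

```python
import math

# t = (b1 - 0) / s_b1 with s_b1 = s_eps / sqrt((n - 1) * s_x^2),
# using the Example 17.1 quantities quoted in the text.
n = 100
b1 = -0.0312
s_eps = 151.6
var_x = 43_528_688

s_b1 = s_eps / math.sqrt((n - 1) * var_x)  # standard error of b1
t = (b1 - 0) / s_b1                        # test statistic for H0: beta1 = 0
```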
• Coefficient of determination
  – When we want to measure the strength of the linear relationship, we use the coefficient of determination, R².
  – To understand the significance of this coefficient, note how the overall variability in y is split between the part accounted for by the regression model and the part due to the error.
• Two data points (x1, y1) and (x2, y2) of a certain sample are shown.

  Total variation in y = Variation explained by the regression line + Unexplained variation (error)
  Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
  Variation in y = SSR + SSE

• R² measures the proportion of the variation in y that is explained by the variation in x:

  R² = 1 − SSE / Σ(yᵢ − ȳ)² = [Σ(yᵢ − ȳ)² − SSE] / Σ(yᵢ − ȳ)² = SSR / Σ(yᵢ − ȳ)²

• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between x and y.

• Example 17.4
  – Find the coefficient of determination for Example 17.1; what does this statistic tell you about the model?
• Solution
  – Solving by hand:

    R² = [cov(X, Y)]² / (s_x² s_y²) = [−1,356,256]² / [(43,528,688)(64,999)] = .6501

  – Using the computer: from the regression output we have

    Multiple R: 0.8063
    R Square: 0.6501
    Adjusted R Square: 0.6466
    Standard Error: 151.57
    Observations: 100

  65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
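The by-hand computation of R² can be reproduced from the same summary statistics (a sketch):

```python
# R^2 = cov(X, Y)^2 / (s_x^2 * s_y^2), the shortcut used in the solution above,
# evaluated with the Example 17.1 summary statistics.
cov_xy = -1_356_256
var_x = 43_528_688
var_y = 64_999

r_squared = cov_xy ** 2 / (var_x * var_y)
```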
17.6 Finance Application: Market Model
• One of the most important applications of linear regression is the market model.
• It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market:

  R = β0 + β1·Rm + ε

  R = rate of return on a particular stock; Rm = rate of return on some major stock index.
• The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
• Example 17.5: The market model
  – Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange (TSE).
  – The data consisted of monthly percentage returns for Nortel and monthly percentage returns for all the stocks on the exchange.
  – The slope estimate is a measure of the stock's market-related risk: for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.
  – R² is a measure of the total risk embedded in the Nortel stock that is market-related. Specifically, 31.37% of the variation in Nortel's return is explained by the variation in the TSE's returns.
17.7 Using the Regression Equation
• Before using the regression model, we need to assess how well it fits the data.
• If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
• Illustration
  – Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1):

    ŷ = 6533 − .0312x = 6533 − .0312(40,000) = 5,285
• Prediction interval and confidence interval
  – Two intervals can be used to discover how closely the predicted value will match the true value of y:
    • Prediction interval: for a particular value of y.
    • Confidence interval: for the expected value of y.
  – The prediction interval:

    ŷ ± t_{α/2} s_ε √(1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

  – The confidence interval:

    ŷ ± t_{α/2} s_ε √(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

  – The prediction interval is wider than the confidence interval.
• Example 17.6: Interval estimates for the car auction price
  – Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
  – Solution
    • The dealer would like to predict the price of a single car, so the prediction interval applies.
    • The 95% prediction interval, with t_{.025,98} = 1.984:

      [6533 − .0312(40,000)] ± 1.984(151.6)√(1 + 1/100 + (40,000 − 36,009)² / 4,309,340,160) = 5,285 ± 303
  – The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.
  – Solution
    • The dealer needs to estimate the mean price per car, so the confidence interval applies.
    • The 95% confidence interval:

      [6533 − .0312(40,000)] ± 1.984(151.6)√(1/100 + (40,000 − 36,009)² / 4,309,340,160) = 5,285 ± 35
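Both interval computations above can be sketched with the quantities quoted in the text (t_{.025,98} rounded to 1.984; small rounding differences from the slide are expected).

```python
import math

# 95% prediction interval (single y) and confidence interval (expected y)
# at x_g = 40,000, using the Example 17.6 quantities.
n = 100
xbar = 36_009
sxx = 4_309_340_160     # sum of (x_i - xbar)^2
s_eps = 151.6
t_crit = 1.984          # t_{.025, 98}
x_g = 40_000

y_hat = 6533 - 0.0312 * x_g
leverage = 1 / n + (x_g - xbar) ** 2 / sxx
half_pred = t_crit * s_eps * math.sqrt(1 + leverage)  # half-width, single car
half_conf = t_crit * s_eps * math.sqrt(leverage)      # half-width, mean price
```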
• The effect of the given value of x on the interval
  – As x_g moves away from x̄ the interval becomes longer; the shortest interval is found at x_g = x̄.
  (The figure plots the confidence interval ŷ ± t_{α/2} s_ε √(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²) at x_g = x̄, x_g = x̄ ± 1, and x_g = x̄ ± 2, showing the interval widening as x_g moves away from x̄.)
17.8 Coefficient of Correlation
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between −1 and 1.
  – If r = −1 (negative association) or r = +1 (positive association), every point falls on the regression line.
  – If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
• Testing the coefficient of correlation
  – When there is no linear relationship between two variables, ρ = 0.
  – The hypotheses are:
    H0: ρ = 0
    H1: ρ ≠ 0
  – The test statistic is:

    t = r √((n − 2) / (1 − r²))

    The statistic has a Student t distribution with d.f. = n − 2, provided the variables are bivariate normally distributed. Here r is the sample coefficient of correlation, calculated by r = cov(X, Y) / (s_x s_y).
• Example 17.7: Testing for linear relationship
  – Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.
• Solution
  – We test H0: ρ = 0 against H1: ρ ≠ 0.
  – Solving by hand:
    • The sample coefficient of correlation: r = cov(X, Y) / (s_x s_y) = −.806
    • The value of the t statistic: t = r √((n − 2) / (1 − r²)) = −13.49
    • The rejection region is |t| > t_{α/2,n−2} = t_{.025,98} = 1.984.
  Conclusion: There is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
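The hand calculation above can be verified from the summary statistics (a sketch):

```python
import math

# r = cov(X, Y) / (s_x * s_y), then t = r * sqrt((n - 2) / (1 - r^2)),
# using the Example 17.1 summary statistics.
n = 100
cov_xy = -1_356_256
var_x = 43_528_688
var_y = 64_999

r = cov_xy / math.sqrt(var_x * var_y)
t = r * math.sqrt((n - 2) / (1 - r ** 2))
```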
• Spearman rank correlation coefficient
  – The Spearman rank test is used to test whether a relationship exists between variables in cases where
    • at least one variable is ranked, or
    • both variables are quantitative but the normality requirement is not satisfied.
  – The hypotheses are:
    • H0: ρs = 0
    • H1: ρs ≠ 0
  – The test statistic is

    rs = cov(a, b) / (s_a s_b)

    where a and b are the ranks of the data.
  – For a large sample (n > 30) rs is approximately normally distributed:

    z = rs √(n − 1)
• Example 17.8
  – A production manager wants to examine the relationship between
    • aptitude test score given prior to hiring, and
    • performance rating three months after starting work.
  – A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.
  – Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.
• Solution
  – The problem objective is to analyze the relationship between two variables. The performance rating is ranked, so the hypotheses and test statistic are those of the Spearman rank correlation test.
  – The hypotheses are:
    • H0: ρs = 0
    • H1: ρs ≠ 0
  – The test statistic is rs, and the rejection region is |rs| > r_critical (taken from the Spearman rank correlation table).
  (The slide lists the data with their ranks; for example, employee 1 scored 59 on the aptitude test with rank 9 and received performance rating 3 with rank 10.5. Ties are broken by averaging the ranks.)
  – Solving by hand:
    • Rank each variable separately.
    • Calculate s_a = 5.92, s_b = 5.50, and cov(a, b) = 12.34.
    • Thus rs = cov(a, b) / (s_a s_b) = .379.
    • The critical value for α = .05 and n = 20 is .450.
  – Conclusion: Do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.
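The 20-worker data set itself is not reproduced in the text, but the mechanics described above (tie-averaged ranks, then the ordinary correlation of the ranks) can be sketched as follows; the helper names are our own, not from the textbook.

```python
import math

def average_ranks(values):
    """Rank values 1..n, averaging ranks over ties (as in the example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # rs = cov(a, b) / (s_a * s_b), where a and b are the ranks of the data.
    a, b = average_ranks(x), average_ranks(y)
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    cov_ab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / (n - 1)
    s_a = math.sqrt(sum((ai - abar) ** 2 for ai in a) / (n - 1))
    s_b = math.sqrt(sum((bi - bbar) ** 2 for bi in b) / (n - 1))
    return cov_ab / (s_a * s_b)
```

For instance, `average_ranks([5, 7, 7, 9])` gives `[1.0, 2.5, 2.5, 4.0]`, averaging the two tied middle ranks.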
17.9 Regression Diagnostics - I
• The three conditions required for the validity of the regression analysis are:
  – the error variable is normally distributed,
  – the error variance is constant for all values of x,
  – the errors are independent of each other.
• How can we diagnose violations of these conditions?
• Residual Analysis
  – By examining the residuals (or standardized residuals), we can identify violations of the required conditions.
  – Example 17.1 (continued)
    • Nonnormality:
      – Use Excel to obtain the standardized residual histogram.
      – Examine the histogram and look for a bell-shaped diagram with mean close to zero.
  (The slide shows a partial list of observations, predictions, residuals, and standardized residuals.) For each residual we calculate the standard deviation as follows:

    s_{rᵢ} = s_ε √(1 − hᵢ), where hᵢ = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)²

  Standardized residual i = Residual i / Standard deviation of residual i.
  We can also apply the Lilliefors test or the χ² test of normality.
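The standardized-residual calculation above can be sketched end to end; the function name is our own, and it is exercised here on the four-point toy data from earlier in the chapter rather than the Example 17.1 data.

```python
import math

# Standardized residual i = residual_i / (s_eps * sqrt(1 - h_i)), where
# h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2) is the leverage above.
def standardized_residuals(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_eps = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e / (s_eps * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]
```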
• Heteroscedasticity
  – When the requirement of a constant variance is violated we have heteroscedasticity.
  (The residual-versus-ŷ plot on the slide shows the spread of the residuals increasing with ŷ.)
• When the requirement of a constant variance is not violated we have homoscedasticity.
  (The residual-versus-ŷ plot shows that the spread of the data points does not change much.)
  (A second residual plot shows an even spread across ŷ, a much better situation.)
• Nonindependence of error variables
  – A time series is constituted if the data were collected over time.
  – Examining the residuals over time, no pattern should be observed if the errors are independent.
  – When a pattern is detected, the errors are said to be autocorrelated.
  – Autocorrelation can be detected by graphing the residuals against time.
• Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
  (Two residual-versus-time plots: one shows runs of positive residuals replaced by runs of negative residuals; the other shows oscillating behavior of the residuals around zero.)
• Outliers
  – An outlier is an observation that is unusually small or large.
  – Several possibilities need to be investigated when an outlier is observed:
    • There was an error in recording the value.
    • The point does not belong in the sample.
    • The observation is valid.
  – Identify outliers from the scatter diagram.
  – It is customary to suspect that an observation is an outlier if its |standardized residual| > 2.
  (Two scatter diagrams contrast an outlier with an influential observation: some outliers may be very influential, and such an outlier causes a shift in the regression line.)
• Procedure for regression diagnostics
  – Develop a model that has a theoretical basis.
  – Gather data for the two variables in the model.
  – Draw the scatter diagram to determine whether a linear model appears to be appropriate.
  – Check the required conditions for the errors.
  – Assess the model fit.
  – If the model fits the data, use the regression equation.
