Topics Outline
Review of Least Squares Regression Line
The Linear Regression Model
Confidence Intervals for the Intercept and the Slope
Testing the Hypothesis of No Linear Relationship
Inference about Prediction
Residuals
Conditions for Regression Inference
Review of Least Squares Regression Line
In simple linear regression, we consider a data set consisting of the paired observations
(x₁, y₁), …, (xₙ, yₙ). Our goal is to investigate how the two quantitative variables x and y,
corresponding to the data values xᵢ and yᵢ, are related. We are also interested in predicting a
future response y from information about x.
The correlation coefficient r measures the direction and strength of the linear relationship between
two quantitative variables. Values of r close to −1 or +1 indicate a strong negative or positive
linear relationship.
The least-squares regression line of the response variable y on the explanatory variable x is the line
ŷ = a + bx
that minimizes the sum of the squares of the vertical distances of the data points (xᵢ, yᵢ)
from the line. The slope
b = r (s_y / s_x)
of the regression line is the rate at which the predicted response ŷ changes along the line as the
explanatory variable x changes. Specifically, b is the change in ŷ when x increases by 1.
The intercept of the regression line,
a = ȳ − b x̄,
is the predicted response when the explanatory variable x = 0. This prediction is of no statistical
interest unless x can actually take values near 0.
The coefficient of determination r² is the square of the correlation coefficient r.
It measures the fraction of the variation in the response variable y that is explained by the least
squares regression on the explanatory variable x.
The least squares regression line can be used to predict the value of the response variable y for a
given value of the explanatory variable x by substituting this x into the equation of the line.
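As a quick numeric sketch (the short data set below is made up for illustration, not taken from these notes), the slope and intercept formulas can be checked directly:

```python
from statistics import mean, stdev

# Hypothetical toy data, invented for this sketch only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 3.6, 4.4, 5.2]

x_bar, y_bar = mean(xs), mean(ys)

# Building blocks: sums of squared deviations and cross-products
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / (sxx * syy) ** 0.5       # correlation coefficient
b = r * stdev(ys) / stdev(xs)      # slope: b = r * (s_y / s_x)
a = y_bar - b * x_bar              # intercept: a = y_bar - b * x_bar

# The slope formula b = r * s_y/s_x is algebraically the same as Sxy/Sxx
assert abs(b - sxy / sxx) < 1e-12
print(round(a, 2), round(b, 2))  # 1.33 0.77
```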
Example 1
Car plant electricity usage
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon
the plant's production, based on the data for each month of the previous year:
Month        Production, x    Electricity usage, y
             ($ million)      (million kWh)
January         4.51             2.48
February        3.58             2.26
March           4.31             2.47
April           5.06             2.77
May             5.64             2.99
June            4.99             3.05
July            5.29             3.18
August          5.83             3.46
September       4.70             3.03
October         5.61             3.26
November        4.90             2.67
December        4.20             2.53

[Scatterplot: Electricity usage (million kWh) against Production ($ million)]
The scatterplot shows a positive linear relationship, with no extreme outliers or potentially
influential observations. Higher levels of production do tend to require higher levels of electricity.
The correlation coefficient r = √0.8021 = 0.896 is high, indicating a strong linear
relationship between Production and Electricity usage. The equation of the least squares regression line is
ŷ = a + bx = 0.409 + 0.499x
Because r² = 0.8021, about 80% of the variation in Electricity usage is explained by Production levels.
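As a check (a short Python sketch, not part of the original notes), fitting the least squares line to the twelve monthly observations reproduces the quoted values of a, b, and r²:

```python
# Production (x, $ million) and electricity usage (y, million kWh) from the table above
xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx                 # slope of the least squares line
a = y_bar - b * x_bar         # intercept
r2 = sxy ** 2 / (sxx * syy)   # coefficient of determination

print(round(a, 3), round(b, 3), round(r2, 4))  # 0.409 0.499 0.8021
```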
Is the observed relationship statistically significant?
The Linear Regression Model
The simple linear regression model assumes that the response y is related to the explanatory
variable x by
y = α + βx + ε
where ε is a random variable referred to as the error (or residual) term. The error term accounts
for the variability in y that cannot be explained by the linear relationship between x and y.
The random variable ε is assumed to have a mean of zero and standard deviation σ.
A consequence of this assumption is that the mean of y is equal to
μ_y = α + βx
The standard deviation σ of the error term is estimated by
s = √[ (1/(n−2)) Σᵢ₌₁ⁿ (residualᵢ − 0)² ] = √[ (1/(n−2)) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ]
and is referred to as the regression standard error (or standard error of estimate).
The regression standard error for our example is s = 0.173. (See Excel output on the last page.)
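To make the definition concrete, here is a short Python sketch (not from the notes) that plugs the fitted line ŷ = 0.409 + 0.499x into the formula for s, using the twelve observations:

```python
import math

xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

# Fitted coefficients from the Excel output
a, b = 0.409048, 0.498830
n = len(xs)

# s = sqrt( sum of squared residuals / (n - 2) )
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(round(s, 3))  # 0.173
```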
[Figure: The estimation process in simple linear regression. From the sample data
(x₁, y₁), …, (xₙ, yₙ) we compute the sample statistics a, b, s, which provide estimates of the
regression parameters α, β, σ (the intercept, the slope, and the standard deviation of ε) in the
model y = α + βx + ε. The estimated regression line is ŷ = a + bx.]
Confidence Intervals for the Intercept and the Slope
If we repeated the experiment many times with the same xᵢ's, we would get different yᵢ's each time,
due to random errors. Therefore, we would also get different values for the least squares
estimators a and b of the population parameters α and β. Indeed, a and b are sample statistics
that have their own sampling distributions.
Let SEa and SEb be estimates of the standard errors (i.e. standard deviations) of a and b,
respectively. It can be shown that the level C confidence intervals for the intercept α and the
slope β are given by the following confidence limits:
α:  a ± t* SEa
β:  b ± t* SEb
Here t* is the critical value for the t(n − 2) density curve with area C between −t* and t*.
Note: All t procedures in simple linear regression have n − 2 degrees of freedom.
Example 1 (Continued)
For our example (see Excel output),
a = 0.4090    SEa = 0.3860
b = 0.4988    SEb = 0.0784
With n − 2 = 10 degrees of freedom, t* = 2.228 for 95% confidence, so the 95% confidence
interval for the slope β is
0.4988 ± 2.228 × 0.0784,  or  0.324 to 0.673.
Thus the management of the car plant can be 95% confident that, within the range of the data set,
the mean electricity usage increases by somewhere between a third of a million kilowatt-hours
and two thirds of a million kilowatt-hours for every additional $1 million of production.
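The standard errors themselves come from s and the spread of the xᵢ, via the standard textbook formulas SEb = s/√Σ(xᵢ − x̄)² and SEa = s √(1/n + x̄²/Σ(xᵢ − x̄)²). A Python sketch (not part of the notes) that reproduces the confidence interval for the slope:

```python
import math

xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

# Standard errors of the slope and intercept (standard textbook formulas)
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + x_bar ** 2 / sxx)

t_star = 2.228  # t-table critical value for 95% confidence, n - 2 = 10 df
print(f"{b - t_star * se_b:.3f} to {b + t_star * se_b:.3f}")  # 0.324 to 0.673
```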
Testing the Hypothesis of No Linear Relationship
The slope β is unknown and represents the slope of the true unknown regression line
μ_y = α + βx
while b is the estimate of the slope obtained by fitting a line to the data set.
Hence, we can determine the existence of a statistically significant relationship between the x and y
variables by testing whether β (the true slope) is equal to 0.
The null and alternative hypotheses are stated as follows:
H₀: β = 0
Hₐ: β ≠ 0
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
It can be shown that the test statistic is
t = b / SEb
Example 1 (Continued)
To test the hypothesis
H₀: β = 0  versus  Hₐ: β ≠ 0
we compute the test statistic
t = b / SEb = 0.4988 / 0.0784 = 6.37
The t-table shows that the two-sided P-value for the t distribution with 10 degrees of freedom is
smaller than 0.001. (Excel gives P-value = 0.000082.)
We reject H₀ and conclude that the slope of the population regression line is not 0.
In other words, the data provide very strong evidence to conclude that the distribution of
electricity usage does depend upon the level of production.
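The same test can be sketched in Python (not part of the notes), recomputing b and SEb from the data; the cutoff 4.587 is the two-sided 0.001 critical value of t(10) from a t-table:

```python
import math

xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
se_b = s / math.sqrt(sxx)

t = b / se_b
print(round(t, 2))  # 6.37
# t exceeds 4.587, the two-sided 0.001 critical value of t(10),
# so the P-value is smaller than 0.001 and H0: beta = 0 is rejected
assert t > 4.587
```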
An alternative to testing the existence of a linear relationship between the x and y variables is to set
up a confidence interval for β and to determine whether the hypothesized value (β = 0) is
included in the interval. The 95% confidence interval for β is 0.32 to 0.67.
Because this interval does not contain 0, we conclude that there is a significant linear relationship
between x and y. Had the interval included 0, the conclusion would have been that no (linear)
relationship exists between the variables.
Inference about Prediction
To estimate the response at a given value x* of the explanatory variable, substitute x* into the
equation of the least squares line: ŷ* = a + b x*. Two intervals are used:
1. Confidence interval for the mean response μ_y when x takes the value x*:
ŷ* ± t* SEmean
where
SEmean = s √[ 1/n + (x* − x̄)² / Σᵢ₌₁ⁿ (xᵢ − x̄)² ]
This confidence interval expresses our uncertainty about the regression line.
If we knew α and β, then we would know the regression line exactly and our confidence
interval would be one point.
2. Prediction interval for an individual (future) response y*:
ŷ* ± t* SEind
where
SEind = s √[ 1 + 1/n + (x* − x̄)² / Σᵢ₌₁ⁿ (xᵢ − x̄)² ]
This prediction interval expresses our uncertainty about the regression line and the fact that
there are errors in the data. If we knew α and β, we would know the regression line exactly,
but the length of our prediction interval would not shrink to zero, since the error term ε* in
y* = α + βx* + ε*
always has a fixed variance σ².
In both intervals, t* is the critical value for the t(n − 2) density curve with area C between −t* and t*,
and
s = √[ Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) ]
is the regression standard error. Both intervals have the common form
ŷ* ± t* SE
However, the prediction interval is wider than the confidence interval because it is harder to
predict one individual response than to predict a mean response.
Individuals are always more variable than averages!
Excel's Regression tool does not have an option for computing confidence and prediction intervals.
These intervals can be computed using formulas along with the output of the Regression tool.
Example 1 (Continued)
For our example (see Excel output),
ŷ* = 2.903    t* = 2.228    SEmean = 0.0507    SEind = 0.1802
A 95% confidence interval for the mean response μ_y when x takes the value x* = 5 is
2.903 ± 2.228 × 0.0507,  or  2.79 to 3.02.
This interval implies that with a monthly production of $5 million, the mean electricity usage is
between about 2.8 and 3 million kWh.
A 95% prediction interval for a future response y* when x* = 5 is
2.903 ± 2.228 × 0.1802,  or  2.50 to 3.30.
This prediction interval indicates that if next month's production target is $5 million,
then with 95% confidence next month's electricity usage will be somewhere between 2.5 and 3.3
million kWh.
Thus, while the expected or average electricity usage in a month with $5 million of production is
known to lie somewhere between 2.8 and 3.0 million kWh, the electricity usage in a particular
month with $5 million of production will be somewhere between 2.5 and 3.3 million kWh.
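Both intervals can be reproduced with a short Python sketch (not from the notes), using the formulas for SEmean and SEind above with x* = 5:

```python
import math

xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x_star = 5.0
y_hat = a + b * x_star
t_star = 2.228  # 95% confidence, n - 2 = 10 df

leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = s * math.sqrt(leverage)       # standard error for the mean response
se_ind = s * math.sqrt(1 + leverage)    # standard error for an individual response

print(f"CI: {y_hat - t_star * se_mean:.2f} to {y_hat + t_star * se_mean:.2f}")  # CI: 2.79 to 3.02
print(f"PI: {y_hat - t_star * se_ind:.2f} to {y_hat + t_star * se_ind:.2f}")    # PI: 2.50 to 3.30
```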
Residuals
The residuals (y − ŷ) give useful information about the contribution of individual data points to
the overall pattern of scatter. Residual values show how much the observed values differ from
the fitted values. If a particular residual is positive, the corresponding data point is above the
line; if it is negative, the point is below the line. The only time a residual is zero is when the
point lies directly on the line.
Example 1 (Continued)
There are twelve residuals:

Observation    1      2      3      4      5      6      7      8      9     10     11     12
Residual     -0.18   0.07  -0.09  -0.16  -0.23   0.15   0.13   0.14   0.28   0.05  -0.18   0.03
We can construct a residual plot by plotting the residuals against the explanatory variable x or the
predicted (also called fitted) values ŷ. In a residual plot, the residual = 0 line represents the
position of the least-squares line in the scatterplot of y against x. (See Excel output.)
Residual plots are the primary tool for determining whether the assumed regression model is
appropriate.
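A sketch of the residual computation in Python (not part of the notes); a basic property of least squares is that the residuals sum to zero:

```python
xs = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
ys = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Residuals from a least squares fit always sum to (numerically) zero
assert abs(sum(residuals)) < 1e-9
print(round(residuals[0], 2))  # -0.18  (January lies below the fitted line)
```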
Conditions for Regression Inference
An important step in determining whether the assumed linear regression model
y = α + βx + ε
is appropriate involves testing for the significance of the relationship between the explanatory and
response variables. The tests of significance in regression analysis are based on four assumptions
about the error term ε.
Figure 2 illustrates the regression model assumptions and their implications. Note that in this
graphical interpretation, the mean response μ_y moves along a straight line as the explanatory
variable x changes. The normal curves show how the observed response y will vary when x is held
fixed at different values. All of the curves have the same standard deviation σ, so the variability
of y is the same for all values of x.
Here are the four conditions for regression inference, their implications and how to check if the
conditions are satisfied.
1. Linearity
Condition: The error term ε has a mean of zero.
Implication
Because the mean of ε is zero, the mean of y for a given value of x is μ_y = α + βx;
that is, the mean response moves along a straight line as x changes.
How to check
Look for a curved pattern in the residual plot; if the residuals vary randomly around the
residual = 0 line with no curvature, the linearity condition is reasonable.
2. Independence
Condition: The values of ε are independent.
Implication
The value of ε for a particular value of x is not related to the value of ε for any other value of x.
Thus, the value of y for a particular value of x is not related to the value of y for any other value of x.
How to check
Signs of dependence in the residual plot are a bit subtle. In general, if the residual plot displays
a random pattern with no apparent trends, cycles, alternations, or clumping, it is reasonable to
conclude that the independence assumption holds.
Example 1
The residual plot shows a random variation around the residual = 0 line.
3. Normality
Condition: The error term ε is a normally distributed random variable
(with mean 0 and standard deviation σ).
Implication
Because y is a linear function of ε, y is also a normally distributed random variable
(with mean μ_y = α + βx and standard deviation σ).
How to check
Check for clear skewness or other major departures from normality in the histogram of the residuals.
Or, check if the points in the normal probability plot (Q-Q plot) are far from a 45° line.
Example 1
The histogram of the residuals does not show any important deviations from normality.
4. Equal spread
Condition: The standard deviation σ of ε is the same for all values of x.
Implication
The standard deviation of y about the regression line equals σ and is the same for all
values of x.
How to check
Look at the scatter of the residuals above and below the residual = 0 line in the
residual plot. The scatter should be roughly the same from one end to the other.
Example 1
The residual plot shows no unusual variation in the scatter of the residuals above and
below the line as x varies.
Example 2
The following figure shows some general patterns that might be observed in any residual plot.
Good pattern: the residuals are randomly scattered.
Curved pattern: the relationship is not linear.
Change in variability: the standard deviation σ of ε is not equal for all values of x.
Excel output for Example 1:

              Coefficients    Lower 95%    Upper 95%
Intercept       0.409048     -0.450992     1.269089
Production      0.498830      0.324252     0.673409

[Residual plot: residuals plotted against Production ($ million)]
[Histogram of the residuals]