Topics Outline
Explanatory and Response Variables
Interpreting Scatterplots
Correlation
The Least Squares Regression Line
Explanatory and Response Variables
Regression analysis provides us with a regression equation describing the nature of the relationship between two (or more) variables. In addition, regression analysis supplies variance measures which allow us to assess the accuracy with which the regression equation can predict values of the response variable.
Example 1 (Car plant electricity usage)
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production, based on the data for each month of the previous year:
Month        Production, x    Electricity usage, y
             ($ million)      (million kWh)
January        4.51             2.48
February       3.58             2.26
March          4.31             2.47
April          5.06             2.77
May            5.64             2.99
June           4.99             3.05
July           5.29             3.18
August         5.83             3.46
September      4.70             3.03
October        5.61             3.26
November       4.90             2.67
December       4.20             2.53
Questions:
1. How are these two data sets related?
2. Given an observation for the variable x, can we predict the value of the variable y?
y is called the response (dependent, target, criterion) variable.
The response variable measures an outcome of a study.
x is called the explanatory (independent, predictor, regressor) variable.
The explanatory variable explains or influences changes in the response variable.
There is both simple and multiple regression. In simple regression, we have one explanatory variable x. In the case of multiple regression, we work with several explanatory variables $x_1, x_2, \ldots, x_n$.
Interpreting Scatterplots
The easiest way to see how two numerical variables are related is to consider their scatterplot. Typically, the explanatory variable is plotted on the x-axis and the response variable on the y-axis. Below is the scatterplot of our data:
[Figure: scatterplot of electricity usage (million kWh) against production ($ million).]
r=
n 1 i =1 s x s y
That is, the correlation is an average of the products of the standardized values of each pair (x, y)
in the data set.
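To make the formula concrete, here is a minimal sketch in plain Python (standard library only) that applies it to the car-plant data of Example 1; the variable names are our own.

```python
from statistics import mean, stdev

# Car-plant data from Example 1
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]  # production ($ million)
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]  # usage (million kWh)

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 in the denominator)

# Average of the products of the standardized values, with an n - 1 divisor
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 4))  # approximately 0.8956, the value used later in these notes
```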
Example 2

The scatterplots in the accompanying figure illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships. In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the correlation.
Example 1 (Continued)
Consider the scatterplot of our data:

[Figure: scatterplot of electricity usage (million kWh) against production ($ million).]
Clearly, the data points do not lie on a straight line, but they appear to cluster about a straight line, which suggests a linear relationship between x and y. We want to fit a straight line to the data points. However, there are infinitely many possible lines $y = a + bx$, differing in slope b and/or y-intercept a, that could be drawn through the cluster of our data points.
The linear least squares fitting technique is the simplest and most commonly applied form of linear regression. The linear least-squares regression line (also known as the fitted, estimated, or predicted line) is the line

$$\hat{y} = a + bx$$

that makes the sum of the squares of the vertical distances of the data points $(x_i, y_i)$ from the line as small as possible. It can be shown that the values of a and b that minimize the sum of the squared vertical distances are given by
$$b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}$$

where
$\bar{x}$ and $\bar{y}$ are the means of variables x and y;
$s_x$ and $s_y$ are the standard deviations of variables x and y;
r is the correlation coefficient for variables x and y.
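Before working the example by hand, here is a small Python sketch (standard library only) of these two formulas, applied to the same car-plant data; all names are our own.

```python
from statistics import mean, stdev

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope:     b = r * s_y / s_x
a = y_bar - b * x_bar  # intercept: a = y_bar - b * x_bar
print(f"y-hat = {a:.3f} + {b:.3f}x")  # approximately y-hat = 0.409 + 0.499x
```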
Example 1 (Continued)

For our example,

$$\bar{x} = 4.885, \quad \bar{y} = 2.846, \quad s_x = 0.6655, \quad s_y = 0.3707, \quad r = 0.8956$$

so that

$$b = r\,\frac{s_y}{s_x} = 0.8956 \times \frac{0.3707}{0.6655} = (0.8956)(0.5570) = 0.49883 \approx 0.499$$

$$a = \bar{y} - b\,\bar{x} = 2.846 - (0.499)(4.885) \approx 0.409$$

and the least-squares regression line is

$$\hat{y} = 0.409 + 0.499x$$
[Figure: scatterplot of the data with the fitted line $\hat{y} = 0.409 + 0.499x$; electricity usage (million kWh) against production ($ million).]
Note that the fitted line passes through the point $(\bar{x}, \bar{y}) = (4.885, 2.846)$, and it is not a coincidence: the regression line always passes through the point $(\bar{x}, \bar{y})$. Why? Since $a = \bar{y} - b\bar{x}$, substituting $x = \bar{x}$ into $\hat{y} = a + bx$ gives $\hat{y} = \bar{y} - b\bar{x} + b\bar{x} = \bar{y}$.

If we reverse the roles of x and y, we get a different least-squares regression line (see Figure 3), as the sketch below illustrates.
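A short Python sketch of this reversal (all names are our own): applying the same formulas with the roles of the variables swapped reproduces the line shown in Figure 3.

```python
from statistics import mean, stdev

production = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
usage = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(usage)
u_bar, p_bar = mean(usage), mean(production)
s_u, s_p = stdev(usage), stdev(production)
r = sum((u - u_bar) * (p - p_bar) for u, p in zip(usage, production)) / ((n - 1) * s_u * s_p)

# Regress production on usage: the explanatory variable is now electricity usage
b_rev = r * s_p / s_u          # approximately 1.608
a_rev = p_bar - b_rev * u_bar  # approximately 0.309
print(f"production-hat = {a_rev:.3f} + {b_rev:.3f} * usage")
```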
[Figure 3: scatterplot with the roles of the variables reversed: production ($ million) plotted against electricity usage (million kWh), with fitted line $\hat{y} = 0.309 + 1.608x$, where x is now electricity usage.]
Making Predictions
The regression line can be used to predict response values (values of y) at one or more values of the explanatory variable x within the range studied. This is called interpolation.
If a production level of $5.5 million worth of cars is planned for next month, then the plant
manager can predict that the electricity usage will be
$$\hat{y} = 0.409 + (0.499)(5.5) = 3.1535 \text{ million kWh}$$
We must be cautious, though, about applying this equation to values of x beyond those used to develop the equation (that is, below 3.5 or above 6), for the relationship may not be linear for those values of x.
The use of a regression line for predictions outside the range of the data from which the line was
calculated is called extrapolation. Such predictions are often not accurate and should be avoided.
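As a small illustrative sketch (the function name and range constants are our own), a prediction routine can enforce this caution by refusing to extrapolate outside the observed range:

```python
X_MIN, X_MAX = 3.5, 6.0  # range of production values covered by the data, as noted above

def predict_usage(production: float) -> float:
    """Predicted electricity usage (million kWh) from the fitted line."""
    if not (X_MIN <= production <= X_MAX):
        raise ValueError("extrapolation: production is outside the range of the data")
    return 0.409 + 0.499 * production

print(predict_usage(5.5))  # about 3.15 million kWh, matching the hand calculation above
```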
Residuals
A residual is the difference between an observed value of y and the value of y predicted by the
regression line:
residual = (observed y) − (predicted y) = $y - \hat{y}$
The observed y for the first x = 4.51 in our data set is 2.48. The predicted y for x = 4.51 is $\hat{y} = 0.409 + 0.4988x = 0.409 + 0.4988(4.51) \approx 2.66$. The residual for this observation is

residual = (observed y) − (predicted y) = $y - \hat{y} = 2.48 - 2.66 = -0.18$

Thus, the observed electricity usage for the first month lies 0.18 million kWh below the least-squares line on the scatterplot.
If we repeat this calculation eleven more times, we get all twelve residuals:

Observation    1      2      3      4      5      6      7      8      9     10     11     12
Residual     −0.18   0.07  −0.09  −0.16  −0.23   0.15   0.13   0.14   0.28   0.05  −0.18   0.03
It can be shown that the mean of the least-squares residuals is always zero.
The standard deviation of the residuals, denoted by s or $s_e$, is given by the following equation

$$s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(\mathrm{residual}_i - 0)^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

and is referred to as the regression standard error (or standard error of estimate).
Note that the squared residuals are averaged by dividing by n − 2 and not by the usual n − 1. The rule is to subtract the number of parameters being estimated from the sample size n to obtain the denominator. Here there are two parameters being estimated: the intercept and the slope.
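A short Python sketch (standard library only) of these residual calculations for our data; the rounded coefficients 0.409 and 0.4988 are taken from the fitted line above.

```python
import math
from statistics import mean

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

a, b = 0.409, 0.4988                      # fitted intercept and slope
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(round(mean(residuals), 3))          # approximately 0, up to rounding of a and b

n = len(x)
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # divide by n - 2, not n - 1
print(round(s_e, 2))                      # approximately 0.17 million kWh
```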
Since you usually want your forecasts and predictions to be as accurate as possible, you would be glad to find a small value for $s_e$. We judge the value of $s_e$ by comparing it to the values of the response variable y, or more specifically to the sample mean $\bar{y}$. Because in our example

$$\bar{y} = 2.85 \text{ million kWh} \quad \text{and} \quad s_e = 0.17 \text{ million kWh},$$

it does appear that the standard error of estimate is small. This tells you that, for a typical month, the actual electricity usage differed from the predicted electricity usage (on the least-squares line) by about 0.17 million kWh.
If the residuals are approximately normally distributed, the 68%–95%–99.7% empirical rule for standard deviations can be applied to the standard error of estimate. For example, approximately 68% (about two-thirds) of the residuals are typically within one standard error of their mean (which is zero). Stated another way, about 68% (two-thirds) of the observed y values are typically within a distance $s_e$ either above or below the regression line. Similarly, about 95% of the observed y values are typically within $2s_e$ of the corresponding fitted values $\hat{y}$, and so forth.
A residual plot is a scatterplot of the residuals against the explanatory variable x or the predicted values $\hat{y}$. The horizontal line at zero residual corresponds to the fitted regression line. Ideally, the plot of the residuals should show purely random fluctuation around the zero-residual line.
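As a sketch, a residual plot of this kind can be drawn with matplotlib (assumed to be installed); the axis labels follow our example.

```python
import matplotlib.pyplot as plt

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
residuals = [yi - (0.409 + 0.4988 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)          # residuals against the explanatory variable
plt.axhline(0, linestyle="--")     # the zero line corresponds to the fitted regression line
plt.xlabel("Production ($ million)")
plt.ylabel("Residual (million kWh)")
plt.show()
```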
Coefficient of Determination $r^2$

The total variation in the values of y can be decomposed into two parts: explained and unexplained variation: