
SIMPLE LINEAR REGRESSION PART 1

Topics Outline
Explanatory and Response Variables
Interpreting Scatterplots
Correlation
The Least Squares Regression Line
Explanatory and Response Variables
Regression analysis provides us with a regression equation describing the nature of the
relationship between two (or more) variables. In addition, regression analysis supplies variance
measures which allow us to assess the accuracy with which the regression equation can predict
values on the response variable.
Example 1 (Car plant electricity usage)
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon
the plant's production, based on the data for each month of the previous year:
Month        Production x ($ million)   Electricity usage y (million kWh)
January              4.51                        2.48
February             3.58                        2.26
March                4.31                        2.47
April                5.06                        2.77
May                  5.64                        2.99
June                 4.99                        3.05
July                 5.29                        3.18
August               5.83                        3.46
September            4.70                        3.03
October              5.61                        3.26
November             4.90                        2.67
December             4.20                        2.53
Questions: 1. How are these two data sets related?
2. Given an observation for the variable x, can we predict the value of the variable y?
y is called response (dependent, target, criterion) variable.
The response variable measures an outcome of a study.
x is called explanatory (independent, predictor, regressor) variable.
The explanatory variable explains or influences changes in the response variable.
There is both simple and multiple regression. In simple regression, we have one explanatory variable x.
In the case of multiple regression, we work with several explanatory variables x1, x2, …, xn.


Interpreting Scatterplots
The easiest way to see how two numerical variables are related is to consider their scatterplot.
Typically the explanatory variable is plotted on the x axis and the response variable is plotted on
the y axis. Below is the scatter plot of our data:

(Scatter plot of the twelve data points: production ($ million) on the horizontal axis, from 3.5 to 6.0; electricity usage (million kWh) on the vertical axis, from 2.00 to 3.50.)

Figure 1 Scatter plot of car plant electricity usage


After plotting two variables on a scatterplot, we describe the relationship by examining the form,
direction, and strength of the association. We look for an overall pattern and striking deviations
from that pattern:
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the form
Outlier: point that falls outside the overall pattern of the relationship
The form of association in our example is linear. That is, the overall pattern follows a straight line.
The direction of the association is positive. That is, high production values tend to accompany high
electricity usage values. The association is quite strong: the data points do not lie exactly on a straight
line, but they appear to cluster very closely about one. There are no apparent outliers in this example.
Of course, not all relationships have a simple form and a clear direction that we can describe as
positive association or negative association. Sometimes x and y vary independently and knowing
x tells you nothing about y.
Correlation
The correlation coefficient measures the direction and strength of the linear relationship
between two numerical variables. It is calculated using the mean and the standard deviation of
both the x and y variables:
r = (1 / (n − 1)) · Σ(i=1..n) [(xi − x̄) / sx] · [(yi − ȳ) / sy]
That is, the correlation is an average of the products of the standardized values of each pair (x, y)
in the data set.
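This formula can be checked numerically. Below is a short Python sketch (not part of the original notes) that computes r for the twelve months of Example 1 using only the standard library:

```python
from statistics import mean, stdev

# Monthly data from Example 1: production ($ million) and electricity usage (million kWh)
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 in the denominator)

# r is the average of the products of the standardized values of each (x, y) pair
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

print(round(r, 4))  # 0.8956, the value used later in these notes
```

Note that `statistics.stdev` already divides by n − 1, matching the sample standard deviations used in the formula.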


Facts about Correlation


1. Correlation can only be used to describe numerical variables.
Categorical variables do not have means and standard deviations.
2. The value of the correlation coefficient r does not change if the explanatory and response
variables are switched.
3. Since r uses the standardized values of the observations, r does not change when we
change the units of measurement of x, y or both.
4. Positive r indicates positive association between the variables, and negative r indicates
negative association.
5. The correlation r is always a number between −1 and +1.
It is equal to −1 when the data points lie on a straight line with a downward slope,
and r is equal to +1 when the data points lie on a straight line with an upward slope.
Values of r close to −1 or +1 indicate that the points in a scatterplot lie close to a straight line.
A value of r near 0 indicates at most a weak linear relationship between the data points.
6. Correlation measures the strength of only the linear relationship between two variables.
A correlation of r = 0 means that there is no linear relationship between the data
points, although there might be a strong nonlinear relationship.
7. Correlation is not a resistant measure: like the mean and standard deviation, it is strongly
influenced by a few outlying observations.

Example 2
The scatterplots in the figure to the right
illustrate how values of r closer to −1 or +1
correspond to stronger linear
relationships.
In general, it is not so easy to guess the
value of r from the appearance of a
scatterplot. Remember that changing the
plotting scales in a scatterplot may
mislead our eyes, but it does not change
the correlation.


The Least-Squares Regression Line


In simple linear regression we suppose that there is an underlying linear relationship between the
explanatory variable x and the response variable y.

Example 1 (Continued)
Consider the scatter plot of our data:

(Scatter plot of electricity usage (million kWh) against production ($ million), as in Figure 1.)

Clearly, the data points do not lie on a straight line, but they appear to cluster about a straight line,
which suggests a linear relationship between x and y. We want to fit a straight line to the data points.
However, infinitely many possible lines y = a + bx, differing in slope b and/or
y-intercept a, could be drawn through the cluster of our data points.
The linear least squares fitting technique is the simplest and most commonly applied form of
linear regression. The linear least-squares regression line (also known as fitted, estimated,
predicted line) is the line
ŷ = a + bx
that makes the sum of the squares of the vertical distances of the data points (xi, yi) from the
line as small as possible. It can be shown that the values of a and b that minimize the sum of the
squared vertical distances are given by
b = r · (sy / sx)
a = ȳ − b · x̄
where
x̄ and ȳ are the means of variables x and y;
sx and sy are the standard deviations of variables x and y;
r is the correlation coefficient for variables x and y.
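These two formulas translate directly into code. A minimal Python sketch (an illustration, not part of the original notes), applied to the Example 1 data:

```python
from statistics import mean, stdev

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation, then the least-squares slope and intercept
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
b = r * s_y / s_x       # slope: b = r * (s_y / s_x)
a = y_bar - b * x_bar   # intercept: a = y_bar - b * x_bar

print(round(b, 4), round(a, 3))  # 0.4988 0.409
```

The printed values match the slope and intercept derived in the continuation of Example 1 below.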


Example 1 (Continued)
For our example,
x = 4.885

y = 2.846

s x = 0.6655

s y = 0.3707

r = 0.8956

The slope is therefore


b = r · (sy / sx) = 0.8956 · (0.3707 / 0.6655) = (0.8956)(0.5570) = 0.49883 ≈ 0.499

and the intercept is


a = ȳ − b · x̄ = 2.846 − (0.49883)(4.885) = 0.409

The least squares regression line is thus


ŷ = 0.409 + 0.499x

which is shown together with the data points in Figure 2.


(Scatter plot of the data with the fitted line ŷ = 0.409 + 0.4988x drawn through it; production ($ million) on the horizontal axis, electricity usage (million kWh) on the vertical axis.)

Figure 2 Fitted regression line for car plant electricity usage


Notes:
1. The regression line does not pass through even one of the original points, and yet it is
the straight line that best approximates them.
2. The regression line passes through the point (x̄, ȳ) = (4.885, 2.846),
and this is not a coincidence. The regression line always passes through the point (x̄, ȳ). Why?
3. If we reverse the roles of x and y, we get a different least-squares regression line (see Figure 3).

(Scatter plot with electricity usage (million kWh) on the horizontal axis and production ($ million) on the vertical axis; the fitted line is y = 0.309 + 1.608x.)

Figure 3 Fitted regression line if x and y are switched


Interpretation of the slope: Variable cost
How much will y increase/decrease if x increases by 1 unit?
The slope of 0.499 means that for each increase of $1 million in production, the linear regression
model predicts that the electricity usage increases by 0.499 million (about half a million) kilowatt-hours.
Interpretation of the intercept: Fixed cost
What is the value of y if x is equal to 0 units?
The intercept of 0.409 means that if x = 0 (that is, nothing is produced),
the model predicts that the electricity usage is 0.409 million kWh.
As the above example shows, the interpretation of the intercept in regression analysis does not
always make sense in real life. Sometimes you might get a negative value for the intercept even
though the variable y is such that it is always positive. The value of the intercept is meaningful in
real life only when the explanatory variable x can actually take values close to zero.

Making Predictions
The regression line can be used to predict response values (ys) at one or more values of the
explanatory variable x within the range studied. This is called interpolation.
If a production level of $5.5 million worth of cars is planned for next month, then the plant
manager can predict that the electricity usage will be
ŷ = 0.409 + (0.499)(5.5) = 3.1535 million kWh

We must be cautious, though, about applying this equation to values of x beyond
those used to develop the equation (that is, below 3.5 and above 6), for the relationship may not
be linear for those values of x.
The use of a regression line for predictions outside the range of the data from which the line was
calculated is called extrapolation. Such predictions are often not accurate and should be avoided.
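One way to enforce this caution in code is to refuse predictions outside the observed range of x. The helper below is a hypothetical sketch (the name `predict_usage` and the bounds 3.58 and 5.83, the smallest and largest production values in the data, are my assumptions, not part of the notes):

```python
def predict_usage(x_new, a=0.409, b=0.4988, x_min=3.58, x_max=5.83):
    """Predict electricity usage from production; interpolate only, never extrapolate.

    x_min and x_max are the smallest and largest x values in the data set;
    requests outside that range would be extrapolation and are rejected.
    """
    if not (x_min <= x_new <= x_max):
        raise ValueError(f"x = {x_new} lies outside [{x_min}, {x_max}]: extrapolation")
    return a + b * x_new

print(round(predict_usage(5.5), 2))  # about 3.15 million kWh
```

Calling `predict_usage(6.5)` raises an error instead of silently returning an untrustworthy prediction.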

Residuals
A residual is the difference between an observed value of y and the value of y predicted by the
regression line:
residual = (observed y) − (predicted y) = y − ŷ
The observed y for the first x = 4.51 in our data set is 2.48.
The predicted y for x = 4.51 is ŷ = 0.409 + 0.4988x = 0.409 + 0.4988(4.51) = 2.66.
The residual for this observation is
residual = (observed y) − (predicted y) = y − ŷ = 2.48 − 2.66 = −0.18
Thus, the observed electricity usage for the first month lies 0.18 million kWh below the least-squares line on the scatterplot.
If we repeat this calculation eleven more times, we will get all the residuals:
Observation:   1      2      3      4      5      6     7     8     9     10     11     12
Residual:    −0.18   0.07  −0.09  −0.16  −0.23  0.15  0.13  0.14  0.28  0.05  −0.18   0.03

It can be shown that the mean of the least-squares residuals is always zero.
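This zero-mean property is easy to verify numerically. A short Python sketch (not part of the original notes) that fits the line from the raw data, so that rounding does not mask the result:

```python
from statistics import mean

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

x_bar, y_bar = mean(x), mean(y)
# Equivalent form of the slope: sum of cross-products over sum of squared x-deviations
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(residuals[0], 2))      # January: -0.18, as computed above
print(abs(sum(residuals)) < 1e-9)  # True: least-squares residuals sum (and average) to zero
```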
The standard deviation of the residuals, denoted by s or se, is given by the following equation

s = √[ (1 / (n − 2)) · Σ(i=1..n) (residuali − 0)² ] = √[ (1 / (n − 2)) · Σ(i=1..n) (yi − ŷi)² ]
and is referred to as the regression standard error (or standard error of estimate).
Note that the squared residuals are averaged by dividing by n − 2 and not by the usual n − 1.
The rule is to subtract the number of parameters being estimated from the sample size n to obtain
the denominator. Here there are two parameters being estimated: the intercept and the slope.
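As a numerical check, the regression standard error for Example 1 can be computed directly (a Python sketch, not part of the original notes; note the n − 2 denominator):

```python
from math import sqrt
from statistics import mean

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Divide the sum of squared residuals by n - 2: two parameters (a and b) are estimated
s_e = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
print(round(s_e, 2))  # 0.17 million kWh
```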
Since you usually want your forecasts and predictions to be as accurate as possible, you would
be glad to find a small value for s e . We judge the value of s e by comparing it to the values of the
response variable y, or more specifically to the sample mean ȳ. Because, in our example,
ȳ = 2.85 million kWh and se = 0.17 million kWh,

it does appear that the standard error of estimate is small. This tells you that, for a typical month,
the actual electricity usage was different from the predicted electricity usage (on the least squares
line) by about 0.17 million kWh.
If the residuals are approximately normally distributed, the 68%-95%-99.7% empirical rule for
standard deviations can be applied to the standard error of estimate. For example, approximately
68% (or about two-thirds) of the residuals are typically within one standard error of their mean
(which is zero). Stated another way, about 68% (or two-thirds) of the observed y values are
typically within a distance s e either above or below the regression line. Similarly, about 95% of
the observed y values are typically within 2 s e of the corresponding fitted y values, and so forth.
A residual plot is a scatterplot of the residuals against the explanatory variable x or the predicted
values ŷ. The horizontal line at zero residual corresponds to the fitted regression line. Ideally,
the plot of the residuals should show purely random fluctuations around the zero residual line.

Coefficient of Determination r²
The total variation in the values of y can be decomposed into two parts: explained and
unexplained variation:

Figure 4 Decomposition of total variation


The coefficient of determination r² is the square of the correlation coefficient r.
It measures the fraction of the variation in the values of y that can be explained by y's linear dependence
on x in the regression model. The idea is that when there is a linear relationship, some of the variation
in y is accounted for by the fact that as x changes it pulls y with it along the regression line.
This coefficient always lies between 0 and 1. A value of r² near 1 indicates that changes in x
explain almost 100% of the variation in y, and therefore the regression equation is extremely
useful for making predictions. A value of r² near 0 indicates that the amount of unexplained
variation in the regression model is large in relation to the explained variation. In this case, we
should be cautious when using the regression equation for predictions.
For our data, r² = (0.8956)² = 0.802. Evidently, the regression equation obtained in this
example is quite useful for predicting the electricity usage because about 80% of the variability
in the electricity usage can be explained by changes in the production levels.
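The decomposition of total variation gives an equivalent way to compute r²: one minus the ratio of unexplained to total variation. A sketch (not part of the original notes) using the Example 1 data:

```python
from statistics import mean

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)                      # total variation
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # unexplained variation
r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.802, the square of r = 0.8956
```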

Outliers and Influential Observations


Recall that an outlier is an observation that lies outside the overall pattern of the other observations.
An observation is called influential if removing it would significantly change the equation of the
regression line. Points that are outliers in the x direction are often (but not always) influential.

Caution about Correlation and Regression


Association does not imply causation!
The observation that two variables tend to vary simultaneously in the same direction does not
imply a direct relationship between them. It would not be surprising, for example, to obtain a high
positive correlation between the annual sales of chewing gum and the incidence of crime in cities
of various sizes within the United States, but one cannot conclude that crime might be reduced by
prohibiting the sale of chewing gum. Both variables depend upon the size of the population, and it
is this mutual relationship with a third variable (population size) which produces the positive
correlation. This third variable, called a lurking variable, is often overlooked when mistaken
claims are made about x causing y.
