SIMPLE LINEAR REGRESSION PART 2

Topics Outline
Scatterplots and Correlation
The Least Squares Regression Line
Example 1
Sales versus Promotions at Pharmex
Pharmex is a chain of drugstores that operate around the country. To see how effective its
advertising and other promotional activities are, the company has collected data from 50
randomly selected metropolitan regions. In each region it has compared its own promotional
expenditures and sales to those of the leading competitor in the region over the past year.
The data are listed in the file Drugstore_Sales.xlsx.
There are two variables:
Promote: Pharmex's promotional expenditures as a percentage of those of the leading competitor
Sales: Pharmex's sales as a percentage of those of the leading competitor
Note that each of these variables is an index, not a dollar amount. For example, if Promote equals 95
for some region, this tells us only that Pharmex's promotional expenditures in that region are 95%
as large as those for the leading competitor in that region.
The company expects that there is a positive relationship between these two variables,
so that regions with relatively larger expenditures have relatively larger sales.
However, it is not clear what the nature of this relationship is.
Using StatTools
=================

Define a StatTools data set:
    Click anywhere within the data set
    Data Set Manager
    OK
Get a scatterplot:
    Summary Graphs
    Scatterplot
    Choose X and Y
    OK
Get a regression output:
    Regression and Classification
    Regression
    Fill in the resulting dialog box as shown to the right
    OK

Using Excel
=============

Getting a regression output in Excel:
    Data
    Data Analysis, Regression, OK
    Input Y Range, Input X Range
    Check Labels, Line Fit Plots
    OK
Formatting the scatterplot:
    Delete unwanted information (right-click, delete)
    Right-click, Add Trendline
    Check Display Equation on chart
    Check Display R-squared value on chart
    Close
Formatting the axes:
    Right-click
    Format axis
    Enter desired values
    Close
Note 1:
If the Analysis ToolPak is not installed in Excel, follow these steps:
    Click the arrow in the upper left corner
    More commands
    Add-Ins
    Analysis ToolPak
    Go
    Analysis ToolPak
    OK
Note 2:
The slope and intercept of the least squares line can also be calculated directly in Excel using the
functions SLOPE and INTERCEPT.
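For readers working outside Excel, the calculation behind SLOPE and INTERCEPT can be sketched in a few lines of Python. This is only an illustration of the standard least squares formulas: the function name `least_squares_line` is an assumption, and the five data points are made up for the example (they are not taken from Drugstore_Sales.xlsx).

```python
from statistics import mean


def least_squares_line(x, y):
    """Return (slope, intercept) of the least squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares of the deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up illustrative values, NOT the Pharmex data.
promote = [80, 90, 100, 110, 120]
sales = [85, 95, 100, 105, 115]

slope, intercept = least_squares_line(promote, sales)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Applied to the two columns of the actual data file, the same function should reproduce the values Excel reports.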

(a) What type of relationship, if any, is apparent from a scatterplot?


[Figure: scatterplot of Sales (vertical axis, 60 to 130) versus Promote (horizontal axis, 60 to 120),
with fitted trendline y = 0.7623x + 25.126 and R² = 0.4529]

This scatterplot indicates that there is a positive relationship between Promote and Sales
(the points tend to rise from bottom left to top right), but the relationship is not perfect.
If it were perfect, a given value of Promote would prescribe the value of Sales exactly.
Clearly, this is not the case. For example, there are five regions with promotional values of 96
but all of them have different sales values. So the scatterplot indicates that while the variable
Promote is helpful for predicting Sales, it does not lead to perfect predictions.
(b) Can the drugstore manager conclude that larger promotional expenses cause larger sales values?
No. Unless the data are obtained in a carefully controlled experiment, which is certainly not
the case here, you can never be absolutely sure about causation. One reason is that you can't
always be sure which direction the causation goes. Does x cause y, or does y cause x?
Another reason is that you can almost never rule out the possibility that some other variable
is causing the variation in both of the observed variables. Although this is unlikely in this
drugstore example, it is still a possibility.
(c) Calculate and interpret the correlation between Sales and Promote.
To calculate a correlation between two variables, you can use Excel's CORREL function or use
the value called Multiple R (with an appropriate sign) from Excel's Regression Analysis output.
Alternatively, you can use StatTools to obtain a whole table of correlations between a set of
variables.
The correlation between Sales and Promote is positive, as the upward-sloping scatter of points
suggests, and is equal to
r = 0.673

This is a moderately large correlation. It confirms the pattern in the scatterplot, namely,
that the points increase linearly from left to right but with considerable variation around any
particular straight line.
Reminder: Correlations apply only to linear relationships. If a correlation is close to zero,
you cannot automatically conclude that there is no relationship between the two variables.
You should look at a scatterplot first. The chances are that the points are a shapeless swarm and
that no relationship exists. But it is also possible that the points cluster around some curve.
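A small sketch can make this reminder concrete. The Python below computes the correlation by hand for two made-up data sets: one where y is an exact quadratic function of x, and one where y is an exact linear function of x. The helper name `pearson_r` and the data are assumptions chosen for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation coefficient between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up values: x is symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x]      # exact relationship, but curved
y_line = [2 * v + 1 for v in x]    # exact relationship, and linear

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)
print(f"curved relationship: r = {r_curve:.3f}")
print(f"linear relationship: r = {r_line:.3f}")
```

The curved case gives a correlation of zero even though y is completely determined by x, which is why the scatterplot should be checked before concluding that no relationship exists.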
(d) What is the least squares line for sales as a function of promotional expenses at Pharmex?
In the StatTools output, the intercept and slope of the least squares line appear under the
Coefficient label in cells B18 and B19. They imply that the equation for the least squares line is
Predicted Sales = 25.1264 + 0.7623 × Promote
(e) Interpret the slope of the regression line.
The slope, 0.7623, indicates that the sales index tends to increase by about 0.76 for each
one-unit increase in the promotional expenses index. Alternatively, if two regions are compared,
where the second region spends one unit more than the first region, the predicted sales index
for the second region is 0.76 larger than the predicted sales index for the first region.
(f) Interpret the intercept of the regression line.
The intercept is literally the predicted sales index for a region that does no promotions.
However, no region in the sample has anywhere near a zero promotional value.
Note: In many applications it makes no sense to have the explanatory variable(s) equal to zero.
Then the intercept term has no practical or economic meaning. Therefore, in such situations,
where the range of observed values for the explanatory variable does not include zero, it is best
to think of the intercept term as simply an anchor for the least squares line that enables
predictions of y values for the range of observed x values.
(g) Interpret the coefficient of determination.
The coefficient of determination r² is the square of the correlation between the observed y
values and the fitted y values. Aside from rounding, the square of r = 0.673 is 0.453,
which is shown as the R² value in the Excel output.
r² = 0.453
The explanatory variable Promote is able to explain only 45.3% of the variation in the Sales
variable. This is not particularly good. There is still 54.7% of the variation left unexplained.
Of course, we would like r² to be as close to 1 as possible. Usually, the only way to increase
it is to use better and/or more explanatory variables.
Note:
If the correlation between two variables y and x is 0.8, the regression of y on x will have an
r² of 0.64; that is, the regression with x as the only explanatory variable will explain 64% of
the variation in y.
If the correlation drops to 0.7, this percentage drops to 49%; if the correlation increases to 0.9,
the percentage increases to 81%. The point is that before a single variable x can explain a
large percentage of the variation in some other variable y, the two variables must be highly
correlated in either a positive or negative direction.
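The arithmetic in this note is easy to check. The sketch below squares each correlation mentioned above, plus the Pharmex value r = 0.673 and a negative correlation to show that the sign does not matter. The function name `percent_explained` is an assumption made for the example.

```python
def percent_explained(r):
    """Percentage of variation in y explained by a single x with correlation r."""
    return round(100 * r ** 2, 1)


for r in (0.7, 0.8, 0.9, -0.8, 0.673):
    print(f"r = {r:+.3f}  ->  r^2 explains {percent_explained(r):.1f}% of the variation")
```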
(h) What is the standard error of estimate? What does it measure?
The standard error of estimate is approximately s_e = 7.39. It indicates the typical magnitude
of error when using promotional expenses, via the regression equation, to predict sales.
More specifically, if the regression equation is used to predict sales for many regions,
about two-thirds of the predictions will be within s_e = 7.39 of the actual sales values,
and about 95% of the predictions will be within two standard errors, or 2 s_e = 14.78,
of the actual sales values.
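As a rough sketch of where this number comes from, the standard error of estimate for a simple regression is the square root of the sum of squared residuals divided by n - 2. The Python below applies that formula to five made-up points (not the Pharmex data) and then prints the two rule-of-thumb ranges for the reported value s_e = 7.39. The function name is an assumption.

```python
from math import sqrt


def standard_error_of_estimate(x, y):
    """s_e = sqrt(SSE / (n - 2)) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals around the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return sqrt(sse / (n - 2))


# Made-up illustrative values, NOT the Pharmex data.
promote = [80, 90, 100, 110, 120]
sales = [85, 95, 100, 105, 115]
s_e_toy = standard_error_of_estimate(promote, sales)
print(f"toy data: s_e = {s_e_toy:.3f}")

# Rule-of-thumb ranges for the value reported in the text.
s_e = 7.39
print(f"about two-thirds of predictions within +/- {s_e:.2f}")
print(f"about 95% of predictions within +/- {2 * s_e:.2f}")
```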
(i) Is this level of accuracy good?
One way to measure the regression equation's ability to predict is to compare the standard
error of estimate, s_e, to the standard deviation of the response variable, s_y. The idea is that
s_e is (essentially) the standard deviation of the residuals, whereas s_y is the standard deviation
of residuals from a horizontal regression line at height ȳ, the sample mean of the response
variable. Therefore, if s_e is small compared to s_y (that is, if s_e / s_y is small), the regression
line is evidently doing a good job in explaining the variation of the response variable.
The standard deviation of the Sales variable is s_y = 9.90. (This is obtained by the usual
STDEV function applied to the observed sales values y.)
It can be interpreted as the standard deviation of the residuals around a horizontal line
positioned at the mean value ȳ of Sales. This is the relevant regression line if there are no
explanatory variables, that is, if Promote is ignored. In other words, it is a measure of the
prediction error if the sample mean ȳ of Sales is used as the prediction for every region and
Promote is ignored.
Unfortunately, the standard error of estimate, s_e = 7.39, is not much less than s_y = 9.90.
This means that the Promote variable adds a relatively small amount to prediction accuracy.
Predictions with it are not much better than predictions without it. A standard error of estimate
well below 9.90 would certainly be preferred.
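The comparison can be put into numbers with a short sketch. The first part computes the ratio s_e / s_y from the two reported values. The second part uses the identity s_e² / s_y² = (1 - r²)(n - 1)/(n - 2), which follows from the definitions of s_e, s_y and r² in simple regression, to check that the ratio agrees (up to rounding) with the reported R² = 0.4529 and n = 50 regions. Variable names are assumptions.

```python
from math import sqrt

# Values reported in the text.
s_e = 7.39      # standard error of estimate
s_y = 9.90      # standard deviation of Sales
r_squared = 0.4529
n = 50          # number of regions in the sample

ratio = s_e / s_y
print(f"s_e / s_y = {ratio:.3f}")
print(f"typical error is reduced by about {100 * (1 - ratio):.0f}% when Promote is used")

# Ratio implied by r^2 and n; it should match the ratio above up to rounding.
implied_ratio = sqrt((1 - r_squared) * (n - 1) / (n - 2))
print(f"ratio implied by r^2 and n = {implied_ratio:.3f}")
```

A ratio of roughly 0.75 means the typical prediction error shrinks by only about a quarter, which is the sense in which Promote adds relatively little accuracy.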

(j) If the expenditure index for a given region is 95, what would you predict this region's sales
index to be?
Predicted Sales = 25.1264 + 0.7623 × Promote
= 25.1264 + 0.7623(95) = 97.5449 ≈ 98
(k) Find and interpret the residual value for region six.
Residual = observed Sales − predicted Sales
= 103 − 97.5449 = 5.4551 ≈ 5
The sales index for region six was about 5 points higher than we would expect for a region with
an expenditure index of 95.
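The prediction in (j) and the residual in (k) can be reproduced with a minimal sketch. The coefficients and the observed sales index of 103 come from the text; the function and variable names are assumptions.

```python
# Coefficients of the least squares line reported in part (d).
INTERCEPT = 25.1264
SLOPE = 0.7623


def predict_sales(promote):
    """Predicted sales index for a given promotional expenditure index."""
    return INTERCEPT + SLOPE * promote


predicted = predict_sales(95)
observed = 103                      # observed sales index for region six
residual = observed - predicted     # positive: the region did better than predicted

print(f"predicted sales index = {predicted:.4f}")
print(f"residual = {residual:.4f}")
```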
(l) What does the residual plot show?

A scatterplot of residuals (on the vertical axis) versus fitted values is a useful graph in almost
any regression analysis. You typically examine residual plots for any striking patterns.
A good fit not only has small residuals, but it has residuals scattered randomly around zero
with no apparent pattern. This appears to be the case for the Pharmex data.
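For contrast, here is a sketch of what a "striking pattern" looks like. It fits a straight line to made-up data that actually follow a curve (y = x²) and lists the residuals in order of x. The data and the helper name `fit_line` are assumptions for illustration only.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up curved data: a straight line is the wrong model here.
x = [1, 2, 3, 4, 5, 6, 7]
y = [v ** 2 for v in x]

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * v for v in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]

for fit, res in zip(fitted, residuals):
    print(f"fitted = {fit:6.1f}   residual = {res:+5.1f}")
```

The residuals run positive, then negative, then positive again as the fitted value increases. A U-shape like this signals that a straight line misses curvature in the data, unlike the patternless scatter described for the Pharmex data.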

Example 2
Overhead Costs at Bendrix
The Bendrix Company manufactures various types of parts for automobiles. The manager of the
factory wants to get a better understanding of overhead costs. These overhead costs include
supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number
of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.
Some of these overhead costs are fixed in the sense that they do not vary appreciably with the volume
of work being done, whereas others are variable and do vary directly with the volume of work.
The fixed overhead costs tend to come from the supervision, depreciation, and miscellaneous
categories, whereas the variable overhead costs tend to come from the indirect labor, supplies,
payroll taxes, and overtime categories. However, it is not easy to draw a clear line between the
fixed and variable overhead components.
The Bendrix manager has tracked total overhead costs for the past 36 months. To help explain
these, he has also collected data on two variables that are related to the amount of work done at
the factory (see Overhead_Costs.xlsx):
MachHrs: number of machine hours used during the month
ProdRuns: the number of separate production runs during the month
The first of these is a direct measure of the amount of work being done. To understand the second,
we note that Bendrix manufactures parts in large batches. Each batch corresponds to a production
run. Once a production run is completed, the factory must set up for the next production run.
During this setup there is typically some downtime while the machinery is reconfigured for the
part type scheduled for production in the next batch. Therefore, the manager believes that both of
these variables could be responsible (in different ways) for variations in overhead costs.
(a) Construct scatterplots to examine the relationships between each potential explanatory variable
(MachHrs and ProdRuns) and the dependent variable (Overhead).
Note: Because Overhead, MachHrs, and ProdRuns are time series variables, we should also be
on the lookout for any relationships between these variables and the Month variable.
That is, we should also investigate any time series behavior in these variables.
Here are the two scatterplots of interest:

[Figure: Scatterplot of Overhead vs MachHrs. Overhead (vertical axis, 70000 to 120000) against
MachHrs (horizontal axis, 1000 to 1900).]
[Figure: Scatterplot of Overhead vs ProdRuns. Overhead (vertical axis, 70000 to 120000) against
ProdRuns (horizontal axis, 10 to 60).]

These scatterplots show that Overhead tends to increase as either MachHrs increases or
ProdRuns increases. However, both relationships are far from perfect.
With time series data, as we have in this example, there is always the possibility that time
itself is an explanatory variable. So it's a good idea to create one or more time series graphs.
One of these, the time series graph for Overhead, is shown below.
[Figure: Time Series of Overhead. Overhead (vertical axis, 20000 to 140000) plotted by Month
(horizontal axis).]

It indicates a fairly random pattern through time, with no apparent upward trend or other
obvious time series pattern. You can check that time series graphs of the MachHrs and
ProdRuns variables also indicate no obvious time series patterns.
Finally, when there are multiple explanatory variables, we should check for relationships
among them. The scatterplot of MachHrs versus ProdRuns appears below.
(Either variable could be chosen for the vertical axis.)
[Figure: Scatterplot of MachHrs vs ProdRuns. MachHrs (vertical axis, 1000 to 1800) against
ProdRuns (horizontal axis, 10 to 60).]

This cloud of points indicates no relationship worth pursuing.


In summary, the Bendrix manager should continue to explore the positive relationship between
Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables
appears to have any time series behavior, and the two potential explanatory variables do not
appear to be related to each other.
(b) Calculate and interpret the correlation coefficients between the three pairs of variables.
The correlations between Overhead and MachHrs (0.632) and between Overhead and ProdRuns (0.521)
are moderately large and positive, as the scatterplots for the Bendrix manufacturing data suggest.
However, the correlation between MachHrs and ProdRuns, 0.229, is quite small
and indicates almost no relationship between these two variables.
(c) What are the least squares lines for regressing overhead expenses against machine hours and
against production runs?
The two least squares lines are:
Predicted Overhead = 48621 + 34.7 × MachHrs
Predicted Overhead = 75606 + 655.1 × ProdRuns
Clearly, these two equations are quite different, although each effectively breaks Overhead into
a fixed component and a variable component. The first equation implies that the fixed
component of overhead is about $48,621. Bendrix can expect to incur this amount even if zero
machine hours are used. The variable component is the 34.7 × MachHrs term. It implies that the
expected overhead increases by about $35 for each extra machine hour.
The second equation, on the other hand, breaks overhead down into a fixed component of
$75,606 and a variable component of about $655 per production run.
The difference between these two equations can be attributed to the fact that neither tells the
whole story. If the manager's goal is to split overhead into a fixed component and a variable
component, the variable component should include both of the measures of work activity
(and maybe others) to give a more complete explanation of overhead. We will explain how to
do this when this example is reanalyzed with multiple regression.
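To see how each equation splits overhead into fixed and variable parts, the sketch below evaluates both for a single month. The coefficients come from the two least squares lines above; the month's activity levels (1,500 machine hours and 35 production runs) are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def overhead_from_machine_hours(mach_hrs):
    """Return (fixed, variable) overhead from the MachHrs equation."""
    return 48621, 34.7 * mach_hrs


def overhead_from_production_runs(prod_runs):
    """Return (fixed, variable) overhead from the ProdRuns equation."""
    return 75606, 655.1 * prod_runs


# Hypothetical month, not taken from Overhead_Costs.xlsx.
mach_hrs, prod_runs = 1500, 35

fixed_m, variable_m = overhead_from_machine_hours(mach_hrs)
fixed_p, variable_p = overhead_from_production_runs(prod_runs)

print(f"MachHrs equation:  fixed {fixed_m:,} + variable {variable_m:,.1f} = {fixed_m + variable_m:,.1f}")
print(f"ProdRuns equation: fixed {fixed_p:,} + variable {variable_p:,.1f} = {fixed_p + variable_p:,.1f}")
```

The two equations can give similar totals for a typical month while disagreeing sharply about how much of that total is fixed, which illustrates why neither one tells the whole story.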
(d) Use the standard error of estimate to judge which of the two potential regression equations is
more useful for predicting the Overhead.
In general, the standard error of estimate indicates the level of accuracy of predictions made
from the regression equation. The smaller it is, the more accurate predictions tend to be.
We estimated two regression lines, one using MachHrs and one using ProdRuns. Their
standard errors are approximately $8,585 and $9,457. These imply that MachHrs is a slightly
better predictor of overhead. The predictions based on MachHrs will tend to be slightly more
accurate than those based on ProdRuns. Of course, predictions based on both variables together
should be even more accurate, as you will see when we discuss multiple
regression for this example.
(e) Interpret the coefficients of determination for the two regression lines.
The r² values using MachHrs and ProdRuns as single explanatory variables are 39.9% and 27.1%,
respectively. These provide one more piece of evidence that MachHrs is a slightly better predictor
of Overhead than ProdRuns. Of course, they also suggest that the percentage of variation of
Overhead explained could be increased by including both variables in a single equation.
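These percentages are consistent with the correlations reported in part (b), since in a regression with one explanatory variable r² is the square of the correlation. A quick check, using only the figures given in the text (the dictionary name is an assumption):

```python
# Correlations with Overhead reported in part (b).
correlations = {"MachHrs": 0.632, "ProdRuns": 0.521}

# Square each correlation and round to three decimals.
r_squared = {name: round(r ** 2, 3) for name, r in correlations.items()}

for name, value in r_squared.items():
    print(f"{name}: r^2 = {value:.3f} ({100 * value:.1f}% of the variation in Overhead)")
```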