
24

Multiple Regression

Simple Models
Errors in the Simple Regression Model
An Embarrassing Lurking Factor
Multiple Regression
Partial and Marginal Slopes
Path Diagrams
$R^2$ Increases with Added Variables
The Multiple Regression Model
Calibration Plot
The Residual Plot
Checking Normality
Inference in Multiple Regression
The F-Test in Multiple Regression
Steps in Building a Multiple Regression
Summary

7/27/07

24 Multiple Regression

Utilities that supply natural gas to residential customers have to anticipate how much fuel they will need to supply in the coming winter. Natural gas is difficult to store, so utilities contract with pipelines to deliver the gas as needed. The larger the supply that the utility locks in, the larger the cost for the contract. That's OK if the utility needs the fuel, but it's a waste if the contract reserves more gas than needed. On the other hand, if deliveries fall short, the utility can find itself in a tight spot when the winter turns cold. A shortage means cold houses or expensive purchases of natural gas on the spot market. Either way, the utility will have unhappy customers: they'll either be cold or shocked at surcharges on their gas bills. It makes sense, then, for the utility to anticipate how much gas its customers will need.

Let's focus on estimating the demand of a community of 100,000 residential customers in Michigan. According to the US Energy Information Administration, 62 million residential customers in the US consumed about 5 trillion cubic feet of gas in 2004. That works out to about 80 thousand cubic feet of natural gas per household. In industry parlance, that's 80 MCF per household. Should the utility lock in 80 MCF for every customer? Probably not. Does every residence use natural gas for heating, or only some of them? A home that uses gas for heat burns a lot more than one that uses gas only for cooking and heating water. And what about the weather? These 100,000 homes aren't just anywhere in the US; they're in Michigan, where it gets colder than many other parts of the country. You can be sure these homes need more heating than if they were in Florida.

Forecasters are calling for a typical winter. For the part of Michigan around Detroit, that means a winter with 6,500 heating degree days.¹ How much gas does the utility need for the winter? How much should they lock in with contracts? As we'll see, the answers to these questions don't have to be the same. To answer either, we had better look at data.

¹ Here's how to compute the heating degree days. For a day with a low temperature of 20 and a high of 60, the average temperature is (20+60)/2 = 40. Now subtract the average from 65. This day contributes 65 − 40 = 25 heating degree days to the year. If the average temperature is above 65, the day contributes nothing to the total. It's as if you only need heat if the average temperature is below 65.
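The heating degree day calculation can be written out as a short Python sketch. The function names and the three-day list are illustrative assumptions; only the 65-degree base and the 20/60 example day come from the text.

```python
def heating_degree_days(low, high, base=65.0):
    """Heating degree days contributed by one day, from its low and high temperatures."""
    average = (low + high) / 2
    # A day whose average temperature is at or above the base contributes nothing.
    return max(base - average, 0.0)


def total_hdd(daily_lows_highs):
    """Sum heating degree days over a sequence of (low, high) pairs."""
    return sum(heating_degree_days(low, high) for low, high in daily_lows_highs)


# The day described in the text: low of 20, high of 60.
example_day = heating_degree_days(20, 60)

# A made-up three-day stretch (hypothetical values, not data from the chapter).
three_days = total_hdd([(20, 60), (50, 70), (70, 90)])
```

Summing over every day of a heating season gives the annual total, which the chapter then expresses in thousands (MHDD).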


Simple Models
The Simple Regression Model describes the variation in the response with one explanatory variable. Randomly scattered errors contribute the rest of the variation. It's likely, however, that the variation in the response depends on many factors. Consider what influences a customer to purchase an item: its packaging, promotion, and price, as well as the customer's income, lifestyle, and mood. The list goes on and on.

Multiple regression amplifies the power of regression modeling by permitting us to consider several explanatory variables at once. It's not just for convenience, either. Multiple regression is more than the sum of several simple regression models because it accounts for relationships among the explanatory variables.

Let's start with a simple regression of the amount of natural gas used in homes on the number of heating degree days over the year, in thousands (MHDD). The following scatterplot shows the number of thousands of cubic feet (MCF) of natural gas used during a heating season versus heating degree days. The data are a sample of 512 homes from around the US that use natural gas. (Chapter 6 has a similar example.)
HDD: Heating degree days measure how cold it gets. If H is the daily high temperature and L the low temperature, then HDD for this day is HDD = 65 − (L+H)/2 so long as this number is positive. If the average temperature is above 65, this formula says you don't need any heating.

[Figure: scatterplot of Natural Gas (MCF) versus Heating DD (000).]

Figure 24-1. Homes in colder climates use more fuel to heat.

Orange? We have to first check conditions of the model before relying on these portions of the summary.

The following tables summarize the estimated SRM. We need to verify the conditions of the SRM before using portions colored orange.

$R^2$    0.488654
$s_e$    29.64422
n        512

Term     Estimate   Std Error   t Ratio   Prob>|t|
$b_0$    34.4272    2.9951      11.49     <.0001
$b_1$    14.5147    0.6575      22.08     <.0001

Table 24-1. Simple regression of gas use.

The fitted line estimates the annual consumption of natural gas for households that experience x thousand heating degree days to be

Estimated Natural Gas (MCF) = 34.4272 + 14.5147 x

The intercept $b_0$ implies that residences use about 34,000 cubic feet of natural gas, regardless of the weather. All of these households use gas for heating water and some use gas for cooking. To capture the effect of the weather, the slope $b_1$ indicates that, on average, consumption increases by about 14,500 cubic feet of gas for each additional MHDD.

A utility can use this equation to estimate the average gas consumption in residences. The National Oceanic and Atmospheric Administration estimates 6,500 HDD for the winter in Michigan (compared to 700 in Florida). Plugging this value into the least squares line for x, this equation predicts annual gas consumption to be

34.4272 + 14.5147 × 6.5 ≈ 128.8 MCF

Once we verify the conditions of the SRM, we can add a range to convey our uncertainty.
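As a quick check of this arithmetic, here is a minimal Python sketch that evaluates the fitted line at the Michigan and Florida values of HDD. The coefficients are the estimates from Table 24-1; the function and variable names are illustrative assumptions.

```python
B0 = 34.4272  # intercept from Table 24-1, in MCF
B1 = 14.5147  # slope from Table 24-1, in MCF per thousand heating degree days


def estimated_gas_mcf(mhdd):
    """Estimated annual gas use (MCF) for a home that experiences `mhdd` thousand HDD."""
    return B0 + B1 * mhdd


michigan = estimated_gas_mcf(6.5)  # 6,500 HDD, about 128.8 MCF
florida = estimated_gas_mcf(0.7)   # 700 HDD, for comparison
```

These are estimates of average consumption only; they carry no statement of uncertainty until the conditions of the SRM have been checked.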

Errors in the Simple Regression Model


The SRM says that the explanatory variable x affects the average value of the response y through a linear equation, written as

$$y = \mu_{y|x} + \varepsilon \quad \text{with} \quad \mu_{y|x} = \beta_0 + \beta_1 x$$

The model has two main parts.

(1) Linear pattern. The average value of y depends on x through the line $\beta_0 + \beta_1 x$. That should be the only way that x affects the distribution of y. For example, the value of x should not affect the variance of the $\varepsilon$'s (as seen in the previous chapter).

(2) Random errors. The errors should resemble a simple random sample from a normal distribution. They must be independent of each other with equal variance. Normality matters most for prediction intervals. Otherwise, with moderate sample sizes, we can rely on the CLT to justify confidence intervals.

The SRM is a strange model when you think about it. Only one variable systematically affects the response. That seems too simple. Think about what influences energy consumption in your home. The temperature matters, but climate is not the only thing. How about the size of the home? It takes more to heat a big house than a small one. Other things contribute as well: the number of windows, the amount of insulation, the type of construction, and so on. The people who live there also affect the energy consumption. How many live in the house? Do they leave windows open on a cold day? Where do they set the thermostat?

A model that omits these variables treats their effects as random errors. The errors in regression represent variables that influence y that are omitted from the model. The real equation for y looks more like

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \cdots$$

What is $\varepsilon$? The errors represent the accumulation of all of the other variables that affect the response that we've not accounted for.


Either we are unaware of some of these x's or, even if we are aware, we don't observe them. The SRM draws a boundary after $x_1$ and bundles the rest into the error,

$$y = \beta_0 + \beta_1 x_1 + \underbrace{\beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \cdots}_{\varepsilon} = \beta_0 + \beta_1 x_1 + \varepsilon$$

If the omitted variables have comparable effects on y, then the Central Limit Theorem tells us that the sum of their effects is roughly normally distributed. As a result, it's not too surprising that we often find normally distributed residuals. If we omit an important variable whose effect stands out from the others, however, $\varepsilon$ needn't be normal. That's another reason to watch for a lack of normality in residuals.

Tip: Deviations from normality can suggest an important omitted variable.

AYT: A simple regression that describes the prices of cars offered for sale at a large dealer regresses Price (y) on Engine Power (x). (a) Name other explanatory variables that affect the price of the car.² (b) What happens to the effects of these other explanatory variables if they are not used in the regression model?³

An Embarrassing Lurking Factor


The best way to identify omitted explanatory variables is to know the context of your model. The simple regression for gas use says that the only thing that systematically affects consumption is temperature. The size of the home doesn't matter, just the climate. That's an embarrassing omission. Unless these homes are the same size, size matters. We'll use the number of rooms to measure the size of the houses. (This is easier to collect in surveys than measuring the number of square feet.) This scatterplot graphs gas consumption versus the number of rooms.
[Figure: scatterplot of Natural Gas (MCF) versus Number of Rooms.]

Figure 24-2. Size is related to fuel use as well.

² Others include options (e.g., a sunroof or navigation system), size, and styling (e.g., convertible).
³ Variables that affect y that are not explicitly used in the model become part of the error term.


The slope of the least squares line in the figure is positive: larger homes use more gas.

$R^2$    0.2079
$s_e$    36.90

Term                      Estimate   Std Error   t-statistic   p-value
$b_0$                     15.8340    6.9409      2.28          0.0229
$b_1$, Number of Rooms    12.4070    1.0724      11.57         <.0001

Table 24-2. Simple regression of gas use on number of rooms.

This equation does not represent as much variation in gas use as the equation using HDD as the explanatory variable. The $R^2$ of this fit is 21% compared to 49% for HDD. Accordingly, $s_e$ is larger at 37 MCF, compared to 30 MCF for the fit using HDD. This regression is interesting, but not what we need. It's another simple view, only one that considers the effect of size rather than temperature on energy consumption. To combine these explanations for the variation in y, we need multiple regression.

Multiple Regression
Multiple regression moves the boundary between included and omitted explanatory variables to the right. We'll move it one step and add a second explanatory variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \underbrace{\beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \cdots}_{\text{error}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$


What do you think will happen? Let's try to anticipate how well the model will do overall. HDD explains 49% of the variation in gas use, and the number of rooms explains 21%. If the effects of temperature and size are unrelated, a model with both would explain 21 + 49 = 70% of the variation. In fact, a multiple regression model using these two explains quite a bit less, and we're going to have to figure out why. Here's the summary of the multiple regression. The format should look familiar.

$R^2$    0.5457
$s_e$    27.943
n        512

Term                                Estimate   Std Error   t-statistic   p-value
Intercept $b_0$                     -1.775     5.339       -0.33         0.7396
Heating Degree Days (000) $b_1$     12.78      0.657       19.45         <.0001
Number of Rooms $b_2$               6.882      0.861       7.99          <.0001

Table 24-3. Summary of the multiple regression.


We interpret $R^2$ and $s_e$ in Table 24-3 as in simple regression. $R^2$ = 0.5457 indicates that the equation of the model represents about 55% of the variation in gas usage. That's less than we expected, but more than either simple regression. The estimate $s_e$ = 27.943 MCF estimates the SD of the underlying model errors. Because multiple regression explains more variation, the residual SD is smaller than with simple regression. (The equation for $s_e$ is in the Formulas section at the end of the chapter.)

The rest of Table 24-3 describes the equation. The equation has an intercept and two slopes, taken from the column labeled Estimate:

Estimated Gas Use = -1.775 + 12.78 MHDD + 6.882 Number of Rooms

The slope for MHDD in this equation differs from the slope for MHDD in the simple regression (14.51). That's not a mistake: the slope of an explanatory variable in multiple regression estimates something different from the slope of the same variable in simple regression.

Partial and Marginal Slopes


The slope 14.51 of MHDD in the simple regression (Table 24-1) estimates the average difference in gas consumption between homes in different climates. Homes in a climate with 3,000 HDD use 14.51 more MCF of gas, on average, than homes in a climate with 2,000 HDD. Because it ignores other differences between homes, a slope in an SRM is called the marginal slope for y on x.

The slope 12.78 for MHDD in the multiple regression (Table 24-3) also estimates the difference in gas consumption between homes in different climates, but it limits the comparison to homes with the same number of rooms. Homes in a climate with 3,000 HDD use 12.78 more MCF of gas, on average, than homes with the same number of rooms in a climate with 2,000 HDD. Because multiple regression adjusts for other factors, the slope in a multiple regression is known as a partial slope.

To appreciate why these estimates are different, consider a specific question. Suppose a customer calls the utility with a question about the added costs for heating a one-room addition to her home. The marginal slope for the number of rooms estimates the difference in gas use between homes that differ in size by one room. The marginal slope 12.41 MCF/Room (Table 24-2) indicates that, on average, homes with one more room use 12,410 more cubic feet of gas. At $10 per MCF, an added room increases the annual heating bill by about $124. The partial slope 6.88 MCF/Room (Table 24-3) indicates that, on average, homes with another room use 6.88 more MCF of gas, costing $69 annually. Which slope provides a better answer: the marginal or partial slope for the number of rooms?

The reason for the difference between the estimated slopes is the association between the two explanatory variables. On average, homes with more rooms are in colder climates, as shown in the following plot. (The line is the least squares fit of MHDD on number of rooms.)

Figure 24-3. Homes with more rooms tend to be in colder climates.

Simple regression compares the average consumption of smaller to larger homes, ignoring that homes with more rooms tend to be in colder places. Multiple regression adjusts for the association between the explanatory variables; it compares the average consumption of smaller to larger homes in the same climate. That's why the partial slope is smaller. The marginal slope in the simple regression mixes the effect of size (number of rooms) with the effect of climate. Multiple regression separates them. Unless the homeowner who called the utility moved her home to a colder climate when she added the room, she should use the partial slope!

The correlation between MHDD and the number of rooms is r = 0.33. Correlation between explanatory variables is known as collinearity. Collinearity between the explanatory variables explains why $R^2$ does not go up as much as we expected. Evidently, there's a bit of overlap between the two explanatory variables. Some of the variation explained by HDD is also explained by the number of rooms. Similarly, some of the variation explained by the number of rooms is explained by HDD. If we add the $R^2$'s from the simple regressions, we double count the variation that is explained by both explanatory variables. Multiple regression counts it once and the resulting $R^2$ is smaller.

Collinearity: Correlation between the explanatory variables in a multiple regression.

Path Diagrams
Path diagrams offer another way to appreciate the differences between marginal and partial slopes. A path diagram shows a regression model as a collection of nodes and directed edges. Nodes represent variables and directed edges show slopes. A note of warning in advance: these pictures of models often suggest that we've uncovered the cause of the variation. That's not true; like simple regression, multiple regression models association.

Let's draw the path diagram of the multiple regression. Arrows lead from the explanatory variables to the response. The diagram also joins the explanatory variables to each other with a double-headed arrow that symbolizes the association between them.

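[Path diagram: 1000 Heating Degree Days → Consumption of natural gas (12.78 MCF/MHDD); Number of Rooms → Consumption of natural gas (6.882 MCF/Room); double-headed arrow between 1000 Heating Degree Days and Number of Rooms (0.2516 Rooms/MHDD and 0.4322 MHDD/Room).]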

Figure 24-4. Path diagram of the multiple regression.

It's important to keep units with the slopes, particularly for the double-headed arrow between the explanatory variables. The slopes for this edge come from two simple regressions: one with rooms as the response and the other with MHDD as the response (see Figure 24-3).

Estimated Number of Rooms = 5.2602 + 0.2516 MHDD
Estimated MHDD = 1.3776 + 0.4322 Number of Rooms

As an example, consider the difference in consumption between homes in climates that differ by 1,000 heating degree days. The arrow from HDD to the response indicates that a difference of 1 MHDD produces an average increase of

1 MHDD × 12.78 MCF/MHDD = 12.78 MCF

of natural gas. That's the direct effect of colder weather: the furnace runs more.

A colder climate has an indirect effect, too. Homes in colder climates tend to have more rooms, and larger homes require more heat. Following the path to the number of rooms, homes in the colder climate average

1 MHDD × 0.2516 Rooms/MHDD = 0.2516 Rooms

more than those in the warmer climate. This difference also increases gas use. Following the path from number of rooms to consumption, 0.2516 rooms converts into

0.2516 Rooms × 6.882 MCF/Room ≈ 1.73 MCF

of natural gas. If we add this indirect effect to the direct effect, homes in the colder climate use (on average)

12.78 MCF + 1.73 MCF = 14.51 MCF

more natural gas than those in the warmer climate.

If you've got that "déjà vu all over again" feeling, look back at the summary of the simple regression of gas use on MHDD in Table 24-1. The marginal slope is 14.51. That's exactly what we've calculated from the path diagram. Neat, but let's make sure we understand why that happens.

Simple regression answers a simple question: on average, how much more gas do homes in a colder climate use? The marginal slope shows that those in the colder climate use 14.51 more MCF of natural gas. The partial slope for HDD in the multiple regression answers a different question. It estimates the difference in gas use due to climate among homes of the same size, excluding the pathway through the number of rooms. Multiple regression separates the marginal slope into two parts: the direct effect (the partial slope) and the indirect effect (blue).

[Path diagram: Heating Degree Days → Use of natural gas for heating (12.78 MCF/MHDD), the direct effect; Heating Degree Days → Number of Rooms (0.2516 Rooms/MHDD) → Use of natural gas for heating (6.882 MCF/Room), the indirect effect.]

The sum of the two paths reproduces the simple regression slope,

12.78 MCF + 0.2516 × 6.882 MCF ≈ 14.51 MCF
direct effect + indirect effect = marginal effect
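The path arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using the slopes reported in Table 24-3 and in the two auxiliary regressions; the variable names are illustrative. The second calculation applies the same reasoning to the number of rooms, which the text does not work through but which can be compared with the marginal slope in Table 24-2.

```python
# Partial slopes from the multiple regression (Table 24-3).
mcf_per_mhdd = 12.78   # direct effect of climate
mcf_per_room = 6.882   # direct effect of size

# Slopes linking the two explanatory variables (the double-headed arrow).
rooms_per_mhdd = 0.2516
mhdd_per_room = 0.4322

# Climate: direct path plus the indirect path through the number of rooms.
indirect_via_rooms = rooms_per_mhdd * mcf_per_room
marginal_mhdd = mcf_per_mhdd + indirect_via_rooms    # close to 14.51 (Table 24-1)

# Size: direct path plus the indirect path through climate.
indirect_via_climate = mhdd_per_room * mcf_per_mhdd
marginal_rooms = mcf_per_room + indirect_via_climate  # close to 12.41 (Table 24-2)
```

Small differences in the last digit come from rounding in the reported slopes.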

Marginal = Partial: If there's no collinearity (uncorrelated explanatory variables), then the marginal and partial slopes agree.

Once you appreciate the role of indirect effects, you begin to think differently about the marginal slope in a simple regression. The marginal slope blends the direct effect of an explanatory variable with all of its indirect effects. It's fine for the marginal slope to add these effects. The problem is that we sometimes forget about indirect effects and interpret the marginal slope as though it represents the direct effect.

Path diagrams help you realize something else about multiple regression, too. When would the marginal and partial slope be the same? They match if there are no indirect effects. This happens when the explanatory variables are uncorrelated, breaking the pathway for the indirect effect.

AYT: A contractor rehabs suburban homes, replacing exterior windows and siding. He's kept track of material costs for these jobs (basically, the costs of replacement windows and new vinyl siding). The homes he's fixed vary in size. Usually repairs to larger homes require both more windows and more siding. He fits two regressions, a simple regression of cost on the number of windows and a multiple regression of cost on the number of windows and square feet of siding. Which should be larger: the marginal slope for the number of windows in the simple regression or the partial slope for the number of windows in the multiple regression?⁴

⁴ The marginal slope is larger. Bigger homes that require replacing more windows also require more siding; homes with more windows are bigger. Hence the marginal slope combines the cost of windows plus the cost of more siding. Multiple regression separates these costs. In the language of path diagrams, the marginal slope combines a positive direct effect with a positive indirect effect.
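The claim that marginal and partial slopes agree when the explanatory variables are uncorrelated can be checked numerically. The sketch below is an illustration on made-up data, not the chapter's gas data; the helper `ols` is a hypothetical pure-Python least squares solver written for this example.

```python
def ols(columns, y):
    """Least squares coefficients [b0, b1, ...] for y regressed on the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]


def simple_slope(x, y):
    """Slope of the simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


# Made-up data (hypothetical values chosen for illustration).
y = [3.1, 9.2, 8.8, 15.3, 2.7, 8.9, 9.4, 14.6]
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
x2_uncorrelated = [-1, 1, -1, 1, -1, 1, -1, 1]   # correlation with x1 is exactly 0
x2_correlated = [-1, -1, 1, 1, -1, 1, -1, 1]     # correlation with x1 is 0.5

marginal = simple_slope(x1, y)

# No collinearity: the partial slope of x1 equals its marginal slope.
_, partial_uncorrelated, _ = ols([x1, x2_uncorrelated], y)

# With collinearity: marginal = direct effect + indirect effect through x2.
_, partial_correlated, b2 = ols([x1, x2_correlated], y)
indirect = simple_slope(x1, x2_correlated) * b2
```

With the correlated version of the second variable, the partial slope moves away from the marginal slope, and adding back the indirect effect recovers it.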


$R^2$ Increases with Added Variables


Is the multiple regression better than the initial simple regression that considers only the role of climate? The overall summary statistics look better: $R^2$ is larger and $s_e$ is smaller. We need to be choosy, however, before we accept the addition of an explanatory variable. With many possible explanations for the variation in the response, we should only include those explanatory variables that add value.

How do you tell if an explanatory variable adds value? $R^2$ is not very useful unless it changes dramatically. $R^2$ increases every time you add an explanatory variable to a regression. Add a column of random numbers and $R^2$ goes up. Not by much, but it goes up. The residual standard deviation $s_e$ shares a touch of this perversion; it goes down too easily for its own good. For evaluating the changes brought by adding an explanatory variable, we need confidence intervals and tests. That means looking at the conditions for multiple regression.

Under the Hood: Why $R^2$ gets larger


How does software figure out the slopes in a multiple regression? We thought you'd never ask! The software does what it did before: minimize the sum of squared deviations, least squares. Only now it gets to use another slope to make the sum of squared residuals smaller. With one x, the software minimizes the sum of squares

$$\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1,i})^2$$

by inserting the least squares estimates for $b_0$ and $b_1$. Look at what happens when we add $x_2$. In a way, $x_2$ has been there all along but with its slope constrained to be zero. Simple regression constrains the slope of $x_2$ to zero:

$$\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1,i} - 0 \cdot x_{2,i})^2$$

When we add $x_2$ to the regression, the software is free to choose a slope for $x_2$. It gets to solve this problem:

$$\min_{b_0, b_1, b_2} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1,i} - b_2 x_{2,i})^2$$

Now that it can change $b_2$, the software can find a smaller residual sum of squares. That's why $R^2$ goes up. A multiple regression with two explanatory variables has more choices. This flexibility allows the fitting procedure to explain more variation and increase $R^2$.
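To see this in action, the following Python sketch fits a regression with and without an added column of random numbers and compares the two values of $R^2$. Everything here is simulated under stated assumptions: the data are generated from a made-up line with normal noise, the seed and sample size are arbitrary, and `ols` and `r_squared` are hypothetical helpers written for this example.

```python
import random


def ols(columns, y):
    """Least squares coefficients [b0, b1, ...] for y regressed on the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]


def r_squared(columns, y):
    """R-squared of the least squares fit of y on the given columns."""
    coefs = ols(columns, y)
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - residual_ss / total_ss


random.seed(24)  # arbitrary seed so the simulation is repeatable
n = 200
x1 = [random.uniform(0, 8) for _ in range(n)]
y = [34 + 14.5 * a + random.gauss(0, 30) for a in x1]  # simulated response
junk = [random.gauss(0, 1) for _ in range(n)]          # unrelated random column

r2_one = r_squared([x1], y)
r2_two = r_squared([x1, junk], y)  # never smaller than r2_one
```

The increase from `r2_one` to `r2_two` is typically tiny, which is why a small bump in $R^2$ says little about whether the added variable belongs in the model.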

The Multiple Regression Model


The Multiple Regression Model (MRM) resembles the SRM, only its equation has several explanatory variables rather than one. The equation for multiple regression with two explanatory variables describes how the conditional average of y given both $x_1$ and $x_2$ depends on the x's:

$$E(y \mid x_1, x_2) = \mu_{y|x_1,x_2} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

Given $x_1$ and $x_2$, y lies at its mean plus added random noise, the errors:

$$y = \mu_{y|x_1,x_2} + \varepsilon$$

According to this model, only the conditional means of y change with the explanatory variables $x_1$ and $x_2$. The rest of the assumptions describe the errors; these are the same three assumptions as in the SRM:

1. The errors are independent, with
2. Equal variance $\sigma_\varepsilon^2$, and ideally are
3. Normally distributed.

Ideally, the errors are an iid sample from a normal distribution. As in the SRM, the MRM requires nothing of the explanatory variables. Because we want to see how y varies with changes in the x's, all we need is variation in the x's. It would not make sense to use a constant as an explanatory variable.

The equation of the MRM embeds several assumptions. It implies that the average of y varies linearly with each explanatory variable, regardless of the other. The x's do not mediate each other's influence on y. Differences in y associated with differences in $x_1$ are the same regardless of the value of $x_2$ (and vice versa). That can be a hard assumption to swallow.

Here's an equation for the sales of a product marketed by a company:

Sales = $\beta_0$ + $\beta_1$ Advertising Spending + $\beta_2$ Price Difference + $\varepsilon$

The price difference is the difference between the list price of the company's product and the list price of its main rival. Let's examine this equation carefully. It implies that the impact on sales, on average, of spending more for advertising is limitless. The more the company advertises, the more it sells. Also, advertising has the same effect regardless of the difference in price. It does not matter which product costs more; advertising has the same effect. That may be the case, but there are situations in which the effect of advertising depends on the difference in price. For example, the effect of advertising might depend on whether the ad is promoting a difference in price.

We've seen remedies for some of these problems. For instance, a log-log scale captures diminishing returns, with log sales regressed on log advertising, but that does nothing for the second problem. Special explanatory variables known as interactions allow the equation to capture synergies between the explanatory variables, but we'll save interactions for Chapter 26. For now, we have to hope that the effect of one explanatory variable does not depend on the value of the other.

Since the only difference between the SRM and the MRM is the equation, it should not come as a surprise that the checklists of conditions match:

- Straight enough
- No embarrassing lurking variable
- Similar variances
- Nearly normal

The difference lies in the choice of plots used to verify these conditions. Simple regression is simple because there's one key plot: the scatterplot of y on x. If you cannot see a line, neither can the computer. Multiple regression offers more choices, but there is a natural sequence of plots that go with each numerical summary. You want to look at these before you dig into the output very far.

Calibration Plot
The summary of a regression usually begins with $R^2$ and $s_e$, and two plots belong with these. These two plots convert multiple regression into a special simple regression. Indeed, most plots make multiple regression look like simple regression in one way or another. This table repeats $R^2$ and $s_e$ for the two-predictor model for natural gas.

$R^2$    0.5457
$s_e$    27.97

Table 24-4. Overall summary of the two-predictor model.

The calibration plot summarizes the fit of a multiple regression, much as a scatterplot of y on x summarizes a simple regression. In particular, the calibration plot is a scatterplot of y versus the fitted value $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$. Rather than plot y on either $x_1$ or $x_2$, the calibration plot places the fitted values on the horizontal axis.
$R^2$: The square of the correlation between y and the estimate $\hat{y}$.

[Figure: scatterplot of Natural Gas (MCF) versus Estimated Natural Gas (MCF).]

Figure 24-5. Calibration plot for the two-explanatory variable MRM.

For the simple regression, the scatterplot of y on x shows $R^2$: it's the square of the correlation between y and x. Similarly, the calibration plot shows $R^2$ for the multiple regression. $R^2$ in multiple regression is again the square of a correlation, namely the square of the correlation between y and the fitted values $\hat{y}$. The tighter the data cluster along the diagonal in the calibration plot, the better the fit.
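The identity between $R^2$ and the squared correlation of y with the fitted values can be verified on any least squares fit that includes an intercept. The Python sketch below does so on a small made-up data set (hypothetical values, not the 512 homes in the chapter); `ols` and `correlation` are helpers written for this illustration.

```python
def ols(columns, y):
    """Least squares coefficients [b0, b1, ...] for y regressed on the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


# Made-up data: thousands of HDD, number of rooms, and gas use in MCF.
mhdd = [1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.3, 7.1]
rooms = [4, 6, 5, 7, 5, 8, 6, 9]
gas = [35, 62, 60, 85, 78, 110, 101, 135]

b0, b1, b2 = ols([mhdd, rooms], gas)
fitted = [b0 + b1 * h + b2 * r for h, r in zip(mhdd, rooms)]  # horizontal axis of the calibration plot

mean_gas = sum(gas) / len(gas)
residual_ss = sum((g - f) ** 2 for g, f in zip(gas, fitted))
total_ss = sum((g - mean_gas) ** 2 for g in gas)
r2 = 1 - residual_ss / total_ss

r2_from_correlation = correlation(gas, fitted) ** 2  # matches r2 up to rounding error
```

Plotting `gas` against `fitted` would give the calibration plot for this toy fit.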

The Residual Plot


The plot of residuals on x is very useful in simple regression because it zooms in on the deviations around the fitted line. The analogous plot is useful in multiple regression. All we do is shear the calibration plot, twisting it so that the regression line in the calibration plot becomes flat. In other words, we plot the residuals, $y - \hat{y}$, on the fitted values $\hat{y}$. This view of the fit shows $s_e$. If the residuals are nearly normal, then 95% of them lie within $2 s_e$ of zero. Here's the plot of residuals on fitted values for the multiple regression of gas use on HDD and rooms.
[Figure: scatterplot of Natural Gas (MCF) Residual versus Estimated Natural Gas (MCF).]

Figure 24-6. Scatterplot of residuals on fitted values.

You can guess that $s_e$ is about 30 MCF from this plot because all but about 10 cases (out of 512) lie within 60 MCF of the horizontal line at zero. (The actual value of $s_e$ is 27.97 MCF.) The residuals should suggest an angry swarm of bees with no real pattern. In this example, the residuals suggest a lack of constant variation (heteroscedasticity, Chapter 23). The residuals at the far left seem less variable than the rest, but the effect is not severe.

That's the most common use of this plot: checking the similar variances condition. Often, as we've seen in Chapter 23, data become more variable as they get larger. Since $\hat{y}$ tracks the size of the predictions, this plot is the natural place to check for changes in variation. If you see a pattern in this plot, either a trend in the residuals or changing variation, the model does not meet the conditions of the MRM. In this example, the residuals hover around zero with no trend, but there might be a slight change in the variation.

Checking Normality
The last condition to check is the nearly normal condition. We haven't discovered outliers or other problems in the other views, so chances are this one will be OK as well. Here's the normal quantile plot of the residuals.

[Figure: histogram of the residuals (Count) beside their Normal Quantile Plot.]

Figure 24-7. Normal quantile plot of residuals from the multiple regression.

It's a good thing that we checked. The residuals are slightly skewed. The residuals reach further out on the positive side (to +80) than on the negative side (to -60). The effect is mild, however, and almost consistent with normality. We'll take this as nearly normal, but be careful about predicting the gas consumption of individual homes. (The skewness and slight shift in variation in Figure 24-6 suggest you might have problems with prediction intervals.) For inferences about slopes, however, we're all set to go.
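To make the idea of a normal quantile plot concrete, here is a minimal sketch of how the plotted pairs could be computed by hand. The plotting positions (i + 0.5)/n and the small list of residuals are assumptions chosen for illustration; statistical software may use slightly different positions, and the values are not the gas data.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank.

    Assumption: plotting positions (i + 0.5) / n; software packages differ
    slightly in this choice.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i + 0.5) / n), e) for i, e in enumerate(ordered)]

# Hypothetical residuals with a longer right tail, loosely like Figure 24-7.
example_residuals = [-40.0, -25.0, -12.0, -5.0, 0.0, 4.0, 15.0, 33.0, 80.0]

for z, e in normal_quantile_pairs(example_residuals):
    print(f"normal quantile {z:6.2f}   residual {e:7.1f}")
```

If the residuals were nearly normal, the pairs would fall close to a straight line. A right tail that reaches further than the left tail, as in the hypothetical values above, bends the upper end of the pattern upward.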

Inference in Multiple Regression


It's time for confidence intervals. As usual, we start with the standard error. The layout of the estimates for multiple regression in Table 24-5 matches that for simple regression, only the table of estimates has one more row. Each row gives an estimate and its standard error, followed by a t-statistic and p-value.
Term                              Estimate   Std Error   t-statistic   p-value
Intercept, b0                       -1.775       5.339         -0.33    0.7396
Heating Degree Days (000), b1       12.78        0.657         19.45    <.0001
Number of Rooms, b2                  6.882       0.861          7.99    <.0001

Table 24-5. Summary of the multiple regression.

The procedure for building confidence intervals is the same as in simple regression. If we have a large sample (n > 30), the do-it-yourself 95% confidence intervals have the form

estimate ± 2 se(estimate)

This table compares confidence intervals for the partial slopes to those for the marginal slopes from the two simple regressions. Both partial slopes are smaller with about the same standard errors.

                        Marginal            Partial
Heating Degree Days     14.51 ± 2 × 0.66    12.78 ± 2 × 0.66
Number of Rooms         12.41 ± 2 × 1.07     6.88 ± 2 × 0.86

Table 24-6. Comparison of confidence intervals for marginal and partial slopes.
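The arithmetic behind Table 24-6 is simple enough to check directly. The sketch below is only an illustration of the estimate ± 2 se rule; the function and variable names are invented here, and the numbers are copied from the table.

```python
def approx_interval(estimate, std_error, multiplier=2.0):
    """Rough 95% confidence interval: estimate ± multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# (slope estimate, standard error) pairs copied from Table 24-6.
table_24_6 = {
    ("Heating Degree Days", "marginal"): (14.51, 0.66),
    ("Heating Degree Days", "partial"): (12.78, 0.66),
    ("Number of Rooms", "marginal"): (12.41, 1.07),
    ("Number of Rooms", "partial"): (6.88, 0.86),
}

intervals = {key: approx_interval(est, se) for key, (est, se) in table_24_6.items()}

for (variable, kind), (low, high) in intervals.items():
    print(f"{variable:20s} {kind:9s} {low:6.2f} to {high:6.2f}")
```

The multiplier of 2 is the large-sample shortcut used in the text; software would use the exact percentile of the t-model instead.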


We interpret confidence intervals in multiple regression like those in simple regression. For example, let's revisit the homeowner who's added a one-room addition to her home. On average, the partial slope for number of rooms indicates that she can expect her annual consumption of natural gas to increase by about $[6.88 - 2 \times 0.86,\; 6.88 + 2 \times 0.86]$ = 5.16 to 8.60 thousand cubic feet.

Tests in regression come in two forms: those for each explanatory variable and those for the model as a whole. These are equivalent in simple regression because there's only one explanatory variable. With several explanatory variables, we can look at them individually or collectively.

The two columns in Table 24-5 that follow the standard errors are derived from the estimates and standard errors. The t-statistic tests the null hypothesis that the intercept or a specific slope is zero. We'll write this generic null hypothesis as $H_0: \beta_j = 0$. This test works as in simple regression. Each t-statistic is the ratio of the estimate to its standard error. As before, a t-statistic counts the number of standard errors that separate the estimated slope from zero, the default hypothesized value:

$$t_j = \frac{b_j - 0}{se(b_j)}$$

Values of $t_j$ larger than 2 or less than -2 are far from zero. The p-values in the final column use Student's t-model (Chapter 17) to assign a probability to the t-statistic. You can estimate the p-value from the Empirical Rule. Roughly, if $|t_j| > 2$, then the p-value < 0.05.
Rejecting $H_0: \beta_j = 0$ means
1. Zero is not in the confidence interval.
2. The regression with $x_j$ added explains statistically significantly more variation.

The t-statistics and p-values in Table 24-5 indicate that both partial slopes in the multiple regression differ significantly from zero. Agreeing with this, neither confidence interval in Table 24-6 contains zero. For example, by rejecting $H_0: \beta_1 = 0$, we believe that homes in colder climates use more gas than homes with the same number of rooms in warmer climates.

The test of a slope in a multiple regression also has an incremental interpretation related to the predictive accuracy of the fitted model. If we reject $H_0: \beta_1 = 0$, a regression that includes HDD explains statistically significantly more variation in natural gas use than one without this explanatory variable. The addition of HDD significantly improves the fit of the model beyond what is achieved using the number of rooms alone. That is, by adding HDD to a model that contains the number of rooms, $R^2$ rises by a statistically significant amount. Similarly, by rejecting $H_0: \beta_2 = 0$, we see that adding the number of rooms to the simple regression that has only HDD improves the $R^2$ of the fit by a statistically significant amount. The order of the variables listed in the table doesn't matter; a t-statistic adjusts for all of the other explanatory variables.

Had the absolute size of either t-statistic been small ($|t_j| < 2$) or the p-value large (more than 0.05), the data could be a sample from a population in which the partial slope is zero, $\beta_j = 0$. To illustrate this situation, have a look at the t-statistic for the intercept in the multiple regression. The t-statistic for the test of $H_0: \beta_0 = 0$ is -0.33; the estimate lies 1/3 of a standard error below zero. The p-value indicates that 74% of random samples from a population with $\beta_0 = 0$ produce estimates this far (or farther) from zero. Because the p-value is larger than 0.05, we cannot reject $H_0$. What does that mean? It does not mean $\beta_0$ is zero; it only suggests that it might be zero, negative, or positive.
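The last two columns of Table 24-5 can be rebuilt from the first two. The sketch below is an illustration rather than what regression software does internally: it assumes the sample is large enough that a standard normal model can stand in for Student's t-model, so its p-values are approximations. The names are invented for this example.

```python
import math

def t_statistic(estimate, std_error, hypothesized=0.0):
    """Number of standard errors separating the estimate from the hypothesized value."""
    return (estimate - hypothesized) / std_error

def two_sided_p_value_normal(t):
    """Two-sided p-value under a standard normal model.

    Assumption: with hundreds of cases the normal model is a close
    stand-in for the Student's t-model that software would use.
    """
    return math.erfc(abs(t) / math.sqrt(2.0))

# (estimate, standard error) pairs copied from Table 24-5.
table_24_5 = {
    "Intercept": (-1.775, 5.339),
    "Heating Degree Days (000)": (12.78, 0.657),
    "Number of Rooms": (6.882, 0.861),
}

results = {}
for term, (estimate, std_error) in table_24_5.items():
    t = t_statistic(estimate, std_error)
    results[term] = (t, two_sided_p_value_normal(t))
    print(f"{term:26s} t = {t:6.2f}   approx p = {results[term][1]:.4f}")
```

For the intercept the approximate p-value comes out near 0.74, in line with the table, while both slopes sit so many standard errors from zero that their p-values are far below 0.0001.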

The F-Test in Multiple Regression


Multiple regression adds a test that we don't need in simple regression. It's called the F-test, and it usually appears in a portion of the output known as the Analysis of Variance, abbreviated ANOVA. We'll have more to say about the ANOVA table in Chapter 25. (The F-statistic is not needed in simple regression because the t-statistic serves the same purpose. In a simple regression the F-statistic is the square of the t-statistic for the one slope, $F = t^2$, and both produce the same p-value.)

The F-test, which is obtained using the F-statistic, measures the explanatory value of all of the explanatory variables, taken collectively rather than separately. t-statistics consider the explanatory variables one at a time; an F-statistic looks at them collectively. What null hypothesis is being tested? For the F-statistic, the null hypothesis states that your data is a sample from a population for which all the slopes are zero. In this case, the null hypothesis is $H_0: \beta_1 = \beta_2 = 0$. In other words, it tests the null hypothesis that the model explains nothing. Unless you can reject this one, the explanatory variables collectively explain nothing more than random variation.

The F-test in regression comes in handy because of the problem with $R^2$. Namely, it gets larger whenever you add an explanatory variable. In fact, if you add enough explanatory variables, you can make the $R^2$ as large as you want. You can think of the F-test as a test of the size of $R^2$. Suppose that a friend of yours who's taking Stat tells you that she has built a great regression model. Her regression has an $R^2$ of 98%. Before you get impressed, you ought to ask her a couple of questions: "How many explanatory variables are in your model?" and "How many observations do you have?" If her model obtains an $R^2$ of 98% using two explanatory variables and n = 1000, you should learn more about her model. If she has n = 50 cases and 45 explanatory variables, the model is not so impressive.
$R^2$ does not track the number of explanatory variables or the sample size. The F-statistic cares. $R^2$ is the ratio of how much variation resides in the fitted values of your model compared to the variation in y.

$$R^2 = \frac{\text{Variation in } \hat{y}}{\text{Variation in } y} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

t vs F: The t-stat tests the effect of one explanatory variable; the F-stat tests the combination of them all.

The more explanatory variables you have, the more variation gets tied up in the fitted values. As you add explanatory variables, the top of this ratio gets larger and the bottom stays the same. $R^2$ goes up. The F-statistic doesn't offer this free lunch; it charges the model for each explanatory variable. For a multiple regression with q explanatory variables, the F-statistic is

$$F = \frac{\text{Variation in } \hat{y} \text{ per explanatory variable}}{\text{Variation remaining per residual d.f.}} = \frac{R^2 / q}{(1 - R^2)/(n - q - 1)} = \frac{R^2}{1 - R^2} \times \frac{n - q - 1}{q}$$

If you have relatively few explanatory variables compared to the number of cases (q << n), then it's useful to approximate the F-statistic as

$$F \approx \frac{R^2}{1 - R^2} \times \frac{n}{q} = \frac{R^2}{1 - R^2} \times \frac{\text{number of cases}}{\text{number of explanatory variables}}$$

The F-statistic is roughly the ratio of the variation that you have explained relative to what is left over, multiplied by the number of cases per explanatory variable. For this model,

$$F \approx \frac{0.5457}{1 - 0.5457} \times \frac{509}{2} \approx 1.20 \times 254.5 = 305.4$$

What's a big value for the F-statistic? As a conservative rule of thumb, anything above 4 is statistically significant (depending on n and q, smaller values may be as well). Otherwise, check your output for a p-value. We soundly reject the null hypothesis in this example. There's no way this data is a sample from a population with both slopes equal to zero. You'd never get an $R^2$ of 55% with 2 predictors and 512 observations unless some slope differs from zero.

The F-test is a good place to start when you're working with multiple regression. It's a gatekeeper. Unless you reject its null hypothesis, you shouldn't go further. Unless its p-value is less than 0.05, you've got little reason to look at the individual slopes. We were hasty and jumped into the partial slopes in this example before checking the F-statistic. In practice, however, you shouldn't peek at the slopes until you've verified that the F-test rejects its $H_0$.
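Because the F-statistic depends only on $R^2$, n, and q, it is easy to compute from a regression summary. The following sketch is an illustration under that formula; the function names are made up here, and the inputs are the gas example and the two versions of the friend's model described above.

```python
def f_statistic(r2, n, q):
    """Overall F-statistic: R^2 / (1 - R^2) times (n - q - 1) / q."""
    return (r2 / (1.0 - r2)) * ((n - q - 1) / q)

def f_approximation(r2, n, q):
    """Shortcut for q << n: R^2 / (1 - R^2) times cases per explanatory variable."""
    return (r2 / (1.0 - r2)) * (n / q)

# Gas consumption example from this chapter: R^2 = 0.5457, n = 512, q = 2.
f_gas = f_statistic(0.5457, 512, 2)
f_gas_approx = f_approximation(0.5457, 512, 2)

# The friend's model with R^2 = 0.98 under the two scenarios in the text.
f_friend_large_sample = f_statistic(0.98, 1000, 2)
f_friend_many_predictors = f_statistic(0.98, 50, 45)

print(f"gas example:          F = {f_gas:.1f} (shortcut {f_gas_approx:.1f})")
print(f"friend, n=1000, q=2:  F = {f_friend_large_sample:.1f}")
print(f"friend, n=50, q=45:   F = {f_friend_many_predictors:.2f}")
```

The same $R^2$ of 98% produces an enormous F-statistic with 1000 cases and 2 explanatory variables, but one only slightly above the rule-of-thumb threshold of 4 with 50 cases and 45 explanatory variables.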

AYT

Our contractor (previous AYT) got excited about regression, and he particularly liked the way $R^2$ went up with each added variable. By the time he was done, $R^2 = 0.892$. He has data on costs at n = 49 homes and he used q = 26 explanatory variables. Are you impressed by his model?⁵

⁵ The overall F-statistic is $F = \frac{R^2}{1 - R^2} \times \frac{n - 1 - q}{q} = \frac{0.892}{1 - 0.892} \times \frac{49 - 1 - 26}{26} \approx 7$. That's statistically significant. His model does explain more than random variation, but he's going to have a hard time sorting out what those slopes mean.


Steps in Building a Multiple Regression


Let's summarize the steps that we've considered and the order in which we've made them. As in simple regression, it pays to start with the big picture.

1) What problem are you trying to solve? Do these data help you? Until you know enough about the data and problem to answer these two questions, there's no need to fit a complicated model.
2) Check the scatterplots of the response versus the explanatory variables and also those that show relationships among the explanatory variables. Make sure that you understand the measurement units of the variables and identify any outliers.
3) If these scatterplots of y on $x_j$ appear straight enough, fit the multiple regression. Otherwise, find a transformation to straighten out a relationship that bends.
4) Find the residuals and fitted values from your regression.
5) Make scatterplots that show the overall model (y on $\hat{y}$ and the residuals versus $\hat{y}$). The residual plot should look simple. The residual plot is the best place to identify changing variances.
6) Check whether the residuals are nearly normal. If not, be very cautious about using prediction intervals.
7) Check the F-statistic to see whether you can reject the null model and conclude that some slope differs from zero. If not, go no further with tests.
8) Test and interpret individual partial slopes.
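The numerical part of these steps (fitting, residuals and fitted values, $s_e$, $R^2$, and the F-statistic) can be sketched in plain Python. This is a teaching sketch under loud assumptions: the data are simulated with invented coefficients that only loosely resemble the gas example, the fit solves the normal equations by Gaussian elimination, and real work would use statistical software along with the plots described above.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_multiple_regression(columns, y):
    """Least squares fit of y on an intercept plus the explanatory columns."""
    n, q = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = q + 1
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(row[j] * row[l] for row in design) for l in range(k)] for j in range(k)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(coefs, row)) for row in design]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    y_bar = sum(y) / n
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum(e ** 2 for e in residuals)
    r2 = 1.0 - ss_resid / ss_total
    s_e = (ss_resid / (n - 1 - q)) ** 0.5  # divide by n minus number of estimates
    f_stat = (r2 / (1.0 - r2)) * ((n - 1 - q) / q)
    return {"coefs": coefs, "fitted": fitted, "residuals": residuals,
            "r2": r2, "s_e": s_e, "f": f_stat}

# Simulated data (hypothetical): coefficients and noise level are invented.
rng = random.Random(24)
hdd = [rng.uniform(1.0, 8.0) for _ in range(200)]
rooms = [rng.uniform(3.0, 10.0) for _ in range(200)]
gas = [-2.0 + 13.0 * h + 7.0 * r + rng.gauss(0.0, 28.0) for h, r in zip(hdd, rooms)]

fit = fit_multiple_regression([hdd, rooms], gas)
share_within_2se = sum(abs(e) <= 2 * fit["s_e"] for e in fit["residuals"]) / len(gas)

print("b0, b1, b2:", [round(b, 2) for b in fit["coefs"]])
print("R^2:", round(fit["r2"], 3), " s_e:", round(fit["s_e"], 2), " F:", round(fit["f"], 1))
print("share of residuals within 2 s_e:", round(share_within_2se, 3))
```

The fitted values and residuals returned here are what you would plot in steps 5 and 6, and the F-statistic is the gatekeeper of step 7. The code does not replace looking at the plots; it only produces the quantities that the plots and tests rely on.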

4M Subprime Mortgages

Subprime mortgages dominated business news in 2007. A subprime mortgage is a loan made to a more risky borrower than most. As this example shows, there's a reason that banks and hedge funds plunged into the risky housing market: these loans earn more interest so long as the borrower can keep paying. Defaults on such mortgages brought down several hedge funds that year. For this analysis, we've made you an analyst at a creditor who's considering moving into this market. The two explanatory variables in this analysis are common in this domain. The loan-to-value ratio (LTV) captures the exposure of the lender to defaults. For example, if LTV = 0.80 (80%), then the mortgage covers 80% of the value of the property. The FICO score (named for its owner, the Fair-Isaac Company) is the most common commercial measure of the creditworthiness of a borrower.

Motivation. State the business decision.

As a mortgage lender, my company would like to know what might be gained by moving into the subprime market. In particular, we'd like to know which characteristics of the borrower and loan affect the amount of interest we can earn on loans in this category.

Method. Plan your analysis. Identify the predictor and the response. Relate regression to the business decision.

I'll use a multiple regression. I have two explanatory variables: the credit rating score of the borrower (FICO) and the ratio of the value of the loan to the value of the property (LTV). The response is the annual percentage rate of interest earned by the loan (APR). The partial slope for FICO describes, for a given LTV, how much poor credit costs the borrower. The partial slope for LTV controls for the exposure we're taking on the loan: the higher the LTV, the more risk we face if the borrower defaults.

Describe the sample.

We obtained data on 372 mortgages from a credit bureau that monitors the subprime market. These loans are an SRS of subprime mortgages within the geographic area where we operate.

Use a plot to make sure correlations make sense. We've omitted them here to save space, but look before you summarize. For example, here's APR versus LTV.

Scatterplots of APR on both explanatory variables seem reasonable, and FICO and LTV are not dramatically correlated with each other. The relationships are linear, so I'll summarize them with a table of correlations.

          APR        LTV
LTV     -0.4265
FICO    -0.6696     0.4853

Verify the big-picture condition.

Straight-enough. Seems OK from scatterplots of APR versus LTV and FICO. There's no evident bending, the plots indicate moderate dependence, and there are no big outliers.

No embarrassing lurking factor. I can imagine a few other factors that may be omitted, such as more precise geographic information. Other aspects of the borrower (age, race) had better not matter or something illegal is going on.

Mechanics. Check the additional conditions on the errors by examining the residuals.

Similar variances. The plot of residuals on fitted values shows consistent variation over the range of fitted values. (There is one outlier and some skewness in the residuals, but those features are more visible in the quantile plot.)

Nearly normal. The histogram and normal quantile plot confirm the skewness of the residuals. The regression fit underestimates APR by up to 7%, but rarely overpredicts by more than 2%. Since I'm not building a prediction interval for individual loans, I'll rely on the CLT to produce normal sampling distributions for my estimates. Here's the summary of my fitted model.

$R^2$    0.4619
$s_e$    1.242

$$F = \frac{R^2}{1 - R^2} \times \frac{n - 1 - 2}{2} = \frac{0.4619}{1 - 0.4619} \times \frac{372 - 3}{2} \approx 158$$

It explains about half of the variation ($R^2 = 0.46$) in the interest rates, which is highly statistically significant by the F-test. I can clearly reject $H_0$ that both slopes are zero. This model explains real variation in APR. The SD of the residuals is $s_e$ = 1.24%.

Term     Est        SE        t Stat    p-value
b0       23.691     0.650      36.46    <.0001
LTV      -1.577     0.518      -3.04    0.003
FICO     -0.0186    0.0013    -13.85    <.0001

If there are no severe violations of the conditions, summarize the overall fit of the model. Describe the estimated equation. Build confidence intervals for the relevant parameters. With all of these other details, don't forget to round to the relevant precision.

The fitted equation is

$$\text{Estimated APR} \approx 23.7 - 1.6\,\text{LTV} - 0.02\,\text{FICO}$$

The effect of LTV is poorly determined, with a wide confidence interval of -1.6 ± 2(0.5). The effect of the borrower's credit rating is much stronger, with a larger t-statistic and tighter confidence interval -0.019 ± 2(0.001). I'll round to 1 digit for LTV and show 4 digits for the FICO score.

Message. This part comes from the overall model. Interpret slopes as partial slopes. Convert the slopes into meaningful comparisons: 200 × (-0.0186 - 2 × 0.0013) to 200 × (-0.0186 + 2 × 0.0013). Note important caveats that are relevant to the question at hand, such as the skewness in this example.

Based on my analysis of the supplied sample of 372 mortgages, it is clear that characteristics of both the borrower (FICO score) and loan (loan-to-value ratio) affect interest rates for loans in the subprime market. These two factors alone describe about half of the variation in rates. Among mortgages that cover a fixed percentage of the value of the property (e.g., LTV = 0.8), risky borrowers with FICO score 500 pay on average between 3.2% and 4.24% more than borrowers with FICO score 700.

These are a small sample of only 372 of the thousands of mortgages in our area. The data also suggest that some loans have much higher interest rates than what we would expect from this basic analysis. Evidently, some borrowers are more risky than the FICO score indicates.
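The conversion of the FICO slope into a comparison between two borrowers is a short calculation. The sketch below reproduces it using the estimate and standard error from the table of estimates; the variable names are invented for illustration, and the interval uses the same rough ± 2 se rule as the rest of the chapter.

```python
# Partial slope for FICO and its standard error, from the fitted model.
fico_slope = -0.0186
fico_se = 0.0013

# Rough 95% confidence interval for the partial slope.
slope_low = fico_slope - 2 * fico_se
slope_high = fico_slope + 2 * fico_se

# Compare a borrower with FICO 500 to one with FICO 700 at the same LTV.
fico_difference = 500 - 700  # the riskier borrower scores 200 points lower

# Multiplying by a negative difference swaps the ends of the interval.
extra_apr = sorted([slope_low * fico_difference, slope_high * fico_difference])

print(f"Extra APR for FICO 500 vs 700: {extra_apr[0]:.2f}% to {extra_apr[1]:.2f}%")
```

Because the comparison holds LTV at a common value, the interval describes differences among loans with the same loan-to-value ratio, which is how a partial slope should be read.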


Summary
The Multiple Regression Model (MRM) expands the Simple Regression Model by incorporating other explanatory variables in its equation. The slopes in the equation of an MRM are partial slopes that typically differ from the marginal slopes in a simple regression. Collinearity is the correlation between predictors in a multiple regression. A path diagram is a useful figure for distinguishing the direct and indirect effects of explanatory variables. A calibration plot of y on $\hat{y}$ shows the overall fit of the model, visualizing $R^2$ of the model. The plot of residuals on $\hat{y}$ allows a check for similar variances. The F-statistic measures the overall significance of the fitted model, and individual t-statistics for each partial slope test the incremental improvement in $R^2$ obtained by adding that variable to a model containing the other explanatory variables.

Index
calibration plot
collinearity
F-statistic
F-test
slope
  marginal
  partial

Formulas
In each formula, q denotes the number of explanatory variables. For the examples in this chapter, q = 2.

F-statistic

$$F = \frac{\text{Variation in } \hat{y} \text{ per predictor}}{\text{Variation remaining per residual d.f.}} = \frac{R^2 / q}{(1 - R^2)/(n - q - 1)} = \frac{R^2}{1 - R^2} \times \frac{n - q - 1}{q}$$

$s_e$

Divide the sum of squared residuals by n minus the number of estimates in the equation. For a multiple regression with q = 2 explanatory variables (and estimates $b_0$, $b_1$, and $b_2$),

$$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 1 - q} = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i,1} - b_2 x_{i,2})^2}{n - 1 - 2}$$

Best Practices
Know the context of your model. It's important in simple regression, but even more important in multiple regression. How are you supposed to guess what factors are missing from the model unless you know something about the problem and the data?

Examine plots of the overall model and coefficients before interpreting the output. You did it in simple regression, and you need to do it even more in multiple regression. It can be really, really tempting to dive right into the output rather than hold off to look at the plots, but you'll find that you make better choices by being more patient.

Check the overall F-statistic before digging into the t-statistics. If you look at enough t-statistics, you'll eventually find explanatory variables that are statistically significant. Statistics rewards persistence. If you check the F-statistic first, you'll avoid the worst of these problems.

Distinguish marginal from partial slopes. A marginal slope combines the direct and indirect effects. A partial slope avoids the indirect effects of other variables in the model. Some would say that the partial slope holds the other variables fixed, but that's too far from the truth. It is true in a certain mathematical sense, but we didn't hold anything fixed in our example; we just compared energy consumption among homes of differing size in different climates.

Let your software compute prediction intervals in multiple regression. Extrapolation is harder to recognize in multiple regression. For instance, suppose we were to use our multiple regression to estimate gas consumption for 4-room homes in climates with 7,500 heating DD. That's not an outlier on either variable alone, but it's an outlier when we combine the two! In general, prediction intervals have the same form in multiple regression as in simple regression, namely predicted value ± 2 $s_e$. This only applies when not extrapolating. If you do extrapolate, you'd better let the software do the calculations. We always do in multiple regression.
[Figure: scatterplot of Heating DD (000) against Number of Rooms.]

Figure 24-8. Outliers are more subtle when the x's are correlated.

Pitfalls
Become impatient. Multiple regression takes time not only to learn, but also to do. If you get hasty and skip the plots, you may fool yourself into thinking you've figured it all out, only to discover later that it was just an outlier.

Think that you have all of the important variables. Sure, we added a second variable to our model for energy consumption and it made a lot of sense. That does not mean that we've gotten them all, however. In most applications, it's virtually impossible to know whether you've found all of the relevant explanatory variables.

Think that you've found causal effects. Unless you got your data by running an experiment (and this does happen sometimes), you cannot get causation from a regression model, no matter how many explanatory variables you have in the model. Just as we did not hold fixed any variable, we probably did not change any of the variables either. We've just compared averages.

Think that an insignificant t-statistic implies an explanatory variable has no effect. All it means if we cannot reject $H_0: \beta_1 = 0$ is that this slope might be zero. It might not. The confidence interval tells you a bit more. If the confidence interval includes zero, it's telling us that the partial slope might be positive or might be negative. Just because we don't know the direction (or sign) of the slope doesn't mean it's zero.

About the Data


The data on household energy consumption are from the Residential Energy Consumption Survey conducted by the US Department of Energy. The study of subprime mortgages is based on an analysis of these loans in a working paper, "Mortgage Brokers and the Subprime Mortgage Market," produced by A. El Anshasy, G. Elliehausen, and Y. Shimazaki of George Washington University and The Credit Research Center of Georgetown University.

Software Hints
The software commands for building a multiple regression are essentially those used for building a model with one explanatory variable. All you need to do is select several columns as explanatory variables rather than just one.

Excel

To fit a multiple regression, follow the menu commands Tools > Data Analysis > Regression. (If you don't see this option in your Tools menu, you will need to add these commands. See the Software Hints in Chapter 19.) Selecting several columns as X variables produces a multiple regression analysis.

Minitab

The menu sequence Stat > Regression > Regression constructs a multiple regression if several variables are chosen as explanatory variables.

JMP

The menu sequence Analyze > Fit Model constructs a multiple regression if two or more variables are entered as explanatory variables. Click the Run Model button to obtain a summary of the least squares regression. The summary window combines the now familiar numerical summary statistics as well as several plots.
