
Chapter 22

Multiple linear regression


Contents
22.1 Introduction  1431
22.1.1 Data format and missing values  1431
22.1.2 The formal model  1432
22.1.3 Assumptions  1433
22.1.4 Obtaining Estimates  1435
22.1.5 Predictions  1436
22.1.6 Example: blood pressure  1437
22.2 Regression problems and diagnostics  1450
22.2.1 Introduction  1450
22.2.2 Preliminary characteristics  1450
22.2.3 Residual plots  1451
22.2.4 Actual vs. Predicted Plot  1453
22.2.5 Detecting influential observations  1453
22.2.6 Leverage plots  1454
22.2.7 Collinearity  1462
22.3 Polynomial, product, and interaction terms  1464
22.3.1 Introduction  1464
22.3.2 Example: Tomato growth as a function of water  1465
22.3.3 Polynomial models with several variables  1484
22.3.4 Cross-product and interaction terms  1485
22.4 The general linear test  1486
22.4.1 Introduction  1486
22.4.2 Example: Predicting body fat from measurements  1487
22.4.3 Summary  1494
22.5 Indicator variables  1494



22.5.1 Introduction  1494
22.5.2 Defining indicator variables  1495
22.5.3 The ANCOVA model  1496
22.5.4 Assumptions  1500
22.5.5 Comparing individual regression lines  1500
22.5.6 Example: Degradation of dioxin  1504
22.5.7 Example: More refined analysis of stream-slope example  1520
22.6 Example: Predicting PM10 levels  1528
22.7 Variable selection methods  1543
22.7.1 Introduction  1543
22.7.2 Maximum model  1544
22.7.3 Selecting a model criterion  1545
22.7.4 Which subsets should be examined  1547
22.7.5 Goodness-of-fit  1548
22.7.6 Example: Calories of candy bars  1549
22.7.7 Example: Fitness dataset  1560
22.7.8 Example: Predicting zooplankton biomass  1560

22.1

Introduction

In previous chapters, the relationship between a single continuous response variable (Y) and a single continuous explanatory variable (X) was explored using simple linear regression. In this chapter, this will be generalized to the case of more than one explanatory (X) variable.1 There are many good books covering this topic - refer to the list in previous chapters. Fortunately, many of the techniques learned in the previous chapter on simple linear regression carry over directly to the more general multiple regression. There are a few subtle differences in interpretation, and additional problems (such as variable selection) must be solved. It turns out that multiple regression methods are very general methods covering a wide range of statistical problems under the rubric of general linear models. Surprisingly, multiple regression is a general solution for two-sample t-tests, for ANOVA models, for simple linear regression models, etc. The exact theory is beyond the scope of these notes, but intuitive explanations will be provided as needed.

22.1.1

Data format and missing values

The data are collected and stored in a tabular format with rows representing observations, and columns representing different variables. One of the variables will be the response (Y) variable; there can be several predictor (X) variables. Virtually all computer packages require variables to be stored in columns and observations stored in rows.

The response variable (Y) must be continuous. It is NOT appropriate to do multiple regression when the Y variable represents categories; the appropriate methodology in this case is logistic regression. If the Y variable represents counts, a technique known as Poisson regression may be more appropriate; consult the chapter on generalized linear models for more details. Finally, in some cases, the value of Y may be censored, i.e. the exact value is not known, but it is known to be below certain threshold values (e.g. above or below detection limits). The analysis of such data is beyond the scope of these notes; consult the chapter on Tobit analysis for details.

Surprisingly, there is much more flexibility in the type of the X variables. They may be continuous as seen previously in simple linear regression, or they may be dichotomous variables taking only the values of 0 or 1 (known as indicator variables).2 These indicator variables are used to represent different groups (e.g. male and female) in the data.

The dataset is assumed to be complete, with NO missing values in any of the X variables. If an observation (row) has some missing X values, most computer packages practice what is known as case-wise deletion, i.e. the entire observation will be dropped from the analysis. Consequently, it is always important to check the computer output to see exactly how many observations have been used in the analysis. Missing Y values also imply that the observation (row) will be deleted from the analysis; however, if the set of X variables is complete, it is still possible to obtain predictions of Y for the observed set of X values.

As in previous chapters, missing data should be examined to see if it is missing completely at random (MCAR), in which case there is usually no problem in the analysis other than reduced sample size; missing at random (MAR), which is again handled relatively easily; or informative missing (IM), which poses serious problems in the analysis. Seek help for the latter case.

Footnote 1: It is possible to also have more than one Y variable; this is known as multivariate multiple regression but is not covered in this chapter.

22.1.2

The formal model

The statistical model for multiple regression is an extension of that for simple linear regression. The response variable, denoted by Y, is measured along with a set of predictor variables, denoted by X1, X2, ..., Xp, where p is the number of predictor variables. The formal statistical model is:

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip + εi

where the unknown parameters are the set of β's, and the deviation between the observed value of Y and the predicted value from the regression equation, εi, is distributed as a Normal distribution with a mean of 0 and a variance of σ².

Footnote 2: In actual fact, any binary set of values may be used, but traditional usage is to use 0 and 1.


This is often written using a shorthand notation in many statistical packages as:

Y = X1 X2 ... Xp

where the intercept (β0) and the residual variation (ε) are implicit. This can also be written using matrices as

Y = Xβ + ε

where Y is an n x 1 column vector, X is an n x (p + 1) matrix [don't forget the intercept column] of the predictors, β is a (p + 1) x 1 column vector (the intercept β0, plus the p slopes β1, ..., βp), and ε is an n x 1 vector of residuals that has a multivariate normal distribution with a mean of 0 and a covariance matrix of σ²I, where I is the identity matrix.

Note that this format for multiple regression is very flexible. By appropriate definition of the X variables, many different problems can be cast into a multiple-regression framework. In future courses you will see that ANOVA (a technique to compare means among multiple groups) is actually nothing but regression in disguise!

22.1.3

Assumptions

Not surprisingly, the assumptions for a multiple regression analysis are very similar to those required for a simple linear regression.

Linearity Because of the multiple X variables, the assumption of linearity is not as straightforward as for simple linear regression. Multiple regression analysis assumes that the MARGINAL relationship between Y and each X is linear. This means that if all other X variables are held constant, then changes in the particular X variable lead to a linear change in the Y variable. Because this is a MARGINAL relationship, simple plots of Y vs. each X variable may not be linear. This is because the simple pairwise plots can't hold the other variables fixed. To assess this relationship, residuals from the fit should be plotted against each X variable in turn. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the marginal relationship between Y and that particular X is not linear. Alternatively, fit a model that includes both X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit (what JMP calls the Lack of Fit test) can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.


Correct sampling scheme The Y must be a random sample from the population of Y values for every set of X values in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given set of X, the Y values from the population must be a simple random sample. This latitude gives considerable freedom in selecting points to investigate the relationship between Y and X. This will be discussed more in class.

No outliers or influential points All the points must belong to the relationship; there should be no unusual points. The plot of the residuals against the row number or against the predicted value should be investigated to see if there are unusual points. The marginal scatter plot of the residuals from the fit vs. each X should be examined. As well, leverage plots (Section 22.2.6) are useful for detecting influential points. Outliers can have a dramatic effect on the fitted line.

Equal variation along the line The variability about the regression plane must be similar for all sets of X, i.e. the scatter of the points above and below the fitted surface should be roughly constant over the entire line. This is assessed by looking at the plots of the residuals against each X variable to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.

Independence Each value of Y is independent of any other value of Y. The most common cases where this fails are time series data. This assumption can be assessed by again looking at residual plots against time or other variables.


Normality of errors The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed. This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power of detecting non-normality, and for large sample sizes it is not that important.

X variables measured without error It sometimes turns out that the X variables are not known precisely. For example, if you wish to investigate the relationship of illness to second-hand cigarette smoke, it is surprisingly difficult to get an estimate of the dose of cigarettes that a worker has been exposed to. This general problem is called the "error in variables" problem and has a long history in statistics. A detailed discussion of this issue is beyond the scope of these notes. The uncertainty in each X variable should be assessed.

22.1.4

Obtaining Estimates

The same principle of least squares as in simple linear regression is used to obtain estimates. In general, the sum of squared deviations between the predicted and observed values is computed, and the regression surface that minimizes this value is the final relationship. The estimated intercept and slopes can be compactly expressed using matrix notation as b = (X'X)⁻¹X'Y, where b is the vector of estimated coefficients, but details are beyond the scope of these notes. Hand formulae are all but impossible except for trivially small examples - let the computer do the work. Of course this implies that the scientist has the responsibility to ensure that the brain is engaged before putting the package in gear!

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 se. Most packages will compute the 95% confidence intervals for the slopes as well.
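As a rough cross-check of the formula above, the short Python sketch below (not part of the original JMP-based notes; the tiny dataset and variable names are invented for illustration) builds a design matrix with an intercept column, computes b = (X'X)⁻¹X'Y directly, and forms the approximate 95% confidence intervals as estimate ± 2 se.

```python
import numpy as np

# Invented data: n = 6 observations, p = 2 predictors.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.5, 4.0, 3.5, 5.0, 6.5])
Y  = np.array([3.1, 4.0, 6.2, 6.8, 8.9, 10.1])

# Design matrix with the intercept column (don't forget it!).
X = np.column_stack([np.ones_like(X1), X1, X2])   # n x (p + 1)

# b = (X'X)^{-1} X'Y -- solve() is used instead of forming an explicit inverse.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual variance (MSE) and standard errors of the estimates.
resid = Y - X @ b
n, k = X.shape
mse = resid @ resid / (n - k)
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

print("estimates:", b)
print("approx 95% CIs:", np.column_stack([b - 2 * se, b + 2 * se]))
```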


Once the fit has been obtained, the fit of the model can be assessed in various ways as outlined below. The overall fit of the model is assessed using a Whole Model Test that is traditionally placed in an ANOVA table. This test examines if there is at least one X variable that seems to be marginally related to the Y values. Usually, it is of little interest. The individual marginal contributions of each X variable can be assessed directly either from the reported estimates and standard errors or from an Effect Test; these are exactly equivalent.

Formal tests of hypotheses about the marginal contribution of each variable can also be done. Usually, these are only done on the slope parameters as these are typically of most interest. The null hypothesis is that the population marginal slope of a particular X variable is 0, i.e. there is no marginal relationship between Y and that particular X. More formally, the null hypothesis for the Xi variable is:

H: βi = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic. The alternate hypothesis is typically chosen as:

A: βi ≠ 0

although one-sided tests looking for either a positive or negative slope are possible. The test statistic is found as

T = (bi - 0) / se(bi)

and is compared to a t-distribution with appropriate degrees of freedom to obtain the p-value. This is usually automatically done by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true. It is also possible to obtain tests for sets of predictors (e.g. can several X variables be simultaneously dropped from the model?) as will be seen later in the notes. Finally, if there are a large number of X variables, is there an objective way to decide which subset of the X's are useful in predicting Y? Again, this is deferred until later in this chapter.

22.1.5

Predictions

Once the best-fitting line is found it can be used to make predictions for new sets of X. There are two types of predictions that are commonly made. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.


First, the experimenter may be interested in predicting a SINGLE future individual value for a particular set of X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular set of X.3 The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables. Both of the above intervals should be distinguished from the confidence intervals for the slopes.

In both cases, the estimate is found in the same manner: substitute the new set of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new dummy observation in the dataset with the value of Y missing, but the values of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X values allow the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty. In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X. In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line. Many textbooks have the formulae for the se for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
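The two intervals are easy to obtain in most software. The sketch below is an illustration only (it is not part of the original notes and uses invented data and column names); in statsmodels, summary_frame() returns both the confidence interval for the mean response (mean_ci_lower/upper) and the wider prediction interval for a single future response (obs_ci_lower/upper) at the same set of X values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented example data.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": np.arange(20.0), "x2": rng.normal(size=20)})
df["y"] = 2 + 0.5 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=1.0, size=20)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# New X values at which to predict (Y unknown).
new = pd.DataFrame({"x1": [5.0, 15.0], "x2": [0.0, 1.0]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower/upper: confidence interval for the MEAN response at these X.
# obs_ci_lower/upper:  prediction interval for a SINGLE future response; always wider.
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```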

22.1.6

Example: blood pressure

Blood pressure tends to increase with age, body mass, and potentially stress. To investigate the relationship of blood pressure to these variables, a sample of men in a large corporation was selected. For each subject, their age (years), body mass (kg), and a stress index (ranging from 0 to 100) were recorded along with their blood pressure. The raw data is presented in the following table:
Footnote 3: There is actually a third interval, the mean of the next m individual values, but this is rarely encountered in practice.


Age (years)   Blood Pressure (mm)   Body Mass (kg)   Stress Index (no units)
50            120                   55               69
20            141                   47               83
20            124                   33               77
30            126                   65               75
30            117                   47               71
50            129                   58               73
60            123                   46               67
50            125                   68               71
40            132                   70               77
55            123                   42               69
40            132                   33               74
40            155                   55               86
20            147                   48               84
31            .                     53               86
32            146                   59               .

JMP Analysis

The raw data is also available in a JMP data sheet called bloodpress.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data has been entered with rows corresponding to the different subjects and columns corresponding to the different variables:


Notice that the response variable is continuous, as are the other variables.4 Also notice that the blood pressure value is missing for one subject; it cannot be used in the analysis, but predictions can be made for this subject as all the X values are present. One subject is missing one of the X variables; this subject cannot be used in the fitting process nor for making predictions. The remaining sample size is only 13 subjects. As usual, the researcher needs to think about why certain values are missing.

It is also interesting to note that measurement error in the X variables could be a concern. For example, it is highly unlikely that the first subject is exactly 20.000000 years old! People usually truncate their age when asked, e.g. even on the day before their 21st birthday, a person will still respond as their age being 20 years old. Here the error in aging ranges from about 5% of the value (when age is around 20 years old) to about 2% (when the age is around 50 years old). How was weight collected? If the subjects were actually weighed, the actual number may not be in dispute (i.e. it is unlikely that the scale is wrong), but then the weight includes shoes, clothes, and ???? If the weight is a recall measurement, many people under-report their actual weight, often by quite a margin. And how is stress measured? It is likely an index based on a survey, but it is not even clear how to numerically measure stress; after all, stress can't simply be measured like temperature.

Begin by plotting the variables against each other - a simple way is a scatter plot matrix available under the Analyze->MultiVariateMethods->Multivariate platform:

Footnote 4: In actual fact, these variables have been discretized, but as the discretization interval is small relative to typical values, they can be treated as being continuous.


The scatter plot matrix shows no strong simple relationships between pairs of variables. Rather surprisingly, weight seems to decrease with age, and there appears to be a general increase of blood pressure with weight. These pairwise scatter plots are primarily useful for checking for outliers and other problems in the data; often a multivariate relationship is too complex to be seen in simple pairwise plots.

We will fit the model where the response variable (blood pressure) is modeled as a function of the three predictor variables (age, weight, and stress index). Using the shorthand notation discussed earlier, the model is:

BloodPressure = Age Weight Stress

This model is fit using the Analyze->Fit Model platform:


The X variables can be listed in any order. The output from the Analyze->Fit Model platform is voluminous and cannot be displayed in one panel, so it is necessary to look at several parts in more detail. Because of the missing values, only 13 subjects could be used in the model fit:


The number of cases actually used in the fit should always be ascertained because, in large datasets, the missing-value pattern may not be easily discerned.

First, assess the overall fit of the model by examining the plot of the actual blood pressure vs. the predicted blood pressure. If the model made exact predictions, then the points on the plot would all lie perfectly on the 45 degree line. The plot from this fit:

shows that most points lie fairly close to the 45 degree line. As well, there are no points that appear to have undue leverage on the plot, as there is a general scatter around the 45 degree line. The residual plot:


also shows a random scatter of residuals around the value of 0 with no apparent pattern.

The whole model test, i.e. whether any of the X variables provide information on predicting Y, is found in the Analysis of Variance table:

The p-value is very small, and so there is good evidence that at least one X variable appears to predict the blood pressure. Of course, at this point, it is unclear which X variables are good predictors and which X variables may be poor predictors. The fitted regression equation is found by looking at the Parameter Estimates area:


and is:

BloodPressure = 61.3 + .45(age) - .087(stress) + 2.37(weight)

These coefficients are interpreted as the MARGINAL increase in the blood pressure when each variable changes by 1 unit AND ALL OTHER VARIABLES REMAIN FIXED. For example, the coefficient of 0.45 for age indicates that the estimated blood pressure increases by .45 units per year increase in age, assuming that the stress index and weight remain constant.

The concept of marginality, i.e. the marginal change in Y when a single X variable is changed but all other X variables are held fixed, is the crucial concept in multiple regression. In some cases, for example polynomial regression, it is impossible to hold all other X variables fixed, as you will see later in this chapter.

The sign of the coefficient for stress is somewhat surprising, but as you will see in a few minutes it is nothing to worry about.

Are there any X variables that don't appear to be useful in predicting blood pressure? The Effect Tests or the Parameter Estimates table provide some clues:

The p-values from the Effect Tests table or the Parameter Estimates table are identical; the F-statistic is simply the t-ratio squared. These are MARGINAL tests, i.e. is a particular X variable useful in predicting the blood pressure given that all other variables remain in the model? For example, the test for age examines if blood pressure changes with age after adjusting for stress and weight. The test for stress examines if blood pressure changes with stress after adjusting for age and weight.

In this example, the p-value for stress appears to be statistically non-significant. This would imply that blood pressure does not seem to increase with stress after adjusting for age and weight. This would indicate that perhaps stress could be dropped from the model and a final model using only age and weight may be suitable. Consequently, the negative sign on the coefficient is not really worrisome.

Again, this concept of marginality is crucial for the proper interpretation of the statistical tests. If two X variables are related, it is possible that both of the statistical tests could be non-significant, but this does not imply that both variables can be dropped from the model. Later in this chapter (Section 22.4), it will be shown how to test if multiple variables can be simultaneously dropped from the model.

The leverage plots should also be examined to see that any relationship between the predictor and response variable is not highly dependent upon a single (high leverage) point:

Leverage plots, in general, examine the new information for each X variable in predicting Y after adjusting for all the other variables in the model. The general theory is presented in Section 22.2.6. Two features of the plot should be examined. The general statistical significance of the X variable is found by considering the slope of the line and whether the confidence curves contain the horizontal line:

We see that the confidence curves in the leverage plots for age and weight both do not contain the horizontal line. However, the confidence curve on the leverage plot for stress includes the horizontal line, indicating that this variable's contribution to predicting blood pressure is not statistically useful.

The second feature of leverage plots that should be examined is the distribution of points along the X axis of the leverage plot. There should be a fairly even distribution along the bottom axis, and the fitted line in the leverage plot should not be heavily influenced by a few points with high leverage.

By clicking on the red triangle associated with the fit:

it is possible to save various predictions to the data table. For example, save the predicted values and the two types of confidence intervals (for the mean and for individuals):


Notice that for observation 14, only the blood pressure was missing, and so predictions of the blood pressure for that individual can be made. However, for individual 15, at least one of the X variables had a missing value and so no predictions can be made. The predictions are simply found by substituting the X values into the prediction equation.

As in simple linear regression, there are two different confidence intervals. The confidence interval for the MEAN response would be useful for predicting the average blood pressure over many people with the same values of X as recorded. The confidence interval for the INDIVIDUAL response would be useful to predict the blood pressure for a single future individual with those particular X values. A common error is to confuse these two types of intervals.

As in simple linear regression, a common way to make predictions is to add rows to the end of the data table with the Y variable deliberately set to missing and the X values set to those of interest. These rows are NOT used in the model fitting, but because the X set is complete, predictions can be made.

If the residuals are saved to the data table, a normal probability plot of the residuals can be made using the Analyze->Distribution platform on the saved residuals. Similarly, the residuals can be plotted against each X variable in turn to assess if there is a linear marginal relationship between Y and each X variable. Each of these residual plots should show a random scatter around zero.

It is also possible to do inverse predictions, but this is beyond the scope of these notes. There are lots of other interesting features to the Analyze->Fit Model platform that are beyond the scope of these notes.
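For readers who want to follow the example outside JMP, the sketch below is an illustration only (the notes themselves use JMP, and the printed layout of the output will differ). It enters the blood-pressure data from the table above, fits BloodPressure = Age Weight Stress with case-wise deletion of the incomplete rows, and requests both types of intervals at the X values of subject 14.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Blood-pressure data from the table above; 'Weight' = body mass (kg).
df = pd.DataFrame({
    "Age":    [50, 20, 20, 30, 30, 50, 60, 50, 40, 55, 40, 40, 20, 31, 32],
    "BP":     [120, 141, 124, 126, 117, 129, 123, 125, 132, 123,
               132, 155, 147, np.nan, 146],
    "Weight": [55, 47, 33, 65, 47, 58, 46, 68, 70, 42, 33, 55, 48, 53, 59],
    "Stress": [69, 83, 77, 75, 71, 73, 67, 71, 77, 69, 74, 86, 84, 86, np.nan],
})

# Rows with any missing value are dropped (case-wise deletion), as in JMP.
fit = smf.ols("BP ~ Age + Weight + Stress", data=df).fit()
print(fit.summary())              # whole-model test, estimates, t-ratios, p-values
print("subjects used:", int(fit.nobs))   # 13 of the 15 subjects

# Prediction for subject 14 (X complete, Y missing): mean CI and individual PI.
new = pd.DataFrame({"Age": [31], "Weight": [53], "Stress": [86]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))
```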

22.2

Regression problems and diagnostics

22.2.1

Introduction

"All models are wrong, but some are useful." (G.E.P. Box)

This famous quote implies that no study ever satisfies the assumptions made when modeling the data. However, unless the violations are extreme, perhaps the model can be useful to make predictions. In this section, we will take a detailed look at a number of diagnostic measures to assess the fit of our model to the data.

22.2.2

Preliminary characteristics

Before building complex models, the analyst should become familiar with the basic properties of their data. This is accomplished by examining the RRRs of experimental and survey design as they relate to this study: What is the scale (nominal, ordinal, interval, ratio) of each variable? Which are the predictor and response variables? What is the type (discrete, continuous, discretized continuous) of each variable?

Then do some basic plots and tabulations to spot potential problems in the data (a small code sketch of these checks follows below):

Missing values. Examine the pattern of missing values. Most regression packages practice case-wise deletion, i.e. any observation (row) that is missing any of the X or the Y variables is not used in the analysis. If you have a large dataset with many X variables, even a small percentage of missing values can lead to many rows being deleted from the analysis. Think about how the missing values came about: are they MCAR, MAR, or IM? JMP has a nice feature to tabulate the pattern of missing values under the Tables menu.

Single-variable descriptive statistics. For each variable in the dataset, do some basic descriptive statistics and plots (e.g. histograms, dot-plots, box-plots) to identify potentially extreme observations. Check to see that all values are plausible, e.g. if one variable records the sex of the subject, only two possible values should be recorded; it is unlikely that a woman has 20 natural children; it is unlikely that a human male is more than 3 m tall, etc.

Pairwise plots. Create bivariate plots of all the variables. Check for unusual-looking observations. These may be perfectly valid observations, but they should be examined in more detail to make sure. A casement plot (a matrix of pairwise scatter plots) can be created easily in JMP using the Analyze->MultiVariateMethods->Multivariate platform.
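The sketch below shows what these preliminary checks might look like in Python; it is an illustration only (not part of the original notes) and the small data table is invented.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A small invented table; in practice df would be your own data.
df = pd.DataFrame({
    "sex":    ["m", "f", "f", "m", "f", "m"],
    "age":    [34, 51, np.nan, 22, 45, 38],
    "weight": [78, 61, 70, 355, 66, 80],      # 355 is an implausible value worth checking
    "bp":     [121, 135, 128, np.nan, 130, 125],
})

# Missing-value pattern: per-variable counts and number of complete rows.
print(df.isna().sum())
print("complete rows:", len(df.dropna()), "of", len(df))

# Single-variable descriptive statistics; scan for implausible values.
print(df.describe(include="all"))

# Pairwise (casement) scatter-plot matrix of the numeric variables.
pd.plotting.scatter_matrix(df.select_dtypes("number"))
plt.show()
```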

22.2.3

Residual plots

After the model is fit, compute the residuals, which are simply the VERTICAL differences between the observed and predicted values, ei = Yi - Ŷi. Most computer packages will compute and plot residuals easily. The basic assumption about the VERTICAL discrepancies was that they have a mean of zero and a CONSTANT variance σ². We estimated the variance by the MSE in the ANOVA table.

There are several different types of residuals that can be computed and plotted:

Standardized residuals. These are simply computed as zi = ei / sqrt(MSE) and are an attempt to create residuals with a mean of 0 and a variance of 1, i.e. like a standard normal distribution. Because all the residuals are divided by the same value, the pattern seen in the standardized residuals will be the same as seen in the ordinary residuals.

Studentized residuals. The precision of the predictions changes at different parts of the regression line. You saw earlier that the confidence band for the mean response got wider as the prediction point moved further away from the center of the data. The studentized residuals (see book for computational details) attempt to standardize each residual by its approximate precision. Because each residual is adjusted individually, plots of the studentized residuals will look slightly different from those of the regular or standardized residuals, but they will be similar.

Jackknifed residuals. Less commonly computed, jackknifed residuals are computed by fitting a regression line after dropping each point in turn, and then finding the residual. For example, if there were 4 data points, the jackknifed residual for the first point would be the difference between the observed value and the predicted value based on a regression line fit to points 2, 3 and 4 only. The jackknifed residual for the second observation would be the difference between the observed value and the predicted value based on the 1st, 3rd and 4th observations. Plots based on these residuals will appear similar, but not exactly the same as plots based on the other residuals.

Several plots can be constructed. First, look at the univariate distribution of the residuals. Which observations correspond to the largest negative and positive residuals? Second, plot the residuals against each predictor variable, against the PREDICTED Y values, and against the order in which the data were collected (this may be, but is not necessarily, the order of the observations in the dataset). Don't plot the residuals against the observed Y values because you will see strange patterns that are artifacts of the plot.5 A good residual plot will show random scatter around zero; bad residual plots will show a definite pattern. Typical residual plots are illustrated below; with small datasets, the patterns will not be as clear-cut.

Footnote 5: Basically, negative residuals will be associated with smaller Y values and these will increase as Y increases, and then crash and rise and then crash and rise again.
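All of these residual flavours are available from a fitted model without hand computation. The sketch below is an illustration only (invented data, not part of the original notes); it pulls the raw, standardized, studentized, and jackknifed (externally studentized) residuals from a statsmodels fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": np.linspace(0, 10, 25)})
df["y"] = 1.0 + 0.8 * df["x"] + rng.normal(scale=0.5, size=25)

fit = smf.ols("y ~ x", data=df).fit()
infl = fit.get_influence()

resids = pd.DataFrame({
    "raw":          fit.resid,                          # Yi - Yhat_i
    "standardized": fit.resid / np.sqrt(fit.mse_resid), # divide by sqrt(MSE)
    "studentized":  infl.resid_studentized_internal,    # each residual scaled by its own se
    "jackknifed":   infl.resid_studentized_external,    # leave-one-out (deleted) residuals
})
print(resids.head())

# Plot residuals against each X and against the PREDICTED values, e.g.
# plt.scatter(fit.fittedvalues, fit.resid) -- not against the observed Y.
```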

With small datasets, don't over-analyze the plots - only gross deviations from ideal plots are of interest. Modern alternatives to residual plots are to plot the absolute value of the residuals and fit LOWESS curves through them. Consult our Stat400 course (Data Analysis) for details.

Many books present formal tests for residuals; I find these not particularly useful, and prefer the simple residual plots. However, one useful diagnostic is the Durbin-Watson test for autocorrelation; consult the chapter on trend analysis in this collection for details. Finally, many books also present what are known as normal probability plots to assess the normality of the residuals. Again, I have found these to be less than useful.

22.2.4

Actual vs. Predicted Plot

In multiple regression, it is very difficult to look at plots of Y vs. each X variable and come to anything very useful. In general, you are trying to view a multi-dimensional space in two dimensions. A plot of the actual Y vs. the predicted Y's is useful to assess how well the model does in predicting each observation. This plot is produced automatically by JMP and many other packages. In some packages, you will have to save the predicted values and do the plot yourself.

22.2.5

Detecting influential observations

An influential observation is defined as an observation whose deletion greatly changes the results of the regression. There are many techniques available for spotting individual influential points; however, many of these methods will fail to detect pairs of influential points in close proximity to each other.

Cook's D One popular measure of an observation's influence is Cook's distance. This statistic measures the extent to which the regression coefficients change when each individual observation is deleted. It is a summary measure of the impact of the observation's deletion and is a weighted sum6 of:

(b0 - b0(i))², (b1 - b1(i))², ..., (bk - bk(i))²

where bk(i) is the estimate of the regression coefficient for the kth variable after dropping the ith observation. If a point has no effect on the fit, then Di will be zero. Large values of Di indicate points that have a large influence on the fit. There is no easy rule for determining which values of Di are extreme.7 A general rule of thumb is to look at the distribution of the D's and examine those observations corresponding to extreme values.

Footnote 6: Refer to the original paper for the exact formula.
Footnote 7: An often-quoted rule is to look at values of Di that are greater than 1, but recent work has shown that this rule does not perform effectively.


Hats Oddly named statistics are the "hats", or leverage values. These are computed under the idea that if a point has extreme influence, the regression should predict it exactly. Consequently, the hats are computed from what is known (for historical reasons) as the hat matrix, which is defined as

X(X'X)⁻¹X'

and should not be attempted to be computed by hand! If a hat value is larger than about twice the average hat value, then this is usually taken to indicate an influential point. There are more formal rules for checking the hat values, but these are seldom worthwhile.
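Both diagnostics are routinely available from regression software. The sketch below is an illustration only (invented data, not part of the original notes); it computes Cook's distance and the hat (leverage) values and flags hat values larger than twice their average, the rough screen mentioned above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(scale=0.7, size=30)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
infl = fit.get_influence()

cooks_d = infl.cooks_distance[0]     # one D value per observation
hats = infl.hat_matrix_diag          # diagonal of X (X'X)^{-1} X'

# Rough screen: hat values more than twice the average deserve a closer look.
cutoff = 2 * hats.mean()
print("largest Cook's D at rows:", np.argsort(cooks_d)[-3:])
print("high-leverage rows:", np.where(hats > cutoff)[0])
```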

Caution It is clear that some observations must be the most extreme in every sample, and so it would be silly to automatically delete these extreme observations without a careful consideration of the underlying data! The purpose of Cook's D and other similar statistics is to warn the analyst that certain observations require additional scrutiny. Don't data snoop simply to polish the fit!

22.2.6

Leverage plots

These are likely the most useful of the diagnostic tools for spotting influential observations and are produced by many computer packages. The Leverage Plots produced by JMP are examples of what are also called partial regression plots or adjusted variable plots. They are constructed for each individual variable.

Suppose that we are regressing Y on four predictors X1, ..., X4. The leverage plot for X1 is constructed as follows:

1. Find the residuals when Y is regressed against all the other variables except X1, i.e. fit the model Y = X2 X3 X4. Denote this residual as eY|X(-1), where the -1 indicates that the first variable was dropped from the set of X's.

2. Find the residuals when X1 is regressed against the other X variables, i.e. fit the model X1 = X2 X3 X4. Denote this residual as eX1|X(-1), where the -1 indicates that the first variable was dropped from the set of X's.

3. Plot the first residual against the second residual for each observation.8

Now if X1 has no further information about Y (after accounting for the other X's), then the X1 variable really isn't needed, and so all the first residuals should be centered around zero with random scatter.

Footnote 8: JMP actually adds the mean of Y and X1 to the residuals before plotting, but this does not change the shape of the plot.


But suppose that X1 is important in predicting Y. Then the residuals from the regression of Y on the other X variables should be missing the contribution of X1, and the residual plot should show an upward (or downward) trend relative to the other residuals. In fact, if you fit a regression line to the leverage plot, the slope will equal the slope in the full regression model. If the contribution of X1 is not linear, then the plot will show a non-linear relationship.

Why is X1 regressed against the other X variables? Recall that the interpretation of the slope in multiple regression is the MARGINAL contribution after adjusting for all other variables in the model. In other words, the slope reflects the NEW information in X1 after adjusting for the other X's. How is the new information in X1 found? Yes, by regressing X1 against the other variables. For example, suppose that X1 was an exact copy of another variable in the dataset. Then the second residuals would all be zero, indicating no new information (why?). So, if the leverage plot shows a very thin vertical band of points, this may be an indication that a certain variable does NOT have useful marginal information, i.e. is redundant given the other variables. This condition is known as multicollinearity and is discussed later in this chapter.

If a single observation has high leverage, the leverage plot will show the observation as an outlier. The diagram below demonstrates some of the important cases for leverage plots:


In JMP, and many other packages, the points on these plots are hot-linked to the data sheet. By clicking on these points, you can identify the observation in the data sheet.

The concept of leverage plots is sufficiently important and non-obvious that a numerical example will be examined. In JMP, open the Fitness.jmp dataset from the JMP sample dataset library. This dataset consists of measurements taken on subjects: their age, weight, oxygen consumption, time to run a mile, and three measurements of their pulse rate. The first few lines of the data file are:

Fit a model to predict oxygen consumption as the Y variable with age, weight, runtime, and the three pulse measurements as the X variables. The estimated slopes are:

and the leverage plot for Runtime is:


To reproduce this leverage plot, first fit the model for oxygen consumption, dropping the run-time variable, and save the residuals to the data sheet.


Next, regress run-time against the other X variables and save the residuals to the data sheet:


This will give the data sheet with two new columns added:

Finally, plot the Residual of Oxygen on all but runtime vs. the Residual of runtime on other X variables and fit a line through that plot using the Analyze->Fit Y-by-X platform:


You will see that this plot looks the same as the leverage plot (but the Y and X axes are scaled slightly differently) and that the slope on this plot (-2.639) matches the estimated slope seen earlier.

Leverage plots should be used with some caution. They will show the nature of the functional relationship with the variable, but not its exact form. As well, as these plots are made after adjusting for other variables, a variety of curvature models should be investigated. And if the functional form of the other variables is incorrect (e.g. age² is needed but has not been added to the model), then the true nature of the relationship may be missed.

You can get JMP to save all the leverage pairs under the Save Columns pop-down menu.
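The same reproduction can be scripted outside JMP. The sketch below is an illustration only (invented data and variable names, not the Fitness dataset itself); it follows the recipe above: regress Y on every predictor except X1, regress X1 on those same predictors, and regress the first set of residuals on the second. The slope of that last regression matches the coefficient of X1 from the full model, just as the JMP leverage plot does.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "x1": rng.normal(size=40),
    "x2": rng.normal(size=40),
    "x3": rng.normal(size=40),
})
df["y"] = 2 + 1.5 * df["x1"] - 0.5 * df["x2"] + 0.8 * df["x3"] + rng.normal(size=40)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Step 1: residuals of Y regressed on everything except x1.
e_y = smf.ols("y ~ x2 + x3", data=df).fit().resid
# Step 2: residuals of x1 regressed on the other predictors (the "new" information in x1).
e_x1 = smf.ols("x1 ~ x2 + x3", data=df).fit().resid
# Step 3: regress the first residual on the second; plotting e_y vs e_x1 gives the leverage plot.
pr = smf.ols("e_y ~ e_x1", data=pd.DataFrame({"e_y": e_y, "e_x1": e_x1})).fit()

print(full.params["x1"], pr.params["e_x1"])   # the two slopes agree
```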


22.2.7

Collinearity

It is often the case that many of the X variables are related to each other. For example, if you wanted to predict blood pressure as a function of several variables including height and weight, there is a strong relationship between these two latter variables. When the relationship among the predictor variables is strong, they are said to be collinear. This can lead to problems in fitting the model and in interpreting the results of a model fit. In this example, it is conceivable that you could increase the weight of a subject while holding height constant, but suppose the two variables were total hours of sunshine and total hours of clouds in a year. If one increases, the other must decrease.

Because the regression coefficients are interpreted as the MARGINAL contribution of each predictor, collinearity among the predictors can mask the contribution of a variable. For example, if both height and weight are fit in a model, then the marginal contribution of height (given weight is already in the model) is small; similarly, the marginal contribution of weight (given height is in the model) is also small. However, it would not be valid to say that the marginal contribution of both height and weight (together) is small. In Section 22.4, methods for testing if several variables can be deleted simultaneously from the model are presented.

If the predictor variables were perfectly collinear, the whole model-fitting procedure breaks down. It turns out that a certain matrix used in the model fitting cannot be numerically inverted (similar to trying to divide by zero) and no estimates are possible. If the variables are not perfectly collinear, many different sets of estimates can be found that give very nearly the same predictions!

Not all the story is bad: multicollinearity does not imply that the whole regression model is useless. Even if predictor variables are highly related, good predictions are still possible provided that you make predictions at values of X that are similar to those used in model fitting.

The basic tool for diagnosing potential collinearity is the variance inflation factor (VIF) for each regression coefficient. In JMP this is obtained by right-clicking on the table of parameter estimates after the Analyze->Fit Model platform is run. For example, the VIFs for the fitness dataset are:


The VIF is interpreted as the increase in the variance (se²) of the estimate compared to what would be expected if the variable were completely independent of all other predictor variables. The VIF equals 1 when a predictor is not collinear with other predictors. VIFs that are very large, typically around 10 or higher, are usually taken as an indication of potential collinearity.

In the fitness dataset, there is evidence of collinearity between the average pulse rate during the run (Run Pulse) and the maximum pulse rate during the run (Max Pulse). This is not unexpected.

If collinearity is detected, remedial measures include dropping some of the redundant predictor variables,9 or more sophisticated fitting methods such as ridge or robust regression (which are beyond the scope of this course).

Footnote 9: An obvious question is how do you tell which variables are redundant? Common methods are principal component analysis of the X variables, or examining the correlation among the predictors. Seek help if you run into a problem of extreme multicollinearity.
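If your package does not report VIFs, they can be computed from the predictors alone. The sketch below is an illustration only (invented data, not the fitness dataset); x2 is built to be nearly a copy of x1, so those two predictors should show large VIFs while the unrelated predictor stays near 1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)
df = pd.DataFrame({"x1": rng.normal(size=50), "x3": rng.normal(size=50)})
df["x2"] = df["x1"] + rng.normal(scale=0.1, size=50)   # nearly a copy of x1 -> collinear

# VIFs are computed from the predictors only, with an intercept column added.
X = sm.add_constant(df[["x1", "x2", "x3"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=["x1", "x2", "x3"],
)
print(vif)   # x1 and x2 should show large VIFs; x3 should be near 1
```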


22.3

Polynomial, product, and interaction terms

22.3.1

Introduction

The assumption of a marginal linear relationship between the response variable and an X variable is sometimes not true, and a quadratic, or (rarely) a cubic or higher polynomial in X, is often fit in order to approximate this non-linear relationship.

The basic way to deal with polynomial regression (i.e. quadratic and higher terms) is to create new predictor variables involving X², X³, .... Although not necessary with modern software, it is often a good idea to center variables that will be used in quadratic and higher relationships to avoid a high degree of collinearity among the terms. For example, replace X and X² by (X - mean(X)) and (X - mean(X))² respectively. While the actual coefficients may change, the p-value for testing the quadratic (highest-order) slope is unaffected, and predictions are also unaffected; this is exactly analogous to a unit change between imperial and metric units.

The model fit is

Yi = β0 + β1 Xi1 + β2 Xi1² + εi

If the squared term is called X2, the model is:

Yi = β0 + β1 Xi1 + β2 Xi2 + εi

which now looks exactly like an ordinary multiple regression model. The rest of the model fitting, testing, etc. proceeds exactly as outlined in previous sections. However, there are two potential problems with polynomial models.

Models should be hierarchical. This means that if you include a term involving X² in the model, you must include a term involving X. If you include the quadratic but not the linear term, you are restricting the quadratic curve to be a very special shape, which is not usually wanted in practice. This will be outlined in class.

The interpretation of the estimates must be done with care. Normally, the estimated slopes are the MARGINAL contribution of this variable to the response, i.e. after holding all other variables constant. However, if the regression equation includes both X and X² terms, it is impossible to hold X fixed while changing X² alone.

What degree of polynomial is suitable? This is usually determined by fitting successively higher polynomial terms until the added term is no longer statistically significant, and then using the previous model. While polynomial models allow for some degree of curvature in the response, it is very rare to fit terms involving cubic and higher powers. The reason for this is that such curves seldom have biological plausibility, and they have wide oscillations in their predicted values.

The researcher should also investigate if a transform of the Y or X variable may linearize the relationship. For example, a plot of log(Y) vs. X may show a linear fit. Similarly, 1/X may be a more suitable predictor.10 It is possible to use least squares to actually fit non-linear models where no transformation or polynomial terms provide a good fit. This is beyond the scope of this course.

22.3.2

Example: Tomato growth as a function of water

An experiment was run to investigate the yield of tomato plants as a function of the amount of water provided over the season. A series of plots were randomized to different watering levels and, at the end of the season, the yield of the plants was determined. The raw data follows:

Water   Yield
6       49.2, 48.1, 48.0, 49.6, 47.0
8       51.5, 51.7, 50.4, 51.2, 48.4
10      51.1, 51.5, 50.3, 48.9, 48.7
12      48.6, 48.0, 46.4, 46.2
14      43.2, 42.6, 42.1, 43.9, 40.5

Footnote 10: For example, should the fuel economy of a car be measured as miles/gallon (distance/consumption) or L/100 km (consumption/distance)?

JMP Analysis: The raw data is also available in a JMP data sheet called tomatowater.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data is entered into JMP in the usual fashion: columns represent variables and rows represent observations. The scale of both variables should be continuous.

As usual, begin with a plot of the data:


The relationship is clearly non-linear and looks as if a quadratic may be suitable. Before fitting the model, think about the assumptions required for the fit and assess if these are suitable to the data at hand.

There are two ways to fit simple (i.e. only involving polynomial terms in X) polynomial models in JMP. If your regression model is a mixture of polynomial and other X variables, then the second method must be used.

In the first method, the Analyze->Fit Y-by-X platform can be used directly. For example, select the platform:


and choose Polynomial Fit:


which gives a plot of the fitted line:


and statistics about the fit:


The fitted curve is:

Yield = 57.726857 - 0.762(Water) - 0.2928571(Water - 10)²

Notice that JMP has automatically centered the quadratic term by subtracting the mean X of 10 from each value prior to squaring. As you will see in a few minutes, this has no effect upon the test of significance of the quadratic term, nor on the actual predicted values.

The ANOVA table can be used to examine if either the linear or the quadratic term provides any predictive power. The table of estimates shows that the quadratic term is clearly statistically significant. Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).

A residual plot is obtained in the usual fashion:


which shows no evidence of a problem.

If a cubic polynomial is fit (in the same fashion as the quadratic polynomial) you will see that the cubic term is not statistically significant, indicating that a quadratic model is sufficient.

Confidence bands for the mean response at each X and for the individual response at each X can also be obtained in the usual way:


Again, the scientist must understand the difference between the confidence bounds for each type of prediction, as outlined in earlier chapters.

The second way to fit polynomial models (and the only way when polynomial terms are intermixed with other variables) is to use the Analyze->Fit Model platform. First, variables corresponding to X² and X³ (if needed) must be created using the formula editor of JMP:11
Footnote 11: It is preferable to use JMP's formula editor rather than creating these variables outside of the data sheet because these columns will be hot-linked to the original column. If, for example, a value of X is updated, then the values of the squared and cubic terms will also be updated automatically.


and a portion of the resulting data table is shown below:

Note that the X variable was centered before squaring and cubing. Now use the Analyze->Fit Model platform to fit the model using the water and water-squared terms:

The plot of actual vs. predicted shows a good fit:


The ANOVA table (not shown) can be used to assess the overall fit of the model as seen in earlier sections. The estimates match those seen earlier, as do the p-values:

Confidence intervals for the regression coefficients can be found in the usual fashion by right clicking in the table and requesting the appropriate columns (not shown). The leverage plot for the X² term shows that this polynomial is required and is not influenced by any unusual values:
12 Because of the hierarchical restriction, the leverage plot for the linear term is not of interest.


Confidence intervals for the mean response or individual responses are saved to the data table in the usual fashion (but are not shown in these notes):


Finally, getting a plot of the actual fitted line takes a bit of work if using the Analyze->Fit Model platform. First, save the predicted values to the data table:


Then use the Overlay Plot under the Graph menu to plot the individual points and the predicted values:


and then join up the predicted values (and remove the fitted points)


to finally give the plot that we saw earlier (whew!). Unfortunately, there does not appear to be any way to draw a smooth curve short of getting predictions for many points between the observed values of X and drawing the curve through these smaller increments.
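Continuing the hypothetical Python sketch above (it assumes the fit and df objects from the earlier block), a smooth curve can be drawn by predicting on a fine grid of X values and joining those predictions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# predict at many intermediate water values, using the same centering as the fit
grid = pd.DataFrame({"water": np.linspace(df["water"].min(), df["water"].max(), 200)})
grid["water_c2"] = (grid["water"] - df["water"].mean()) ** 2
grid["pred"] = fit.predict(grid)

plt.scatter(df["water"], df["yield_"])      # raw points
plt.plot(grid["water"], grid["pred"])       # smooth fitted curve
plt.xlabel("water"); plt.ylabel("yield")
plt.show()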

22.3.3 Polynomial models with several variables

The methods of the previous section can be extended to cases where several variables have quadratic or higher powers. It is also possible to include cross-products of these variables as well. There are no conceptual difficulties in having multiple polynomial variables. However, the analyst must ensure that models are hierarchical (i.e. if higher powers or cross products are included, then lower order terms must also be included). Consequently, leverage plots of the lower order terms are likely not to be very useful when higher order terms are included in the model. In practice, polynomial models are commonly restricted to quadratic terms or lower. The goal is not so much to elucidate the underlying mechanism of the response, but rather to get a good approximation to the response surface. Indeed, there is a whole suite of techniques (commonly called response surface methodology) used to fit and explore polynomial models in this context. Often predictions of where the maximum or minimum response is found are important. There are many excellent books available. JMP also has specialized tools in the Analyze->Fit Model platform to assist in the fitting of response surfaces. These are beyond the scope of these notes.

22.3.4 Cross-product and interaction terms

Recall that the interpretation of the regression coefficient associated with the i-th predictor variable is the marginal (i.e. after keeping all other variables in the model fixed) increase in Y per unit change in X_i. This marginal increase is the same regardless of the values of the other X variables. But sometimes, the contribution of the i-th variable depends upon the value of another, the j-th, predictor. For example, suppose that blood pressure tends to increase by .5 units for every kg increase in body mass for people under 1.5 m in height, but tends to increase by .6 units for every kg increase in body mass for people over 1.5 m in height. We would say that body mass interacts with the height variable. This concept is very similar to the analogous interaction of factors in ANOVA models.13 Consider a model where blood pressure depends upon age and height via the model:

BP = AGE HEIGHT

This corresponds to the formal statistical model of:

Y_i = β0 + β1 AGE_i + β2 HEIGHT_i + ε_i

You can see that if age increases by 1 unit, then the value of Y increases by β1 units regardless of the value of height. Similarly, every time height increases by 1 unit, Y increases by β2 regardless of the value of age. Now consider the model written as:

BP = AGE HEIGHT AGE*HEIGHT

which corresponds to the formal statistical model of:

Y_i = β0 + β1 AGE_i + β2 HEIGHT_i + β3 AGE_i HEIGHT_i + ε_i

The cross-product of age and height enters into the model as a new predictor variable.14 Now look what happens when age is increased by 1 unit. The value of Y increases not simply by β1 but by β1 + β3 HEIGHT_i. Now when height is small, the increase in Y per unit change in age is smaller than when height is large.

13 Indeed, this is not surprising as ANOVA is actually a special case of regression.
14 The actual X matrix would then have four columns. Column 1 would consist of all 1s; column 2 would consist of the values of age; column 3 would consist of the values of height; and column 4 would contain the actual products of age and height for each individual.


Similarly, an increase by 1 unit in the value of height will lead to an increase of β2 + β3 AGE_i. The effect of height will be less for younger subjects than for older subjects. The use of product terms in multiple-regression can be easily extended to products involving more than two variables, and, more importantly as discussed in Section 22.5.3, to products with indicator variables. There is no real problem in fitting these models other than that the model must conform to the hierarchical principle. This principle states that if terms like X_i X_j are in the model, so must be all lower order terms: in this case, both X_i and X_j as separate terms must remain in the model. This is the same principle as you saw for polynomial models.
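A hedged Python sketch of an age-by-height interaction (the data values are invented placeholders; only the structure matters) shows how the marginal effect of AGE changes with HEIGHT:

import pandas as pd
import statsmodels.formula.api as smf

bp = pd.DataFrame({                              # hypothetical values
    "BP":     [110, 118, 125, 122, 131, 140],
    "AGE":    [30, 40, 50, 35, 45, 55],
    "HEIGHT": [1.55, 1.60, 1.62, 1.75, 1.80, 1.85],
})
# AGE*HEIGHT expands to AGE + HEIGHT + AGE:HEIGHT, so the model stays hierarchical
m = smf.ols("BP ~ AGE * HEIGHT", data=bp).fit()
b = m.params
for h in (1.5, 1.8):
    # marginal change in BP per unit of AGE, evaluated at height h
    print(h, b["AGE"] + b["AGE:HEIGHT"] * h)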

22.4 The general linear test

22.4.1 Introduction

In previous sections, you saw how to test if a specific regression coefficient in the population was zero using the t-test provided by most computer packages. It is tempting then to try and test if multiple X variables can be dropped simultaneously if their individual p-values are all not statistically significant. Unfortunately this strategy often fails. The basic reason for its failure is that very often regression coefficients are highly interrelated because their corresponding X variables are not orthogonal to each other. For example, suppose that both height and weight were X variables in a model that was trying to predict blood pressure. The tests of the hypotheses for the slopes for weight and height are MARGINAL tests, i.e. is the slope associated with weight in the population zero assuming that all other variables (including height) are retained in the model. Because of the high interdependency between height and weight, the p-value for the test of marginal zero slope for weight may not be statistically significant. Similarly, the p-value for the test of marginal zero slope for height (assuming that weight is in the model) may also be statistically non-significant. However, both height and weight cannot be simultaneously removed from the model. In order to test if a set of predictor variables can be simultaneously removed from the model, a General Linear Test is performed. The theoretical mechanics of the test are:

1. Fit the full model, i.e. with all variables present. Find SSE_full from the full model.
2. Fit the reduced model, i.e. dropping the variables of interest. Find SSE_reduced from the reduced model.
3. If the reduced model is still an adequate fit, then SSE_reduced should be very close to SSE_full; after all, if the dropped variables were not important, then the reduction in prediction error should be small. Construct a test statistic as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]

This is compared to an F-distribution with the appropriate degrees of freedom. Large values of the F-statistic indicate evidence that not all variables can be simultaneously dropped. Of course, this procedure has been automated in most statistical packages as will be illustrated by an example.
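A small sketch of the general linear test outside a statistical package (assuming you already have the error sums of squares and degrees of freedom from the full and reduced fits):

from scipy import stats

def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """Return (F, p) for simultaneously dropping the extra terms in the full model."""
    F = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    p = stats.f.sf(F, df_reduced - df_full, df_full)
    return F, p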

22.4.2 Example: Predicting body fat from measurements

The percentage of body fat in humans is a good indicator of future problems with cardiovascular and other diseases. The following was taken from Wikipedia:15

Body fat percentage is the fraction of the total body mass that is adipose tissue. This index is often used as a means to monitor progress during a diet or as a measure of physical fitness for certain sports, such as body building. It is more accurate as a measure of health than body mass index (BMI) since it directly measures body composition and there are separate body fat guidelines for men and women. However, its popularity is less than BMI because most of the techniques used to measure body fat percentage require equipment and skills that are not readily available. The most accurate method has been to weigh a person underwater in order to obtain the average density (mass per unit volume). Since fat tissue has a lower density than muscles and bones, it is possible to estimate the fat content. This estimate is distorted by the fact that muscles and bones have different densities: for a person with a more-than-average amount of bone tissue, the estimate will be too low. However, this method gives highly reproducible results for individual persons (±1%). The body fat percentage is commonly calculated from one of two formulas:

Brozek formula: BF = (4.57/p − 4.142) × 100
Siri formula:   BF = (4.95/p − 4.50) × 100

In these formulas, p is the body density in kg/L obtained by weighing the person out of water and then dividing by the volume obtained by dunking the person underwater. BTW, the American Council on Exercise has associated categories with ranges of body fat. Women generally have less muscle mass than men and therefore they have a higher body fat percentage range for each category.

Description     Women     Men
Essential fat   10-13%    2-5%
Athletes        14-20%    6-13%
Fitness         21-24%    14-17%
Acceptable      25-31%    18-24%
Obesity         32%+      25%+

15 2006-05-15, at http://en.wikipedia.org/wiki/Body_fat_percentage
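The two formulas quoted above are easy to compute directly; a small sketch (the density value in the example call is arbitrary):

def brozek(p):                      # p = body density in kg/L
    return (4.57 / p - 4.142) * 100

def siri(p):
    return (4.95 / p - 4.50) * 100

print(brozek(1.05), siri(1.05))     # a density of 1.05 kg/L gives roughly 21% body fat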


Many studies have been done to see if predictions of body fat can be made based on simple measurements such as circumferences of various body parts. A study of middle-aged men measured the percentage of body fat using the difficult methods explained above and also took measurements of the circumference of their thigh, triceps, and mid-arm. Here are the raw data:

Triceps  Thigh  Mid-arm  PerBodyFat
  19      43      29       11.9
  24      49      28       22.8
  30      51      37       18.7
  29      54      31       20.1
  19      42      30       12.9
  25      53      23       21.7
  31      58      27       27.1
  27      52      30       25.4
  22      49      23       21.3
  25      53      24       19.3
  31      56      30       25.4
  30      56      28       27.2
  18      46      23       11.7
  19      44      28       17.8
  14      45      21       12.8
  29      54      30       23.9
  27      55      25       22.6
  30      58      24       25.4
  22      48      27       14.8
  25      51      27       21.1

JMP Analysis: The raw data is also available in a JMP data sheet called bodyfat.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fit the multiple-regression model using the Analyze->Fit Model platform:


The resulting estimates all have tests for the marginal population slopes that are statistically non-significant:

But at the same time, the whole model test:


shows that there is predictive ability in these X variables because the overall p-value is statistically significant. The problem is that the X variables are all highly related. Indeed, a scatter plot matrix of the X variables shows a high degree of relationship among them:


A general linear test for dropping, say, both the triceps and thigh X variables is constructed using the Custom Tests pop-down menu item:


and then specifying which X variables are to be tested together. You need a separate column in the Custom Test for each variable to be tested; if you specify multiple variables in a single column, you will get a test for a crazy hypothesis:

The final result:


has a p-value of .000003 which is very strong evidence that both variables cannot be dropped simultaneously. If you look at the ANOVA table from the full model:

the SSE_full = 100.1 with 16 df. The reduced model is fit using the Analyze->Fit Model platform with just the Mid-arm variable, and the reduced model ANOVA table is:


with the SSE_reduced = 487.4 with 18 df. The general linear test is found as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]
          = [ (487.4 − 100.1) / (18 − 16) ] / [ 100.1 / 16 ]
          = 193.65 / 6.3
          = 30.94

which is the value reported above.
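Using the hypothetical general_linear_test() sketch from Section 22.4.1 with the sums of squares reported above reproduces the same statistic (and a p-value near the .000003 quoted earlier):

F, p = general_linear_test(sse_reduced=487.4, df_reduced=18, sse_full=100.1, df_full=16)
print(F, p)    # F is about 30.9; p should be roughly 3e-6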

22.4.3 Summary

The general linear test is often used to test if a chunk of X variables can be removed from the model. Often this chunk will be a set of variables that have something in common. For example, often all quadratic terms are tested simultaneously, or a variable and all its higher order terms (e.g. X, X², X³, etc.).

22.5 Indicator variables

22.5.1 Introduction

Indicator variables (also known as dummy variables) are a device to incorporate nominal-scaled variables into regression contexts. For example, suppose you looked at the relationship between blood pressure and weight. In general, the blood pressure of an individual increases with weight. But in general, males are larger than females, so a body weight of 90 kg may have a different effect for males than for females. So how can sex (a nominally scaled variable) be incorporated into the regression equation? It turns out that using indicator variables makes ordinary regression a general tool for many more applications than simply regression. Indeed, it is possible to show that two-sample t-tests, single-factor completely randomized design ANOVAs, and even more complex experimental designs can be analyzed using regression methods. This is why many computer packages call their analysis tools for comparing means and fitting regressions variants of general linear models.

22.5.2 Defining indicator variables

Unfortunately, there is no standard way to define an indicator variable in a regression setting, but fortunately, it turns out that it doesn't matter which formulation is used: it is always possible to get an appropriate answer. In general, if a nominally scaled variable has k categories, you will require k − 1 indicator variables. In many cases, computer packages will generate these automatically if the package knows that the variable is to be treated as a nominally scaled variable.16 For example, as sex only has two levels, only one indicator variable is required. It could be coded as:

X1 = 1 if male, 0 if female

or

X1 = 1 if male, −1 if female

Many other codings are possible. For a nominally scaled variable with three levels, two indicator variables will be needed. For example, suppose that the size of a person is classified as small, medium, or large. Then the indicator variables could be defined as:

X1 = 1 if small, 0 if medium or large
X2 = 1 if medium, 0 if small or large

Now the pair of variables defines the three classes as: (X1, X2) = (1, 0) = small, (X1, X2) = (0, 1) = medium, and (X1, X2) = (0, 0) = large. Many packages use what is known as reference coding rules for indicator variables, where the i-th indicator variable takes the value of 1 to indicate the i-th value of the variable for the first k − 1 values of the variable, and all the indicator variables take the value 0 to refer to the last value of the variable.17 So, how do indicator variables help incorporate the effects of a nominally scaled variable? Consider the variable sex (taking two levels labeled f and m in that order). A single indicator variable, say Sex,
16 That is why it is good practice to code nominally scaled variables using alphanumeric codes (e.g. m and f for sex), rather than numeric codes such as 3 or 7.
17 Always check the package documentation carefully to see if the package is using this rule. If it uses a different coding scheme, you will have to interpret the estimates carefully.


is defined that takes the value of 1 for females and 0 for males. Now consider the following estimated regression equation:

BloodPressure = 110 − 10 Sex + .10 Weight

The estimated blood pressure for a female who weighs 100 kg would be:

110 = 110 − 10(1) + .10(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + .10(100)

Hence, the coefficient associated with sex (with a value of −10) would be interpreted as the difference in blood pressure between females and males for all weight classes, i.e. the relationship consists of two parallel lines (with a slope against weight of .10) with a separation of 10 units. On the other hand, consider the regression equation:

BloodPressure = 110 − 10 Sex + .10 Weight − .05 Sex Weight

Notice that two variables (the Sex indicator variable and the weight variable) are multiplied together. Now, the estimated blood pressure for a female who weighs 100 kg would be:

105 = 110 − 10(1) + .10(100) − .05(1)(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + .10(100) − .05(0)(100)

Hence, the coefficient associated with the product of sex and weight would be interpreted as the differential response to weight between males and females, i.e. the relationship consists of two non-parallel lines. The slope for males against weight is .10 while the slope for females against weight is .10 − .05 = .05. This idea can be extended to nominally scaled variables with more than two levels in a straightforward way. Fortunately, most packages will do the coding automatically for you and all that is necessary is to specify the model appropriately and understand what the various model formulations imply.
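The two estimated equations above can be written out as small Python helpers (a sketch of the arithmetic only, with Sex = 1 for females and 0 for males):

def bp_parallel(sex, weight):
    return 110 - 10 * sex + 0.10 * weight

def bp_interaction(sex, weight):
    return 110 - 10 * sex + 0.10 * weight - 0.05 * sex * weight

print(bp_parallel(1, 100), bp_parallel(0, 100))        # 110.0 120.0: parallel lines, 10 units apart
print(bp_interaction(1, 100), bp_interaction(0, 100))  # 105.0 120.0: slopes of .05 vs .10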

22.5.3 The ANCOVA model

The use of indicator variables has, for historical reasons, been referred to as the Analysis of Covariance (ANCOVA) approach. It actually has two separate, but functionally identical, uses. The first use is to incorporate nominally scaled variables into regression situations. The modeling starts off with individual regression lines, one for each value of the nominal variable (e.g. a separate line for males and females). A statistical test is used to see if the lines are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be used for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are co-incident,

i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is used to make predictions. All of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can be simply pooled together and a single regression line fit for all of the data. The three possibilities are shown below for the case of two groups - the extension to many groups is obvious:


Second, ANCOVA has been used to test for differences in means among the groups when some of the variation in the response variable can be explained by a covariate. For example, the effectiveness of two different diets can be compared by randomizing people to the two diets and measuring the weight change during the experiment. However, some of the variation in weight change may be related to initial weight. Perhaps by standardizing everyone to some common weight, we can more easily detect differences among the groups. This will be discussed in a later chapter. A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. Details are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.


22.5.4 Assumptions

As before, it is important before the analysis is started to verify the assumptions underlying the analysis. As ANCOVA is a combination of ANOVA and regression, the assumptions are similar. Both goals of ANCOVA have similar assumptions:

- The response variable Y is continuous (interval or ratio scaled).
- The data are collected under a completely randomized design.18 This implies that the treatment must be randomized completely over the entire set of experimental units if an experimental study, or units must be selected at random from the relevant populations if an observational study.
- There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don't appear to follow the straight line.
- The relationship between Y and X must be linear for each group.19 Check this assumption by looking at the individual plots of Y vs. X for each group.
- The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal around the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.
- The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.

22.5.5 Comparing individual regression lines

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable; X be the continuous X-variable; and Group be the nominally scaled group variable with TWO levels, i.e. only one indicator variable will be generated, called I. In this and the previous chapter, we use a shorthand model notation. For example, the model notation

Y = X

would refer to a regression of Y on X with the underlying statistical model:

Y = β0 + β1 X + ε

18 It is possible to relax this assumption - this is beyond the scope of this course.
19 It is possible to relax this assumption as well, but this is again beyond the scope of this course.


where the subscript corresponding to individual subjects has been dropped for clarity. We now use an extension of this model notation. The model notation:

Y = X Group Group*X

refers to the model:

Y = β0 + β1 X + β2 I + β3 I X + ε

Lastly, the model notation:

Y = X Group

refers to the model:

Y = β0 + β1 X + β2 I + ε

These can be diagrammed in graphs. If the lines for each group are not parallel:


the appropriate model is

Y1 = X Group Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), group effects (different intercepts), a common slope on X, and an interaction between Group and X, which is interpreted as different slopes for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model compared to fitting separate slopes is that all of the groups contribute to a better estimate of residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually. If the lines are parallel across groups, but not coincident:

the appropriate model is

Y2 = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X interaction term. It would not be surprising then that a statistical test to see if this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA. Lastly, if the lines are co-incident:

the appropriate model is

Y3 = X

Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for co-incident lines should only be done if there is insufficient evidence against parallelism. While it is possible to test for a non-zero slope, this is rarely done.
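The three models can be sketched in Python with statsmodels formulas; the tiny data frame below is an invented placeholder just to make the sketch self-contained:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

d = pd.DataFrame({                              # hypothetical values
    "Y": [5.2, 4.8, 4.1, 3.9, 4.9, 4.6, 4.0, 3.7],
    "X": [1, 2, 3, 4, 1, 2, 3, 4],
    "Group": ["a", "a", "a", "a", "b", "b", "b", "b"],
})

m1 = smf.ols("Y ~ X * C(Group)", data=d).fit()  # separate intercept and slope per group
m2 = smf.ols("Y ~ X + C(Group)", data=d).fit()  # parallel lines (common slope)
m3 = smf.ols("Y ~ X", data=d).fit()             # coincident lines (a single line)

print(anova_lm(m2, m1))   # test of parallelism (the X:Group interaction)
print(anova_lm(m3, m2))   # test of coincidence (the Group term), only if parallelism holds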


22.5.6 Example: Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade. Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study. Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample.20 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample. As seen in the chapter on regression, the appropriate response variable is log(TEQ). Is the rate of decline the same for both sites? Did the sites have the same initial concentration? Here are the raw data which are also available on the web in the SampleProgramLibrary available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
20 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability which can be used to determine the optimal allocation between samples within years and the number of years to monitor.


Site  Year  TEQ     log(TEQ)
a     1990  179.05  5.19
a     1991   82.39  4.41
a     1992  130.18  4.87
a     1993   97.06  4.58
a     1994   49.34  3.90
a     1995   57.05  4.04
a     1996   57.41  4.05
a     1997   29.94  3.40
a     1998   48.48  3.88
a     1999   49.67  3.91
a     2000   34.25  3.53
a     2001   59.28  4.08
a     2002   34.92  3.55
a     2003   28.16  3.34
b     1990   93.07  4.53
b     1991  105.23  4.66
b     1992  188.13  5.24
b     1993  133.81  4.90
b     1994   69.17  4.24
b     1995  150.52  5.01
b     1996   95.47  4.56
b     1997  146.80  4.99
b     1998   85.83  4.45
b     1999   67.72  4.22
b     2000   42.44  3.75
b     2001   53.88  3.99
b     2002   81.11  4.40
b     2003   70.88  4.26

The data is entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable. In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say for site a) and using the Rows->Markers to set the plotting symbol for the selected rows:


The final data sheet has two different plotting symbols for the two sites:


Before fitting the various models, begin with an exploratory examination of the data looking for outliers and checking the assumptions. Each year's data is independent of other years' data as a different set of crabs was selected. Similarly, the data from one site are independent from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the floor of the sea to capture the available crabs in the area. Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable was body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would then violate the assumption of independence as the residual in one site in a year would be related to the residual in another site in the same year. You tend to see positive residuals from the fitted line at one site matched (by year) with positive residuals at the other site, and negative residuals matched with negative residuals. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related.21 Use the Analyze->Fit Y-by-X platform and specify the log(TEQ) as the Y variable, and Year as the X variable:

21 If you actually try and fit a process error term to this model, you find that the estimated process error is zero.


Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line:

and selecting Site as the grouping variable:


Now select the Fit Line from the same pop-down menu:


to get separate lines fit for each group:


The relationships for each site appear to be linear. The actual estimates are also presented:


The scatter plot doesn't show any obvious outliers. The estimated slope for the a site is -.107 (se .02) while the estimated slope for the b site is -.06 (se .02). The 95% confidence intervals (not shown on the output but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably so the slopes could be the same for the two groups. The MSE from site a is .10 and the MSE from site b is .12. This corresponds to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable. The residual plots (not shown) also look reasonable. The assumptions appear to be satisfied, so let us now fit the various models. First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:

The terms can be in any order and correspond to the model described earlier. This gives the following output:

The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel. We need to refit the model, dropping the interaction term:


which gives the following regression plot:


This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal in both sites, but the initial concentration appears to be different. The estimated (common) slope is found in the Parameter Estimates portion of the output:


and has a value of -.083 (se .016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(-.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is from (-.12 -> -.05) which corresponds to a potential factor between exp(-.12) = .88 and exp(-.05) = .95 per year, i.e. between a 12% and a 5% decline per year.22 While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:
22 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.


The difference between the lines (on the log-scale) is estimated to be 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the slopes are parallel and declining, the dioxin levels are falling in both sites, but the 1.58 times ratio remains consistent.
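The back-transformations quoted in this example are just exponentials of the log-scale estimates; a short sketch of the arithmetic:

import math

print(math.exp(-0.083))                   # about 0.92: roughly an 8% decline per year
print(math.exp(-0.12), math.exp(-0.05))   # about 0.88 and 0.95: the CI as yearly factors
print(math.exp(0.46))                     # about 1.58: site b has about 1.58 times the TEQ of site a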

Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here) and the residual plot don't show any evidence of a problem in the fit.

22.5.7 Example: More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slope was examined based on: Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope. Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477. In that paper, stream slope was (roughly) categorized into high or low slope classes and a paired-analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope. Recall that a stream reach is a portion of a stream from 10 to several hundred meters in length that exhibits consistent slope. The slope influences the general speed of the water which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.


Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations. Past studies addressing this issue have used sampling designs wherein data were collected either using repeated samples along a single stream or measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematical. [Indeed, if only a single stream is measured on multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.] Inference from streams spread over time and space is made more difficult by the inter-stream differences and temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects. For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured. In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface. The following table presents the (fictitious but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope class  slope (%)  density (per 100 m²)
   1       low          0.7        15.0
   1       high         4.0        21.0
   2       low          2.4        11.0
   2       high         6.0         3.1
   3       low          0.7         5.9
   3       high         2.6         6.4
   4       low          1.3        12.2
   4       high         4.0        17.6
   5       low          0.6         6.2
   5       high         4.4         7.0
   6       low          1.3        39.8
   6       high         3.2        25.0
   7       low          2.0         6.5
   7       high         4.2        11.2
   8       low          1.3         9.6
   8       high         4.2        17.5
   9       low          2.0         7.3
   9       high         3.6        10.0
  10       low          0.7        11.3
  10       high         3.5        21.0
  11       low          2.3        12.1
  11       high         6.0        12.1
  12       low          2.5        13.2
  12       high         4.2        15.0
  13       low          2.3         5.0
  13       high         6.0         5.0
  14       low          1.2        10.2
  14       high         2.9         6.0
  15       low          0.7         8.5
  15       high         2.9         7.0
  16       low          1.1         5.8
  16       high         3.0         5.0
  17       low          2.2         5.1
  17       high         5.0         5.0
  18       low          0.7        65.4
  18       high         3.2        55.0
  19       low          0.7        13.2
  19       high         3.0        15.0
  20       low          0.3         7.1
  20       high         3.2        12.0
  21       low          2.3        44.8
  21       high         7.0        48.0
  22       low          1.8        16.0
  22       high         6.0        20.0
  23       low          2.2         7.2
  23       high         6.0        10.1

Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.


The raw data is available in a JMP data file called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within stream; the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the early chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed. Unlike the example presented in other chapters where the slope is divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here. First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1 unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change, the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1). Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice simple interpretation.23 An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of each stream - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = stream slope

where the term stream represents a nominally scaled variable and gives the different intercepts, and slope is the effect of the common slope on the log(density). This is fit using the Analyze->Fit Model platform as:
23 The JMP dataset also created a different plotting symbol for each stream using the Rows > Color or Mark by Column menu.
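A rough Python equivalent of this block-plus-slope model (using only the first three streams from the table above to keep the sketch short; the column names are assumptions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

streams = pd.DataFrame({
    "stream":  [1, 1, 2, 2, 3, 3],
    "slope":   [0.7, 4.0, 2.4, 6.0, 0.7, 2.6],
    "density": [15.0, 21.0, 11.0, 3.1, 5.9, 6.4],
})
streams["log_density"] = np.log(streams["density"])    # natural log, as in the notes

# C(stream) gives each stream its own intercept (a block effect); slope is the common slope
fit = smf.ols("log_density ~ C(stream) + slope", data=streams).fit()
print(fit.params["slope"])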


Note that stream must have a nominal scale and that slope must have a continuous scale. The order of the terms in the effects box is not important. The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features. First is a plot of the common slope fit to each stream:


This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:


Generally, the observed values are close to the predicted values except for two potential outliers. By clicking on these points, it is shown that both points belong to stream 2, where it appears that increases in the slope cause a large decrease in density, contrary to the general pattern seen in the other streams. The effect tests:

fail to detect any influence of slope. Indeed the estimated coefficient associated with a change in slope is found to be:


is estimated to be .025 (se .0299) which is not statistically significant.24 Residual plots also show the odd behavior of stream 2:

If this rogue stream is eliminated from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p=.035):
24 Because the analysis was done on the (natural) log scale, smallish slope coefficients have an approximate percentage interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the slope increases by one percentage point.


The estimated change in log-density per percentage point change in the slope is found to be:

i.e. the slope is .05 (se .02), which is interpreted as: a percentage point increase in stream slope increases fish density by about 5%.25 The remaining residual plot and leverage plots show no problems.

22.6 Example: Predicting PM10 levels

Small particulates are known to have adverse health effects. Here is some background information from Wikipedia:26

The effects of inhaling particulate matter have been widely studied in humans and animals and include asthma, lung cancer, cardiovascular issues, and premature death. The size of the particle determines where in the body the particle will come to rest if inhaled. Larger particles are generally filtered by small hairs in the nose and throat and do not cause problems, but particulate matter smaller than about 10 micrometers, referred to as PM10, can settle in the bronchial tubes and lungs and cause health problems. Particles smaller than 2.5 micrometers, PM2.5, can penetrate directly into the lung, whereas particles smaller than 1 micrometer, PM1, can penetrate into the alveolar region of the lung and tend to be the most hazardous when inhaled. The large number of deaths and other health problems associated with particulate pollution was first demonstrated in the early 1970s (Lave et al., 1973) and has been reproduced many times since. PM pollution is estimated to cause 20,000-50,000 deaths per year in the United States (Mokdad et al., 2004) and 200,000 deaths per year in Europe. For this reason, the US Environmental Protection Agency (EPA) sets standards for PM10 and PM2.5 concentrations in urban air. EPA regulates primary particulate emissions and precursors to secondary emissions (NOx, sulfur, and ammonia). Many urban areas in the US and Europe still frequently violate the particulate standards, though urban air has gotten cleaner, on average, with respect to particulates over the last quarter of the 20th century.

The data are a subsample of 500 observations from a data set that originates in a study where air pollution at a road is related to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration. The response variable consists of hourly values of the logarithm of the concentration (why?) of PM10 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001. The data were extracted from http://lib.stat.cmu.edu/datasets/ and are available in the file pm10.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Wind direction is an interesting variable as it ranges from 0 to 360 around a circle and cannot be used directly in a regression setting; after all, directions of 1 degree and 359 degrees are very similar but have vastly different measured values. Examine the histogram of the wind directions (obtained from the Analyze->Distribution platform):

25 This easy interpretation occurs because the natural log transform was used. If the common (base 10) log transform was used, there is no longer such a simple interpretation.
26 Downloaded from http://en.wikipedia.org/wiki/Particulate on 2006-05-22


This seems to indicate that there are two major wind directions. The E winds correspond to wind directions from about 320-360 degrees and from 0-150 degrees, while the W winds correspond to directions between 150 and 320 degrees. Convert these measurements into a nominal scaled variable using JMP's formula editor:
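A sketch of the same classification rule outside JMP (the exact treatment of the boundary values at 150 and 320 degrees is an assumption here):

import numpy as np
import pandas as pd

wind_dir = pd.Series([10.0, 200.0, 340.0, 155.0])              # illustrative directions (degrees)
wind_ew = np.where((wind_dir >= 150) & (wind_dir < 320), "W", "E")
print(wind_ew)                                                  # ['E' 'W' 'E' 'W']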


This classifies the wind direction into the two categories. A character coding is used to prevent computer packages from interpreting a numeric code as an interval or ratio scaled variable. An indicator variable could be created for this variable as seen in earlier chapters. An initial scatter plot matrix of the data is obtained by using the Analyze->MultiVariateMethods->Multivariate platform:


There is no obvious relationship among the variables. The plot of the day variable shows a large gap. Inspection of the data shows that recording was stopped for about 100 days in the middle of the data set; the reasons for this are unknown. The number of cars/hour varies over the hour of the day in a predictable fashion. The wind direction variable shows that most of the data points have wind blowing in two major directions corresponding to E and W as broken into categories earlier. A plot of the log(PM10) concentration by the condensed wind direction:


shows no obvious relationship between the PM10 and the wind direction. The Analyze->Fit Model platform was used to fit a model to the continuous and indicator variables.


The leverage plots (not shown) don't reveal any problems in the fit. The actual vs. predicted plot:


appears to show some evidence that the fitted line tends to under-predict at high log(PM10) concentrations and over-predict at lower log(PM10) concentrations, but the visual impression may be an artifact of the density of points. The residual plot:


doesn't show any problems with the fit. In any case, the R² is not large, indicating plenty of residual variation not explained by the regressor variables. The estimates table:

doesn't show any problems with variance inflation, but perhaps some variables can be deleted. Use the Custom Test option:


to see if the day, wind-direction, and hour can be removed. [I suspect that any hour effect has been taken up by the log(cars) effect and so is redundant (why?). Similarly, any trend over time (the day effect) may also be included in the log(cars) effect (why?)]:

[Why are three columns needed to test the three variables?] The results of the chunk test are:


showing that these variables can be safely deleted. The Analyze->Fit Model platform is again used, but now dropping these apparently redundant variables. The revised estimates from this reduced model again show no problems in the leverage plots, no problems in the residual plots, and no problems in the VIF. The estimates are:


This time, it appears that both temperature variables are also redundant. This is somewhat surprising, but on sober second thought, perhaps not so. The temperature wouldn't affect the creation of particles; after all, if the cars are the driving force behind the levels, the cars will produce the same particulate levels regardless of temperature. Perhaps temperature only affects how the PM10 levels affect human health, i.e. on hot days, perhaps people feel more affected by pollution. A chunk test using the Custom Test procedure shows that the temperature variables can also be dropped (not shown). The final model includes only two variables, the log(cars/hour) and the wind speed. The final estimates are:

As the number of cars/hour increases, the pollution level increases. As both the pollution level and the number of cars have been measured on the log scale, the coefficient must be interpreted carefully. A doubling of the number of cars corresponds to an increase of .7 on the natural logarithm scale (log(2) = .7). Hence, the log(PM10) increases by .7(.32) = .22, which corresponds to an exp(.22) = 1.25 times increase on the anti-log scale. In other words, a doubling of cars/hour corresponds to a 25% increase in the PM10 levels. As wind speed increases, the concentration of PM10 decreases. A similar exercise shows that an increase in wind speed of 1 m/second causes the PM10 concentration to decrease by about 10%. The leverage plots and residual plots show no problems in the data. How well does the model perform in practice? One way to assess this is to save the Std Err of predictions of the mean and of individual predictions to the data table:
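The back-of-the-envelope arithmetic for the log-log coefficient, written out (0.32 is the estimated coefficient of log(cars/hour) quoted above):

import math

b_logcars = 0.32
change = math.log(2) * b_logcars     # effect on log(PM10) of doubling cars/hour: about .22
print(change, math.exp(change))      # exp(.22) is about 1.25, i.e. roughly a 25% increase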


(similar actions are done to save the std error for individual predictions and the actual predicted values). Then compute the ratio of each of the standard errors to the predicted values:


(again, only one formula is shown) and use the Analyze->Distribution platform to see the histograms of the relative prediction errors:


Predictions of the MEAN response are fairly good: the relative standard errors are under 5%, so the 95% confidence intervals for the predicted response will be fairly tight. However, as expected, the prediction intervals for individual responses are fairly poor: the relative prediction standard errors are around 25%, which means that the 95% prediction intervals will be roughly ±50%! It is unclear how useful this is for advising individuals to take preventive actions under certain conditions of traffic volume and wind speed.

22.7 Variable selection methods

22.7.1 Introduction

Up to now, it has been assumed that the variables to be used in the regression equation are basically known and all that matters is perhaps deleting some variables as being unimportant, or deciding upon the degree of the polynomial needed for a variable. In some cases, researchers are faced with several tens (sometimes hundreds or thousands) of predictors and help is needed in even selecting a reasonable subset of variables to describe the relationship.


The techniques in this section are called variable selection methods. CAUTION: Variable selection methods, despite their apparent objectivity, are no substitute for intelligent thought. As you will see in the remainder of this section, there are numerous caveats that must be kept in mind when using these methods. There are two philosophies underlying variable selection methods. The first philosophy is that there is a unique correct model that explains the data. This MAY be true in physical systems where the goal of the project is to understand mechanisms of action. The role of variable selection is to try and come up with the variables that describe the mechanism of action. The second philosophy (and one that I personally find more appealing) is that reality is hopelessly complex and that all our models are wrong. We hope via regression methods to come up with a prediction function that works satisfactorily. There is NO unique set of predictors which is correct; there may be several sets of predictors that all give reasonable answers and the choice among these sets is not obvious. In both cases, model selection follows five general steps:

1. Specify the maximum model (i.e. the largest set of predictors).
2. Specify a criterion for selecting a model.
3. Specify a strategy for selecting variables.
4. Specify a mechanism for fitting the models, usually least squares.
5. Assess the goodness-of-fit of the models and the predictions.

22.7.2 Maximum model

The maximum model is the set of predictors that contains all potential predictors of interest. Often researchers will add polynomial terms (e.g. X1²), crossproduct terms (e.g. X1 X2), or transformations of variables (e.g. ln(X1)). If the first philosophy is correct, this maximal model must contain the correct model as a subset of the potential predictor variables. As the maximum model, this model has the highest predictive power, but some predictors may be redundant. Under the second philosophy, we know that this (and all models) are wrong, but we hope that this maximal model is a reasonable predictor function. Again, some predictors may be redundant. Some caution must be used in specifying a maximum model. First, try to avoid including many variables that are collinear. For example, height and weight are highly collinear: are both variables really needed? If including polynomial or cross-product terms, center (i.e. subtract the mean) before squaring the variables or taking cross-products. Use scientific knowledge to select the potential predictors and the shape of the prediction function. Classification variables (i.e. nominal or interval scaled variables) will generate a separate indicator variable for each level of the variable. Some computer programs (e.g. JMP) may generate contrasts among these indicator variables as well.


Second, there are various rules of thumb for the maximum number of predictors that should be entertained for a dataset. Generally, you want about 10 observations for each potential predictor variable. Hence, if your maximum model has 30 potential predictor variables, this rule of thumb would require you to have at least 300 observations! Remember that a nominal scaled variable with k values will require k − 1 indicator variables! Third, examine the contrast within variables. If a variable is essentially constant (e.g. every subject had essentially the same weight), then this is a useless predictor variable as no effect of weight will be apparent. If an indicator variable only points to a single case (e.g. only a single female in the dataset) then the results may be highly specific to the dataset analyzed. Low contrast variables should not be included in the maximum model.

22.7.3 Selecting a model criterion

The model criterion is an index that is computed for each candidate model and used to compare the various models. Given a particular criterion, one can order the models from best to worst. The criterion used should be related to the goal of the analysis. If the goal is prediction, the selection criterion should be related to errors in prediction. If the goal is variable subset selection, then the criterion should be related to the quality of the subset. There is NO single best criterion. A literature search will reveal at least 10 criteria that have been proposed. In this chapter, five of the criteria will be discussed; this is not to say that these five are the optimal criteria, but rather the most frequently chosen. These criteria are $R^2$, $F_p$, $MSE_p$, $C_p$, and AIC.

$R^2$

The $R^2$ criterion is the simplest criterion in use. The value of $R^2$ measures, in some sense, the proportion of the total variation in the data that is explained by the predictors. Consequently, higher values of $R^2$ are better. However, this criterion has a number of defects. First, $R^2$ will never decrease as you add variables (regardless of usefulness) to models. But in many cases, a plot of $R^2$ against the number of variables shows a rapid increase as variables are added, then a leveling off where new variables add very little new information. Models near the bend of the curve seem to offer a reasonable description of the data. Some packages adjust the value of $R^2$ for the number of variables (called the adjusted $R^2$), and again a model whose adjusted $R^2$ sits near the bend of the curve would be the target. The usual form of the adjustment is shown below.
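The notes do not give the adjustment explicitly; the standard form, which most packages use, is
$$
R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},
$$
where $n$ is the number of observations and $p$ is the number of predictors excluding the intercept. Unlike $R^2$, the adjusted $R^2$ can decrease when a useless predictor is added.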

$F_p$

The $F_p$ criterion is essentially a series of hypothesis tests to see which set of p variables is not statistically different from the full model. If the test statistic for a set of p predictors is not statistically significant, then the other variables can be dropped.

The danger with this criterion is that every test has a probability of a Type I (false positive) error. So if you do 50 tests, each at α = .05, there is a very good chance that at least one of the tests will show a statistically significant result when in fact there is no effect. If you decide to use this criterion, you likely want to do the tests at a more stringent level, i.e. use α = .01 or α = .001. A sketch of such a partial F-test is given below.
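For concreteness, here is a minimal sketch (not from the original notes) of the partial F-test comparing a reduced subset model to the full model using statsmodels; the data file, response, and predictor names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("mydata.csv")   # hypothetical dataset

full    = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()
reduced = smf.ols("y ~ x1 + x2", data=df).fit()

# Extra sum-of-squares (partial) F-test: is the reduced model
# statistically different from the full model?
print(anova_lm(reduced, full))
```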

$MSE_p$

This criterion uses the estimated residual variance about the regression line. This residual variance is a combination of unexplainable variation and excess variation caused by unknown (omitted) predictors. In many cases, there is a subset that has close to the minimal residual variation.

$C_p$ and AIC

These are two related (and, in linear regression, equivalent) criteria. Mallows' $C_p$ is computed as
$$
C_p = \frac{SSE(p)}{MSE(k)} - [\,n - 2(p+1)\,]
$$
where $SSE(p)$ is the error sum of squares from the subset with p predictors, where p EXCLUDES the intercept27; $MSE(k)$ is the mean square error from the maximum model; and n is the number of observations. If the maximum model does contain the truth, then Mallows showed that $C_p$ should be close to p + 128 for the subset model that is closest to the truth.

The Akaike Information Criterion (AIC) is a one-to-one transformation of $C_p$ and can be thought of as AIC = fit + penalty for predictors. In the case of multiple regression, AIC has a simple form:
$$
AIC = n \log\!\left(\frac{SSE}{n}\right) + 2p
$$

where again p is the number of predictors INCLUDING the intercept. The model with the smallest AIC is usually preferred, as this model has the best fit after accounting for a penalty for adding too many predictors. However, AIC goes further. Under the philosophy that all models are wrong, but some are useful, it is possible to obtain model weights for several potential models and to average the results of several competing models. This avoids the entire discussion of which is the best wrong model, but rather works on the philosophy that if several models that all seem to fit the data similarly give wildly different answers, then this uncertainty in the response must be incorporated.
27 Some textbooks define p to INCLUDE the intercept, and so the last term may look like n − 2p rather than n − 2(p + 1). Both are equivalent.
28 Again, if p is defined to include the intercept, then $C_p$ should be close to p rather than p + 1.


Burnham and Anderson (2002) have written a very nice book on the use of AIC and its philosophy. Unfortunately, the use of model weights is beyond the scope of this course. A small numerical sketch of computing $C_p$ and AIC for a candidate subset is given below.
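As a rough sketch (not from the original notes), $C_p$ and AIC for a candidate subset can be computed by hand from two fitted models; the data file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")                             # hypothetical dataset
n = len(df)

maximum = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()  # maximum model
subset  = smf.ols("y ~ x1 + x2", data=df).fit()            # candidate subset

p = 2                                        # predictors in the subset, excluding intercept
sse_p = subset.ssr                           # SSE of the subset model
mse_k = maximum.ssr / maximum.df_resid       # MSE of the maximum model

cp  = sse_p / mse_k - (n - 2 * (p + 1))      # Mallows' Cp
aic = n * np.log(sse_p / n) + 2 * (p + 1)    # AIC with p counting the intercept
# (statsmodels' own .aic differs from this by an additive constant, which does
#  not change the ranking of candidate models)
print(f"Cp = {cp:.2f}, AIC = {aic:.2f}")
```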

22.7.4 Which subsets should be examined

When we start with k potential predictors, there are many, many potential models that involve subsets of the k predictors. How are these subsets chosen?

All possible subsets

If there are k predictor variables in the maximum model, there are around $2^k$ possible subsets. This number can be enormous: for example, with 10 potential predictors there are around $2^{10} = 1024$ subsets; with 20 predictors there are around $2^{20} = 1,048,576$ possible models, etc. With modern computers and good algorithms, it is actually possible to search all subsets for up to about 15 predictors (and this number gets higher each year).29 Don't use Excel! The all-possible-subsets strategy is preferred for reasonably sized problems. Because it looks at all possible models, it is unlikely that you would miss the correct model among the subsets. However, there may be several different models that are all essentially the same, and being forced to select one of these models is a bit arbitrary; hence one of the driving forces behind the AIC. A brute-force sketch of this strategy appears below.
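As an illustration only (not from the original notes), a brute-force all-subsets search over a handful of predictors can be written directly; the predictor names are hypothetical and the AIC reported is statsmodels' version.

```python
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")               # hypothetical dataset
candidates = ["x1", "x2", "x3", "x4"]        # hypothetical predictors

results = []
for size in range(1, len(candidates) + 1):
    for subset in combinations(candidates, size):
        formula = "y ~ " + " + ".join(subset)
        fit = smf.ols(formula, data=df).fit()
        results.append((fit.aic, fit.rsquared_adj, formula))

# Rank the 2^k - 1 non-empty subsets by AIC (smaller is better).
for aic, r2adj, formula in sorted(results)[:5]:
    print(f"AIC={aic:8.1f}  adj-R2={r2adj:5.3f}  {formula}")
```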

Backward elimination

If you have many predictors, then all possible subsets may not be feasible. The backward elimination procedure starts with the maximum model and successively deletes variables until no further variables can be deleted. The algorithm proceeds as follows (a sketch in code follows the list):

1. Fit the maximum model.
2. Decide which variable to delete. Look at each of the individual p-values for the variables still in the model. If all of the p-values are less than some α (say .05, but this varies among packages), then stop. Otherwise, find the variable with the largest (why?) p-value and drop this variable.
3. Refit the model. Refit the model after dropping this variable, and repeat step 2 until no further variables can be deleted.
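The loop just described might look like the following sketch (again, hypothetical data and column names; the threshold α = 0.05 is only for illustration).

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")               # hypothetical dataset
remaining = ["x1", "x2", "x3", "x4"]         # start from the maximum model
alpha = 0.05

while remaining:
    fit = smf.ols("y ~ " + " + ".join(remaining), data=df).fit()
    pvals = fit.pvalues.drop("Intercept")    # p-values of the predictors only
    worst = pvals.idxmax()                   # least significant variable
    if pvals[worst] < alpha:                 # everything is significant: stop
        break
    remaining.remove(worst)                  # drop it and refit

print("Selected predictors:", remaining)
```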
29 It turns out that by cleverly computing various statistics, you can actually predict the results for many subsets without actually having to fit all of the subsets.


One must be careful to ensure that models are hierarchical, i.e. if an $X^2$ term remains in the model, then the corresponding $X$ term must also remain. Many computer packages will violate this restriction if left to their own devices.

Forward addition

This is the reverse of the backward elimination procedure. Start with a null model, and keep adding variables until no more can be added. The variable with the smallest incremental p-value at each step is the variable that is added. Again, you must ensure that if an $X^2$ term is entered, the corresponding $X$ term is also entered.

Stepwise selection

It may turn out that adding a variable during a forward pass makes an existing variable redundant. The forward addition process has no mechanism for deleting variables once they've entered the model. In a stepwise selection procedure, after a variable is entered, a backward elimination step is attempted to see if any variable can be removed. A sketch combining the two passes appears below.
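Combining the two directions gives a stepwise sketch along these lines (illustrative only; hypothetical data and predictor names, and the enter/leave thresholds are arbitrary choices, not JMP's defaults).

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")                 # hypothetical dataset
candidates = ["x1", "x2", "x3", "x4"]          # hypothetical predictors
selected = []
p_enter, p_leave = 0.05, 0.10                  # entering is harder than staying

def fit_with(terms):
    rhs = " + ".join(terms) if terms else "1"
    return smf.ols("y ~ " + rhs, data=df).fit()

for _ in range(2 * len(candidates)):           # cap the passes to avoid cycling
    changed = False
    # Forward step: enter the unused variable with the smallest p-value.
    trials = {x: fit_with(selected + [x]).pvalues[x]
              for x in candidates if x not in selected}
    if trials and min(trials.values()) < p_enter:
        selected.append(min(trials, key=trials.get))
        changed = True
    # Backward step: remove the weakest variable if it no longer earns its keep.
    if selected:
        pvals = fit_with(selected).pvalues.drop("Intercept")
        if pvals.max() > p_leave:
            selected.remove(pvals.idxmax())
            changed = True
    if not changed:
        break

print("Selected predictors:", selected)
```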

Closing words

In all of these automated selection procedures, there is no guarantee that the chosen model will be optimal in any sense. As well, because of the many, many statistical tests performed, none of the p-values at the final step should be interpreted literally. It is also well known that if data generated completely at random are fed to stepwise methods, they will often select a model for prediction that is just noise. Consequently, the results that you obtain may be highly specific to the dataset collected and may not be reproducible with other datasets. Refer to Section 22.7.5 for ideas on evaluating the reliability of the analysis.

22.7.5 Goodness-of-fit

Even with automated variable selection methods, there is no guarantee that the fitted models actually fit the data well. Consequently, the usual residual diagnostics must be performed as outlined in earlier sections. At the same time, the analyst should avoid becoming fixated with the results from a single dataset. There is no guarantee that the results from this particular dataset translate to other datasets. There are several ways to try and assess how well the chosen relationship will work in the future:

Try on a new dataset. In some cases, the study can be repeated, and a comparison of the models selected from the existing and new studies is instructive.

Split-sample. If there are many observations, the sample can be split into two. Model selection is done on each half independently, and the two analyses compared. If a variable is selected in one half but not the other, this is an indication of instability in the analysis. How well does the model do in predictions? Recall that $R^2$ measures the percentage of variation explained by the model. Use the first half of the data, fit a model, and find the $R^2$ for the first half. Then use the model from the first sample to predict the data points for the second sample, and compute the squared correlation between the observed and predicted values. This second $R^2$ will typically be smaller than the $R^2$ based on the first sample. If the shrinkage in $R^2$ is large, this is bad news: it implies that the model from the first sample did not do well in predicting the values in the second sample.

Cross validation. In some cases, you do not have sufficient data to split into two halves. In these cases, single-case cross validation is often attempted. In this method, you fit a model excluding each case in turn, and then use the fitted model to predict the held-out case. A comparison of the fitted vs. actual values is a measure of predictive ability. A sketch of both the split-sample check and leave-one-out cross validation follows.
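These checks are easy to script. The following is a minimal sketch (not from the original notes) of the split-sample shrinkage check and leave-one-out cross validation, again with a hypothetical dataset and formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv").sample(frac=1, random_state=1)  # hypothetical data, shuffled
formula = "y ~ x1 + x2 + x3"

# Split-sample check: fit on the first half, predict the second half.
half = len(df) // 2
train, test = df.iloc[:half], df.iloc[half:]
fit = smf.ols(formula, data=train).fit()
pred = fit.predict(test)
shrunk_r2 = np.corrcoef(test["y"], pred)[0, 1] ** 2
print(f"R2 on first half: {fit.rsquared:.3f}; squared correlation on second half: {shrunk_r2:.3f}")

# Leave-one-out cross validation: refit with each case held out in turn.
loo_pred = []
for i in range(len(df)):
    fit_i = smf.ols(formula, data=df.drop(df.index[i])).fit()
    loo_pred.append(fit_i.predict(df.iloc[[i]]).iloc[0])
press = np.sum((df["y"].to_numpy() - np.array(loo_pred)) ** 2)
print(f"PRESS statistic: {press:.1f}")
```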

22.7.6 Example: Calories of candy bars

The JMP installation includes a dataset on the composition of popular candy bars. This is available under the Help → Sample Data Library → Food and Nutrition section, or in the candybar.jmp file in the Sample Program Library in the http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms directory. For each of about 50 brands of candy bars, the total calories and the composition (grams of fat, grams of fiber, etc.) were measured. Can the total calories be predicted from the various constituents?

A preliminary scatter plot of the data shows a strong relationship between calories and total grams of fat and/or grams of saturated fat, but a weaker relationship between calories and grams of protein and grams of carbohydrates. There are no obvious outliers except for a few candy bars which appear to have unusual levels of vitamins (?). The Analyze->Fit Model platform is used to request a stepwise regression analysis to try and predict the number of calories in the candy bars.

In this case, the philosophy that the correct model must be a subset of these variables is likely correct. The mechanism by which calories appear in food is well understood: likely a combination of fat, protein, and carbohydrates. It is unlikely that fiber or vitamins contribute anything substantial to the total calories. The stepwise dialogue box has a number (!) of options and statistics available.

Detailed explanation of these features is available in the JMP help, but a summary is below.

The direction of the stepwise procedure can be changed from forward, to backwards, or to mixed. If you wish to do backwards elimination, you will have to Enter All variables first before selecting this option. All possible regressions is available from the red-triangle pop-down menu.

The probability to enter and the probability to leave are set fairly liberally. A probability to enter of 0.25 indicates that variables that have any chance of being useful are added; the probability to leave indicates that as long as some marginal predictive ability is available, the variable should be retained.

If the Go button is pressed, the procedure is completely automatic. If the Step button is pressed, the procedure goes step-by-step through the algorithm. The Make Model button is used at the end to fit the final selected model and obtain the usual diagnostic features.
The package reports the $MSE$, $R^2$, the adjusted $R^2$, $C_p$, and AIC for each model. These can be used to assess the progress of the procedure. The actual model under consideration consists of those variables with check marks in the Entered boxes. If you wish to force a variable to be always present, this is possible by entering the variable and locking it in.

Change the direction to Mixed and then repeatedly press the Step button. For the first step, the program computes the p-values for each new variable to enter the model. The variable with the smallest p-value that is below the Prob to Enter will be selected to enter; this is the Total Fat variable.

The model now consists of the intercept and the total fat variable, for a total of p = 2 predictors. The $C_p$ is extremely large; the $R^2$ has increased from the previous model; the $MSE$ has decreased. None of the variables has a p-value greater than the Prob to Leave, so nothing happens for the leaving step, and the Step button must be pressed again. Based on the previous output, the carbohydrate variable will be entered next (why?),

and then the protein variable (why?).

At this point we are getting models with enormous $R^2$ values (close to 100%), which is practically unheard of in ecological contexts. Note that $C_p$ is becoming close to p. At this point, which variable would be entered next? Surprisingly, sodium is entered next, followed by saturated fat, and finally the procedure halts.

Both backward elimination and forward selection also pick this final model (try it). The Make Model button will take these selected variables and create the Analyze->Fit Model dialogue box to fit this final model.

None of the leverage plots shows anything amiss, and the residual plots look good. The final parameter estimates can then be examined.

The VIF for total fat is a bit worrisome: notice that both the total fat and saturated fat variables are in the model. Presumably, saturated fat is included in total fat and is redundant. Try refitting this model dropping the saturated fat variable and re-examine the estimates. (A sketch of computing VIFs outside of JMP is given below.)
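For readers working outside JMP, the variance inflation factors can be computed directly. This is a sketch only; the CSV file is assumed to be an export of the JMP table, and the column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("candybar.csv")    # hypothetical export of the candy bar table
X = sm.add_constant(df[["total_fat", "saturated_fat", "protein",
                        "carbohydrate", "sodium"]])

# VIF for each predictor (position 0 is the constant, so skip it).
for i, name in enumerate(X.columns[1:], start=1):
    print(f"{name:15s} VIF = {variance_inflation_factor(X.values, i):.1f}")
```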

Again, all the leverage plots look fine, and the VIFs are all small. In our final model, each additional gram of total fat adds about 8.9 calories30; each additional gram of protein adds about 4.7 calories31; each additional gram of carbohydrate adds about 4.1 calories32; and each mg of sodium decreases calories by a minuscule amount. The biological relevance of the sodium contribution is unknown. Perhaps this is an artifact of this particular data set?

This particular example was easy, as the true model is known and the response is almost exactly predicted by the predictors. As noted earlier, most ecological contexts are not nearly so perfect.

22.7.7 Example: Fitness dataset

- this will be demonstrated in class

22.7.8 Example: Predicting zoo plankton biomass

What drives the biomass of zooplankton on reefs? The zooplankton was broken into two size classes (190–600 µm and >600 µm), and environmental variables were sampled at 51 irregularly spaced sites (sampling interval: 156-37 m), arranged along a straight-line cross-shelf transect 8.4 km in length. The raw data are available at http://www.esapubs.org/archive/ecol/E085/050/suppl-1.htm#anchorFilelist in the Guadeloupe.txt file, and in the guadeloupe.jmp file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The response variable is the log-transformed zooplankton biomass of the two size classes (original units: mg/m³ ash-free dry mass)33. The predictor variables include the coordinate (km) of the sampling site along the transect;
30 The accepted value for fat is 9 calories/gram.
31 The accepted value for protein is 4 calories/gram.
32 The accepted value for carbohydrates is 4 calories/gram.
33 Why was a log-transform used?


environmental variables such as dissolved oxygen (mg/L), salinity (psu), wind speed (m/s), phytoplankton biomass (log-transformed, original units: µg/L), turbidity (NTU), and swell height (m); and habitat variables coded as 14 indicator variables indicating various habitat classes.

We will try and develop a prediction equation for the larger zooplankton category. It is always good practice to do some preliminary plots of the data to search for outliers and general trends in the data before beginning a more sophisticated analysis. Start with a scatter-plot matrix of the continuous variables obtained from the Analyze->Multivariate Methods->Multivariate platform.

There appears to be a strong bivariate relationship of biomass with distance along the transect line and with phytoplankton biomass. At the same time, several of the predictors appear to be highly related. For example, the distance along the transect line and phytoplankton biomass are very strongly related, as are wind speed and swell height. A quadratic relationship between some of the predictor variables is also apparent (e.g. wind speed vs. distance). A few unusual points appear; e.g. look at the plot of salinity vs. log(zooplankton), where two points seem at odds with the rest of the data. By clicking on these points, we see that these correspond to site 5 (whose marker I subsequently changed to an X to see where it fit in the rest of the plot) and site 1 (whose marker I subsequently changed to a triangle for the remainder of the analysis).

A common problem with indicator variables is insufficient contrast, i.e. there are only a few sampling sites with a particular habitat type. You can see how many of each habitat type are present by simply counting the number of 1s in each indicator variable column, or by finding the sum of each column, as sketched below.
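In a package such as Python/pandas this check is essentially one line; the file name and habitat column names below are hypothetical stand-ins for the 14 habitat indicators.

```python
import pandas as pd

guad = pd.read_csv("guadeloupe.csv")       # hypothetical export of the JMP table
habitat_cols = [c for c in guad.columns if c.startswith("habitat_")]

# Each indicator is 0/1, so the column sum is the number of sites with that habitat.
print(guad[habitat_cols].sum().sort_values())
```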

These indicate that there is only 1 site with under 25% coverage of sea-grass on muddy sand, and that most of the indicator variables occur at fewer than 10% of the sites. I would be hesitant to read too much into any regression equation that includes most of these indicator variables, as I suspect they will be specific to this particular dataset and not generalizable to other datasets.

So, based on this preliminary analysis, I would expect that distance and/or phytoplankton and/or turbidity would be the primary predictors of zooplankton biomass in this category. With only 51 data points, I would be reluctant to include more than about 5 predictor variables, using the rule of thumb of 10 observations per predictor. The Analyze->Fit Model platform is used to request a stepwise regression analysis.


A stepwise analysis is requested.




The step history shows that $R^2$ increases fairly rapidly until it hits around 80% and then tends to level off; the $C_p$ approaches p34, also at around step 9 or 10. The summary of the steps shows that the transect location is the first variable in, followed, surprisingly, by several indicator variables, followed by phytoplankton biomass. It is somewhat surprising that both the transect location and the phytoplankton biomass are entered into the model, as they are highly related.

Rerun the stepwise procedure, a step at a time, for the first 9 steps and then press the Make Model button to actually fit this model.
34 Note that JMP uses the convention that the count p INCLUDES the intercept.


The plot of actual vs. predicted values shows a reasonable fit. Some of the leverage plots for the indicator variables show that the fit is determined by a single site or a pair of sites.

The VIFs for the transect location and phytoplankton biomass variables are large, a consequence of the strong relationship between these two variables.

I would subsequently remove one of the transect location or the phytoplankton biomass variables, and would likely remove any indicator variable that is entered but depends on a single site, as this is surely an artifact of this particular dataset.

All possible subsets is barely feasible with this size of regression problem. It took less than three minutes to fit on my Macintosh G4 at home, but the output file was enormous! I suspect that unless some way is found to condense the output to something more user friendly, this would not be a feasible way to proceed.

This is the end of the chapter.

