7.1 Background
Let Y be the dependent variable, and X1, X2, ..., Xi, ..., Xn be the predictor
variables. The linear regression model is defined in Equation (7.1):

Y = b0 + b1*X1 + b2*X2 + ... + bi*Xi + ... + bn*Xn    (7.1)
The bs are the raw regression coefficients, which are estimated by the
methods of ordinary least squares and maximum likelihood for a continuous
and a binary dependent variable, respectively. Once the coefficients are
estimated, an individual’s predicted Y value is calculated by ‘plugging in’
the values of the predictor variables for that individual in the equation.
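The ‘plugging-in’ step can be sketched as follows; the coefficient and predictor values below are illustrative placeholders, not estimates from any data in this chapter:

```python
# Predicted Y for one individual: plug the individual's predictor values
# into the fitted equation Y-hat = b0 + b1*X1 + ... + bn*Xn.
# The numbers here are illustrative, not fitted values.

def predict(b0, coeffs, x):
    """Linear prediction: b0 + sum of b_i * x_i."""
    return b0 + sum(b * xi for b, xi in zip(coeffs, x))

b0 = 50.0
coeffs = [1.2, -3.0, 0.4]   # b1, b2, b3
x = [60, 0, 120]            # one individual's values of X1, X2, X3
y_hat = predict(b0, coeffs, x)   # 50 + 1.2*60 - 3.0*0 + 0.4*120 = 170.0
```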
The interpretation of the raw regression coefficient points to the following
question: how does Xi affect the prediction of Y? The answer is that the
predicted Y experiences an average change of bi with a unit increase in Xi
when the other predictor variables are held constant.
For the logistic regression model, calculating the standard deviation of the
dependent variable is complicated, because the dependent variable is the
logit of Y, not Y itself. The problem has received attention in the literature,
with solutions that provide inconsistent results [1]. The simplest solution is
the one used in the SAS system, although it is not without its problems. The
StdDev of the logit of Y is taken as 1.8138, the value of the standard deviation
of the standard logistic distribution. Thus, the SRC in the logistic regression
model is defined in Equation (7.3):

SRC for Xi = (bi * StdDev of Xi) / 1.8138    (7.3)
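Under the SAS convention just described, the SRC can be computed with a few lines of code; this is a minimal sketch, and the coefficient and standard deviation passed in are illustrative:

```python
import math

# SAS-style standardized coefficient for a logistic regression model:
#   SRC_i = b_i * StdDev(X_i) / (pi / sqrt(3)),
# where pi/sqrt(3) = 1.8138... is the standard deviation of the
# standard logistic distribution.

LOGISTIC_SD = math.pi / math.sqrt(3)   # ~1.8138

def src_logistic(b_i, sd_xi):
    """Standardized coefficient for predictor X_i in a logistic model."""
    return b_i * sd_xi / LOGISTIC_SD
```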
The SRC can also be obtained directly by performing the regression anal-
ysis on standardized data. You will recall that standardizing the dependent
variable Y and the predictor variables Xis creates new variables zY and zXi,
such that their means and standard deviations are equal to zero and one,
respectively. The coefficients obtained by regressing zY on the zXis are, by
definition, the standardized regression coefficients.
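This equivalence can be checked numerically. The sketch below, on synthetic data (the data and seed are assumptions for illustration), confirms that regressing zY on the zXis reproduces the conversion bi * StdDev(Xi) / StdDev(Y) applied to the raw coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """OLS coefficients with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta  # [intercept, b1, b2, ...]

# Route 1: raw regression, then convert each raw coefficient:
#   SRC_i = b_i * StdDev(X_i) / StdDev(Y)
raw = ols(X, y)
src_from_raw = raw[1:] * X.std(axis=0) / y.std()

# Route 2: standardize everything, then regress zY on the zXis directly.
zX = (X - X.mean(axis=0)) / X.std(axis=0)
zy = (y - y.mean()) / y.std()
src_direct = ols(zX, zy)[1:]   # intercept is zero by construction
```

The two routes agree exactly (up to floating-point error), because standardizing is a linear rescaling of each variable, and OLS coefficients transform exactly under such rescalings.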
The question “which variables in the model are its most important pre-
dictors, in ranked order?” needs to be qualified before being answered. The
importance or ranking is traditionally taken in terms of the statistical char-
acteristic of reduction in prediction error. The usual answer is provided by
the decision rule: the variable with the largest SRC is the most important
variable; the variable with the next largest SRC is the next most important
variable; and so on. This decision rule is correct, with the unnoted caveat that the predictor
variables are uncorrelated. There is a rank-order correspondence between the
SRC (the sign of coefficient is ignored) and the reduction in prediction error
in a regression model with only uncorrelated predictor variables. There is
one other unnoted caveat for the proper use of the decision rule: the SRC
TABLE 7.1
Small Data
ID # Response Profit Age Gender Income
1 1 185 65 0 165
2 1 174 56 0 167
3 1 154 57 0 115
4 0 155 48 0 115
5 0 150 49 0 110
6 0 119 40 0 99
7 0 117 41 1 96
8 0 112 32 1 105
9 0 107 33 1 100
10 0 110 37 1 95
TABLE 7.3
PROFIT Regression Output Based on Standardized Small Data

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
INTERCEPT    1   –4.76E-16            0.0704            0.0000    1.0000     0.0000
zINCOME      1    0.3516              0.1275            2.7600    0.0329     0.3516
zAGE         1    0.5181              0.1687            3.0700    0.0219     0.5181
zGENDER      1   –0.1998              0.1209           –1.6500    0.1496    –0.1998
TABLE 7.5
RESPONSE Regression Output Based on Standardized Small Data
Variable    DF   Parameter Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate
INTERCEPT    1   –6.4539              31.8018           0.0412            0.8392
zINCOME      1    1.8339              46.7291           0.0015            0.9687       1.0111
zAGE         1   19.1800              65.5911           0.0855            0.7700      10.5745
zGENDER      1    7.3997              42.5842           0.0302            0.8620       4.0797
TABLE 7.6
Necessary Data
                                     Total        zAGE           OTHERVARS
                                     Predicted    Score-Point    Score-Point
ID #   zAGE     zINCOME   zGENDER    Score        Contribution   Contribution   zAGE_OTHVARS
1 1.7354 1.7915 –0.7746 24.3854 33.2858 –8.9004 3.7398
2 0.9220 1.8657 –0.7746 8.9187 17.6831 –8.7643 2.0176
3 1.0123 –0.0631 –0.7746 7.1154 19.4167 –12.3013 1.5784
4 0.1989 –0.0631 –0.7746 –8.4874 3.8140 –12.3013 0.3100
5 0.2892 –0.2485 –0.7746 –7.0938 5.5476 –12.6414 0.4388
6 –0.5243 –0.6565 –0.7746 –23.4447 –10.0551 –13.3897 0.7510
7 –0.4339 –0.7678 1.1619 –7.5857 –8.3214 0.7357 11.3105
8 –1.2474 –0.4340 1.1619 –22.5762 –23.9241 1.3479 17.7492
9 –1.1570 –0.6194 1.1619 –21.1827 –22.1905 1.0078 22.0187
10 –0.7954 –0.8049 1.1619 –14.5883 –15.2560 0.6677 22.8483
2. Calculate the zAGE SCORE-POINT contribution for each individual
in the data. For individual ID #1, the zAGE score-point contribution
is 33.2858 ( = 1.7354*19.1800).
3. Calculate the OTHER VARIABLES SCORE-POINT contribution for
each individual in the data. For individual ID #1, the OTHER VARI-
ABLES score-point contribution is –8.9004 ( = 24.3854 – 33.2858).
4. Calculate the zAGE_OTHVARS for each individual in the data.
zAGE_OTHVARS is defined as the absolute ratio of the zAGE
SCORE-POINT contribution to the OTHER VARIABLES SCORE-
POINT contribution. For ID #1, zAGE_OTHVARS is 3.7398 ( = abso-
lute value of 33.2858/–8.9004).
5. Calculate PCC(zAGE), the average (median) of the
zAGE_OTHVARS values: 2.8787. The zAGE_OTHVARS distribution
is typically skewed, which suggests that the median is more appro-
priate than the mean for the average.
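The arithmetic of steps 2 through 5 can be sketched in Python, using the values printed in Tables 7.5 and 7.6 for individual ID #1; small rounding differences arise because the tables print only four decimals:

```python
from statistics import median

# Values as printed in Tables 7.5 and 7.6 (rounded to four decimals).
z_age = 1.7354          # ID #1's standardized AGE
b_zage = 19.1800        # logistic coefficient for zAGE (Table 7.5)
total_score = 24.3854   # ID #1's total predicted (logit) score

# Step 2: zAGE score-point contribution.
zage_contrib = z_age * b_zage                # ~33.285 (table: 33.2858)

# Step 3: score-point contribution of the other variables.
other_contrib = total_score - zage_contrib   # ~ -8.90 (table: -8.9004)

# Step 4: zAGE_OTHVARS, the absolute ratio of the two contributions.
zage_othvars = abs(zage_contrib / other_contrib)   # ~3.74 (table: 3.7398)

# Step 5: PCC(zAGE) is the median of the ten zAGE_OTHVARS values
# from Table 7.6 (the skewed distribution favors the median over the mean).
vals = [3.7398, 2.0176, 1.5784, 0.3100, 0.4388,
        0.7510, 11.3105, 17.7492, 22.0187, 22.8483]
pcc_zage = median(vals)   # 2.8787
```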
I summarize the results of the PCC calculations after applying the five-step
process to zINCOME and zGENDER (calculations not shown).
Note that in the summary of findings I do not use the standardized variable
names (e.g., I refer to AGE rather than zAGE). This is to reflect that the issue
at hand in determining the importance of predictor variables is the content
of the variable, which is clearly conveyed by the original name, not by the
technical name, which reflects a mathematical necessity of the calculation
process.
TABLE 7.8
Decile-PCC: Actual Values
Decile PCC(zAGE) PCC(zGENDER) PCC(zINCOME)
Top 3.7398 0.1903 0.1557
2 2.0176 0.3912 0.6224
3 1.5784 0.4462 0.0160
4 0.4388 4.2082 0.0687
5 11.3105 0.5313 0.2279
6 0.3100 2.0801 0.0138
7 22.8483 0.3708 0.1126
8 22.0187 0.2887 0.0567
9 17.7492 0.2758 0.0365
Bottom 0.7510 0.3236 0.0541
Overall 2.8787 0.3810 0.0627
7.6 Summary
I briefly reviewed the traditional approach of using the magnitude of the
raw or standardized regression coefficients to determine which variables in
a model are its most important predictors. I explained why neither coefficient
yields a perfect “importance” ranking of the predictor variables. The raw
regression coefficient does not account for inequality in the units of the
variables, and the standardized regression coefficient is theoretically
problematic when the variables are correlated, as is virtually always the
situation in database applications. With a small data set, I illustrated, for
ordinary and logistic regression models, the often-misused raw regression
coefficient and the dubious use of the standardized regression coefficient in
determining the rank importance of the predictor variables.
I pointed out that the ranking based on the standardized regression
coefficient can still provide useful information without raising sophistic
findings. An implicit working assumption is that the reliability of the
ranking based on the standardized regression coefficient increases as the
average correlation among the predictor variables decreases. Thus, for
well-built models, which necessarily have minimal correlation among the
predictor variables, the resultant ranking is accepted practice.
Reference
1. Menard, S., Applied Logistic Regression Analysis, Series: Quantitative
Applications in the Social Sciences, Sage Publications, Thousand Oaks, CA, 1995.