
Jeroen Smits (http://home.planet.nl/~smits.jeroen) Heckman with SPSS

Estimating the Heckman two-step procedure to control for selection
bias with SPSS

Jeroen Smits (http://home.planet.nl/~smits.jeroen)

September 2003

This paper briefly discusses two main forms of the selection bias
problem and a method which in a number of cases can be used to
control for this kind of bias: the Heckman two-step procedure. It
then gives detailed instructions on how the Heckman procedure can be
applied using the statistical package SPSS.

1. Introduction
Many statistical software packages, such as SAS, STATA, or LIMDEP,
offer the possibility to use the Heckman two-step procedure to
control for selection bias (although the possibilities in these
packages are sometimes rather limited). However, SPSS, the
statistical package widely used by social researchers, offers no
procedure for applying this method. That does not mean that it is
impossible to apply the method with SPSS. With some additional
computations, the SPSS procedures PROBIT or LOGISTIC REGRESSION can
be used to construct a Heckman selection bias control factor. This
control factor can then be added to an OLS regression analysis in
which selection bias is a problem, to produce unbiased parameter
estimates. To also obtain correct standard errors for these
parameters, a further step can be taken in which a WLS regression
analysis is performed using weights constructed from the outcomes of
the earlier steps. This paper gives detailed instructions on how
this can be done.

1.1 Selection bias

There are basically two versions of the selection bias problem. In
the standard case of selection bias, information on the dependent
variable is missing for part of the respondents. For example, if we
want to estimate the effect of education of women on their income,
we meet the problem that many women are not engaged in paid work and
hence have no income. If a substantial part of these nonemployed
women has no job because their returns to education were relatively
low, running a regression with income as dependent variable and
education as one of the predictors may lead to biased estimates of
the effect of education on income.
In the other version of the selection bias problem, information on
the dependent variable is available for all respondents, but the
distribution of respondents over categories of the independent
variable we are interested in has taken place in a selective way.
For example, we may want to study the effect of migration on income,
using a random sample of the population for which we know the income
and whether or not they migrated to another place in the past. If we
simply run a regression with income as dependent variable and a
dummy indicating whether or not the respondent migrated in the past
as one of the independent variables, we may get a biased estimate of
the migration effect because the distribution of respondents over
the categories of migrants and nonmigrants was not random. People
who choose to migrate may differ in many (measured and unmeasured)
characteristics from people who don't. If these characteristics are
related to income, the coefficient of the migration dummy may pick
up these effects and be biased as a result. Controlling for
these differences would solve the problem. However, this is
generally not possible, because in any data set the number of
control factors is limited, whereas the number of possible
differences among individuals is infinite. One can never be sure
that all relevant differences are taken into account. This second
form of selection bias is sometimes called heterogeneity bias.
Common to both forms of selection bias is that there is a
selection process by which individuals are divided over two (or
more) groups (employed/nonemployed; migrants/nonmigrants) and that
nonrandomness in this process disturbs the estimation of other
relationships which are of substantial interest. In other words,
there are two processes (which can be described with two equations,
called "selection equation" and "substantial equation") and these
processes are related to each other. This relationship will be
reflected in a non-zero correlation between the error terms of the
equations. If such a correlation is present, we cannot estimate the
substantial equation without taking the selection process into
account.
Most statistical packages which offer the possibility to estimate
Heckman models restrict themselves to the standard version of the
problem. However, the Heckman two-step procedure can also be used to
address the other form of selection bias. In this paper I give the
SPSS instructions for both.
As an example of the use of the method, I will show how it can be
applied to correct two simple income equations, one in which the
income of working women is explained on the basis of their age and
educational level and one in which the income of respondents is
explained on the basis of their age, educational level and a dummy
indicator of whether or not they migrated in the past.
Readers who want to know more about selection bias or the Heckman
procedure may consult Breen (1996), Winship and Mare (1992), or one
of the classic papers by Heckman (1976, 1979). Part of the
information on how to estimate the Heckman procedure with SPSS was
derived from Ploeg (1993).

1.2 The Heckman procedure

In the first step of the Heckman procedure, the selection process
which is responsible for selection bias problems is studied with the
so-called selection model. The bias is caused by the existence of
differences between employed and nonemployed women (or between
migrated and nonmigrated persons) which are related to their income.
So it is necessary to compare these groups (employed and nonemployed
women; migrants and nonmigrants) to find out what the differences
are. For this purpose, generally a probit model is estimated
(because the error term of this model is normally distributed, one
of the assumptions underlying the Heckman model). However, with some
"tricks" other techniques, such as logit analysis, can also be used.
In our examples, the dependent variable in the probit analysis is
a dummy variable indicating whether or not the woman is employed or
the respondent has migrated. Independent variables in the model are
the (relevant) characteristics of the respondents available in the
data set; in the examples education, age, and the number of
children. In the probit analysis, we estimate the effects of these
variables on the employment/migration decision. However, these
effects themselves are not really of interest, because these
variables are available in the data set and hence we can control for
them in the income analysis. What we really want to know is the
effect of the unmeasured characteristics of the respondents on the
employment/migration decision. Of course, information on the effect
of these unmeasured characteristics is not available in the
coefficients of the explanatory variables. It is, however, contained
in the residuals of the probit analysis. After all, the variation
which remains in the dependent variable after removing the effect of
the known factors can only be caused by the influence of unknown
factors.
In the Heckman procedure, the residuals of the selection equation
are used to construct a selection bias control factor, called
Lambda, which is equivalent to the inverse Mills ratio.
This factor is a summarizing measure which reflects the effects of
all unmeasured characteristics which are related to
employment/migration. The value of this lambda for each of the
respondents is saved and added to the data file as an additional
variable.
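The logic behind Lambda can be illustrated with a small simulation (a sketch with made-up numbers, not part of the procedure itself): when the error terms of the selection and substantial equations are correlated, the average substantial-equation error among selected cases equals rho times sigma times Lambda, which is exactly the quantity the control factor carries into the second step.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

# Sketch: simulate correlated selection (u) and income (e) errors and
# check that, among selected cases (u > -IPS), the mean income error
# equals rho * sigma_e * Lambda(IPS), Lambda being the inverse Mills ratio.
random.seed(1)
nd = NormalDist()
ips, rho, sigma_e = 0.5, 0.6, 1.0       # hypothetical values

selected_errors = []
for _ in range(200_000):
    u = random.gauss(0, 1)                                        # selection error
    e = sigma_e * (rho * u + sqrt(1 - rho**2) * random.gauss(0, 1))
    if u > -ips:                                                  # case is observed
        selected_errors.append(e)

lam = nd.pdf(ips) / nd.cdf(ips)                                   # Lambda at IPS
print(round(mean(selected_errors), 3), round(rho * sigma_e * lam, 3))
```

The simulated mean and the analytical value agree closely, which is why adding Lambda as a regressor absorbs the selection effect.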
In the second step of the Heckman procedure, the analysis in which
we are primarily interested is performed, in this case an OLS
regression analysis of the effects of education/migration on income.
In this substantial analysis we use
the selection bias control factor Lambda as an additional
independent variable. Because this factor reflects the effect of all
the unmeasured characteristics which are related to the
employment/migration decision, the coefficient of this factor in the
substantial analysis catches the part of the effect of these
characteristics which is related to income. Because we now have a
control factor in the analysis for the effect of the income related
unmeasured characteristics which are also related to the
employment/migration decision, the other predictors in the equation
are freed from this effect and the regression analysis produces
unbiased coefficients for them.

1.3 Limitations

Before turning to the practical estimation of the Heckman model, a
word of caution is in place. Although the procedure sounds rather
good in theory, applying it in practice is not so straightforward.
An important condition for its use is that the selection equation
contains at least one variable which is not related to the dependent
variable in the substantial equation. If such a variable is not
present (and sometimes even if it is), severe problems of
multicollinearity may arise, and adding the correction factor to the
substantial equation may lead to estimation difficulties and
unreliable coefficients.
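This danger can be made concrete with a small numeric sketch (variable names and coefficient values below are hypothetical): when the selection index is driven by a variable that also appears in the substantial equation, Lambda is a smooth, nearly linear function of that variable over the observed range and correlates almost perfectly with it.

```python
from statistics import NormalDist, mean, stdev

nd = NormalDist()
eduw = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
# Hypothetical selection index driven by education alone:
ips = [0.35 + 0.048 * e for e in eduw]
lam = [nd.pdf(z) / nd.cdf(z) for z in ips]   # the control factor LAMBDA

# Pearson correlation between the predictor and LAMBDA:
mx, my = mean(eduw), mean(lam)
cov = sum((x - mx) * (y - my) for x, y in zip(eduw, lam)) / (len(eduw) - 1)
r = cov / (stdev(eduw) * stdev(lam))
print(round(r, 3))   # close to -1: LAMBDA nearly collinear with EDUW
```

An exclusion variable (one that shifts selection but not the outcome) breaks this near-perfect dependence and makes the second step identifiable in practice.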

2. Estimation of the standard version with SPSS

2.1 The selection model

2.1.1 Computation of LAMBDA with SPSS PROBIT

To compute the Heckman correction factor Lambda with a PROBIT
selection model, the SPSS procedure PROBIT can be used. This
procedure is a bit laborious because, after estimation of the model,
the parameter estimates must be typed by hand into a formula to
compute the predicted values of the model (which we need for
computing Lambda). As an alternative, a logit model can be estimated
with the procedure LOGISTIC REGRESSION, which offers the possibility
to save the predicted values automatically. However, in that case a
kind of "trick" must be used to translate the predicted values of
the logit model into "quasi-probit" scores. This alternative is
discussed later.
In our example on the effect of education of women on their
income, the selection model contains the variables age (AGEW) and
education (EDUW) of the woman and number of children (CHILD). The
dependent variable PARTW is an indicator variable with value 1 for
women participating in the labor force and a value 0 for other
women. With PROBIT the procedure goes as follows:

compute SUBJ=1.
PROBIT PARTW of SUBJ with AGEW EDUW CHILD
/log=none /print=none.

In the output of this analysis, we find the estimates of the
parameters. On the basis of these parameters, the predicted probit
score can be computed for each respondent by typing the parameter
values into the following formula (using an SPSS compute statement):

compute IPS = 0.35020-0.04691*AGEW+0.47745*EDUW+0.46660*CHILD.

With this COMPUTE statement, the individual probit scores (IPS) are
computed and added to the temporary data file. These probit scores
are used to compute the Heckman control factor LAMBDA.

compute LAMBDA =
((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).
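For readers who want to verify their Lambda values outside SPSS, the two compute statements above translate directly into a short Python sketch (the coefficients are those of the example output; the respondent's values are hypothetical):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def lambda_from_probit(agew, eduw, child):
    """Mirror of the two SPSS compute statements, with the probit
    coefficients from the example typed in by hand."""
    ips = 0.35020 - 0.04691 * agew + 0.47745 * eduw + 0.46660 * child
    # Numerator: standard normal density at IPS; denominator: its cdf.
    lam = ((1 / sqrt(2 * pi)) * exp(-ips * ips * 0.5)) / NormalDist().cdf(ips)
    return ips, lam

# Hypothetical respondent: age 30, education level 2, one child.
ips, lam = lambda_from_probit(agew=30, eduw=2, child=1)
print(round(ips, 4), round(lam, 4))
```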

For applying the two-step procedure it is important that all
respondents with missing values on variables which are used in the
substantial analyses are removed from the active file. This ensures
that all subsequent computations are done on the basis of the same
group of respondents. For example:

select if (INCW>0 and EDUW ne -9 and ....).

Now the auxiliary control factor DELTA is computed:

compute DELTA = -LAMBDA*IPS-LAMBDA*LAMBDA.

The value of DELTA should be between -1 and 0, which offers the
possibility to check whether LAMBDA has been computed correctly:

DESCR DELTA /statistics = min max.
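The check works because DELTA lies in (-1, 0) for any real probit score, as a quick sketch confirms; a DESCR minimum or maximum outside that range therefore signals an error in the Lambda computation.

```python
from statistics import NormalDist

# Sketch: DELTA = -LAMBDA*IPS - LAMBDA*LAMBDA stays within (-1, 0)
# across the whole range of plausible probit scores.
nd = NormalDist()
for ips in (-3.0, -1.0, 0.0, 1.0, 3.0):
    lam = nd.pdf(ips) / nd.cdf(ips)
    delta = -lam * ips - lam * lam
    assert -1 < delta < 0
print("DELTA lies in (-1, 0) at all test scores")
```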

2.1.2 Computation of LAMBDA with SPSS LOGISTIC REGRESSION

A disadvantage of the procedure PROBIT is that it cannot compute
predicted values. The procedure LOGISTIC REGRESSION can. Because Lee
(1983) has developed a method to estimate the selection model with
logit analysis, LOGISTIC REGRESSION offers a less laborious
alternative for computing LAMBDA. Estimating the selection model
with LOGISTIC REGRESSION goes as follows:

LOGISTIC REGRESSION PARTW with AGEW EDUW CHILD
/save pred (IKL).

With the instruction " /save pred (IKL) " a new variable named IKL
is created and saved, containing the individual probabilities
predicted by the model. Using the inverse cumulative distribution
function of the normal distribution, these individual probabilities
are translated into the form they would have had if they had been
computed on the basis of a probit model:

compute IPS = probit(IKL).

The variable IPS now contains the quasi-probit scores and can be
used to compute LAMBDA in the same way as when using a probit
selection model:

compute LAMBDA =
((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).
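The translation step can be sketched in Python: SPSS's probit() function is the inverse cumulative normal, so applying it to a logit-predicted probability yields the quasi-probit score, and cdfnorm() recovers the probability exactly.

```python
from statistics import NormalDist

# Sketch of Lee's (1983) translation: probit(IKL) turns a predicted
# probability into a quasi-probit score IPS; cdfnorm(IPS) inverts it.
nd = NormalDist()

def quasi_probit(p):
    return nd.inv_cdf(p)        # SPSS: compute IPS = probit(IKL).

ips = quasi_probit(0.75)
print(round(nd.cdf(ips), 2))    # recovers 0.75
```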

Again, cases with missing values on variables involved in the
substantial analysis should be removed from the active file:

select if (INCW>0 and EDUW ne -9 and ....).



Computation of the auxiliary control factor DELTA and a check that
its value lies between -1 and 0:

compute DELTA = -LAMBDA*IPS-LAMBDA*LAMBDA.
DESCR DELTA /statistics = min max.

2.2 The substantial analysis

Now that LAMBDA is known, we can use it as a correction factor to
control for selection bias in the substantial analysis, which is an
OLS regression analysis estimated with the procedure REGRESSION:

REGRESSION /dep=INCW
/method=enter AGEW EDUW LAMBDA
/save resid (RES).

This analysis produces unbiased parameter estimates for the
independent variables. However, the standard errors of these
parameters are biased because of heteroskedasticity: the variance of
the error term is not the same for each respondent. To get better
standard errors, several additional steps have to be taken.

2.2.1 Correcting the error terms

First, a command was added to the substantial regression analysis to
save the residuals of the regression model in a new variable (called
RES). This variable must be squared:

compute RES2 = RES*RES.

Besides RES2, two auxiliary variables must be computed. The first
one is the regression coefficient of LAMBDA in the OLS analysis,
which is called LAMB. The second one is the number of cases used in
the OLS regression, called N. In the example:

compute LAMB=0.002648.
compute N=9024.

The variables RES2 and DELTA (the latter computed in the first part
of the analysis) have to be summed over all cases. In SPSS this can
be done automatically by first saving the aggregated totals in a
separate file and then reading them in again:

compute HELP = 1.
AGGREGATE /outfile=A /break=HELP
/RESS=sum(RES2)
/DELTAS=sum(DELTA).
MATCH FILES /table=A /file=* /by HELP.

Now the corrected value of the variance (VARC) and the standard
error (SEC) of the error term of the substantial equation can be
estimated:

compute VARC = RESS/N-LAMB*LAMB*DELTAS/N.
compute SEC = sqrt(VARC).

Computation of RHO, the correlation between the error terms of the
selection and substantial equations:

compute RHO = sqrt(LAMB*LAMB/VARC).
If (LAMB<0) RHO = 0-RHO.
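The same arithmetic can be sketched in Python with made-up summary numbers (RESS, DELTAS, LAMB, and N below are hypothetical stand-ins, not the paper's output):

```python
from math import sqrt

# Sketch of the variance correction and RHO with hypothetical inputs.
ress = 120.0      # sum of squared OLS residuals over all cases
deltas = -400.0   # sum of DELTA (each DELTA lies between -1 and 0)
lamb = 0.25       # OLS regression coefficient of LAMBDA
n = 1000          # number of cases

varc = ress / n - lamb * lamb * deltas / n   # corrected error variance
sec = sqrt(varc)                             # corrected standard error
rho = sqrt(lamb * lamb / varc)               # correlation of error terms
if lamb < 0:
    rho = -rho
print(round(varc, 4), round(sec, 4), round(rho, 4))
```

Because DELTAS is negative, the corrected variance VARC is larger than the naive residual variance RESS/N.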

Now we can take a look at VARC, SEC, and RHO:

report /variables=VARC SEC RHO /break=(total)
 /summary=mean (VARC(4) SEC(4) RHO(4)).

Computation of the standard deviation of the error term of each
separate observation (RHOI) and transformation of these values into
weights (WGT):

compute RHOI = sqrt(VARC+LAMB*LAMB*DELTA).
compute WGT = 1/RHOI.
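A small sketch with hypothetical values shows how these weights behave: cases with a more negative DELTA have a smaller error variance and therefore receive a larger weight in the WLS step.

```python
from math import sqrt

# VARC and LAMB as in the previous sketch; per-case DELTA values made up.
varc, lamb = 0.145, 0.25
delta_per_case = [-0.9, -0.6, -0.3, -0.05]   # DELTA always in (-1, 0)

weights = []
for delta in delta_per_case:
    rhoi = sqrt(varc + lamb * lamb * delta)  # sd of this case's error
    weights.append(1 / rhoi)                 # WGT = 1/RHOI

print([round(w, 3) for w in weights])
```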

Now the corrected standard errors can be computed by running the
substantial analysis again, but this time as a Weighted Least Squares
(WLS) regression with WGT as weight:

REGRESSION /dep=INCW
/method=enter AGEW EDUW LAMBDA
/regwgt=WGT.
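What /regwgt does can be sketched in a few lines (one predictor plus intercept, made-up data): it minimizes the weighted sum of squared residuals, which the weighted normal equations solve in closed form.

```python
# Minimal WLS sketch: minimize sum(w_i * e_i^2) for y = a + b*x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
w = [4.0, 2.0, 1.0, 0.5]   # the weights WGT = 1/RHOI

sw = sum(w)
swx = sum(wi * xi for wi, xi in zip(w, x))
swy = sum(wi * yi for wi, yi in zip(w, y))
swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

# Weighted normal equations for slope and intercept:
slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
intercept = (swy - slope * swx) / sw
print(round(slope, 4), round(intercept, 4))
```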

By combining the parameter estimates of the substantial analysis
with the standard errors of this WLS analysis, the Heckman procedure
is completed. As the explained variance R2 of the analysis, the R2 of
the substantial (OLS) analysis should be reported.

3. Correcting for heterogeneity bias with SPSS

3.1 The selection model

3.1.1 Computation of LAMBDA with SPSS PROBIT

In our example on the effects of migration on income, the selection
model contains the variables age (AGE) and education (EDU) of the
respondent and number of children (CHILD). The dependent variable
MIGR is an indicator variable with value 1 for respondents who
migrated in the past and a value 0 for other respondents. With
PROBIT the procedure goes as follows:

Estimation of the selection model:

compute SUBJ=1.
PROBIT MIGR of SUBJ with AGE EDU CHILD
/log=none /print=none.

Also in this case, the parameter estimates should be typed by hand
into a formula:

compute IPS = -2.2372-0.02059*AGE+0.32067*EDU+0.05907*CHILD.

The individual probit scores computed with this statement (IPS) are
again used to compute LAMBDA. With heterogeneity bias this
computation is more complex than in the classical selection bias
situation. Because migrants and nonmigrants need correction factors
with opposite signs, LAMBDA has to be computed separately for the
two groups.
First we compute LAMBDA for the migrants:

If (MIGR=1) LAMBDA =
((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).

Then for the nonmigrants:

If (MIGR=0) LAMBDA =
-((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/(1-cdfnorm(IPS)).
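The two IF statements above can be sketched as a single Python function (values hypothetical), making the sign pattern explicit: migrants get phi/Phi, nonmigrants get -phi/(1 - Phi).

```python
from statistics import NormalDist

# Sketch of the two-group correction factor with opposite signs.
nd = NormalDist()

def lambda_heterogeneity(ips, migrated):
    if migrated:
        return nd.pdf(ips) / nd.cdf(ips)           # MIGR = 1
    return -nd.pdf(ips) / (1 - nd.cdf(ips))        # MIGR = 0

ips = 0.4
print(round(lambda_heterogeneity(ips, True), 4),
      round(lambda_heterogeneity(ips, False), 4))
```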

From this point on, the procedure is the same as in the classical
case.

3.1.2 Computation of LAMBDA with SPSS LOGISTIC REGRESSION

Estimating the selection model with LOGISTIC REGRESSION:

LOGISTIC REGRESSION MIGR with AGE EDU CHILD
/save pred (IKL).

The individual probabilities estimated with the logit model are
again transformed into the form they would have had if they had been
computed with a probit model:

compute IPS = probit(IKL).

Using the quasi-probit scores IPS, we again compute LAMBDA
separately for migrants and nonmigrants. First the migrants:

If (MIGR=1) LAMBDA =
((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).

Then the nonmigrants:

If (MIGR=0) LAMBDA =
-((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/(1-cdfnorm(IPS)).

From this point on, the procedure is the same as in the classical
case.

3.2 The substantial analysis

Now that LAMBDA is known for the migration model, the substantial
analysis with the correction factor can be performed:

REGRESSION /dep=INC
/method=enter AGE EDU MIGR LAMBDA
/save resid (RES).

Because this is a case of heterogeneity bias, the migration dummy
(MIGR) has to be in the model too. The coefficient of LAMBDA
indicates whether there is selection bias and what the direction of
this bias is. For example, a significant positive coefficient
indicates that migrants compared to nonmigrants have unmeasured
characteristics which are positively related to income. The
coefficient of MIGR shows how large the difference in income between
migrants and nonmigrants is, after control for the unmeasured
differences between the groups.
Also in the migration analysis, the standard errors of the
regression coefficients have to be corrected for heteroskedasticity.
This can be done with the same procedure as in the example of
classical selection bias (section 2.2.1).

References

Breen, Richard (1996). Regression Models: Censored, Sample Selected,
or Truncated Data. Sage University Paper no. 111. Thousand Oaks:
Sage.

Heckman, James J. (1976). The Common Structure of Statistical Models
of Truncation, Sample Selection and Limited Dependent Variables and
a Simple Estimator for Such Models. Annals of Economic and Social
Measurement, 5: 475-492.

Heckman, James J. (1979). Sample Selection Bias as a Specification
Error. Econometrica, 47: 153-161.

Lee, Lung-Fei (1983). Generalized Econometric Models With
Selectivity. Econometrica, 51: 507-513.

Ploeg, Sjerp van der (1993). The Expansion of Secondary and Tertiary
Education in the Netherlands. PhD Thesis, Tilburg University.
Nijmegen: ITS.

Winship, Christopher, and Robert D. Mare (1992). Models for Sample
Selection Bias. Annual Review of Sociology, 18: 327-350.
