
The Stata Journal

Volume 10 Number 3 2010

A Stata Press publication


StataCorp LP
College Station, Texas
The Stata Journal
Editor
H. Joseph Newton
Department of Statistics
Texas A&M University
College Station, Texas 77843
979-845-8817; fax 979-845-6077
jnewton@stata-journal.com
Editor
Nicholas J. Cox
Department of Geography
Durham University
South Road
Durham DH1 3LE UK
n.j.cox@stata-journal.com
Associate Editors
Christopher F. Baum
Boston College
Nathaniel Beck
New York University
Rino Bellocco
Karolinska Institutet, Sweden, and
University of Milano-Bicocca, Italy
Maarten L. Buis
Tübingen University, Germany
A. Colin Cameron
University of California–Davis
Mario A. Cleves
Univ. of Arkansas for Medical Sciences
William D. Dupont
Vanderbilt University
David Epstein
Columbia University
Allan Gregory
Queen's University
James Hardin
University of South Carolina
Ben Jann
University of Bern, Switzerland
Stephen Jenkins
University of Essex
Ulrich Kohler
WZB, Berlin
Frauke Kreuter
University of Maryland–College Park
Peter A. Lachenbruch
Oregon State University
Jens Lauritsen
Odense University Hospital
Stanley Lemeshow
Ohio State University
J. Scott Long
Indiana University
Roger Newson
Imperial College, London
Austin Nichols
Urban Institute, Washington DC
Marcello Pagano
Harvard School of Public Health
Sophia Rabe-Hesketh
University of California–Berkeley
J. Patrick Royston
MRC Clinical Trials Unit, London
Philip Ryan
University of Adelaide
Mark E. Schaffer
Heriot-Watt University, Edinburgh
Jeroen Weesie
Utrecht University
Nicholas J. G. Winter
University of Virginia
Jeffrey Wooldridge
Michigan State University
Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: Deirdre Patterson and Erin Roberson
The Stata Journal publishes reviewed papers together with shorter notes or comments,
regular columns, book reviews, and other material of interest to Stata users. Examples
of the types of papers include 1) expository papers that link the use of Stata commands
or programs to associated principles, such as those that will serve as tutorials for users
first encountering a new field of statistics or a major new technique; 2) papers that go
beyond the Stata manual in explaining key features or uses of Stata that are of interest
to intermediate or advanced users of Stata; 3) papers that discuss new commands or
Stata programs of interest either to a wide spectrum of users (e.g., in data management
or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival
analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing
the statistical properties of new or existing estimators and tests in Stata; 5) papers
that could be of interest or usefulness to researchers, especially in fields that are of
practical importance but are not often included in texts or other journals, such as the
use of Stata in managing datasets, especially large datasets, with advice from hard-won
experience; and 6) papers of interest to those who teach, including Stata with topics
such as extended examples of techniques and interpretation of results, simulations of
statistical concepts, and overviews of subject areas.
For more information on the Stata Journal, including information for authors, see the
webpage
http://www.stata-journal.com
The Stata Journal is indexed and abstracted in the following:
CompuMath Citation Index®
Current Contents/Social and Behavioral Sciences®
RePEc: Research Papers in Economics
Science Citation Index Expanded (also known as SciSearch®)
Scopus™
Social Sciences Citation Index®
Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Mata, NetCourse,
and Stata Press are registered trademarks of StataCorp LP.
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station,
Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html
Subscription rates
The listed subscription rates include both a printed and an electronic copy unless oth-
erwise mentioned.
Subscriptions mailed to U.S. and Canadian addresses:
3-year subscription $195
2-year subscription $135
1-year subscription $ 69
1-year student subscription $ 42
1-year university library subscription $ 89
1-year institutional subscription $195
Subscriptions mailed to other countries:
3-year subscription $285
2-year subscription $195
1-year subscription $ 99
3-year subscription (electronic only) $185
1-year student subscription $ 69
1-year university library subscription $119
1-year institutional subscription $225
Back issues of the Stata Journal may be ordered online at
http://www.stata.com/bookstore/sjj.html
Individual articles three or more years old may be accessed online without charge. More
recent articles may be ordered online.
http://www.stata-journal.com/archives.html
The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive,
College Station, TX 77845, USA, or emailed to sj@stata.com.
Volume 10 Number 3 2010
The Stata Journal
Articles and Columns 315
An introduction to maximum entropy and minimum cross-entropy estimation us-
ing Stata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Wittenberg 315
bacon: An effective way to detect outliers in multivariate data using Stata (and
Mata) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . S. Weber 331
Comparing the predictive powers of survival models using Harrell's C or Somers'
D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R. B. Newson 339
Using Stata with PHASE and Haploview: Commands for importing and exporting
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. C. Huber Jr. 359
simsum: Analyses of simulation studies including Monte Carlo error . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I. R. White 369
Projection of power and events in clinical trials with a time-to-event outcome . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. Royston and F. M.-S. Barthel 386
metaan: Random-effects meta-analysis . . . . . . . . . . E. Kontopantelis and D. Reeves 395
Regression analysis of censored data using pseudo-observations . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E. T. Parner and P. K. Andersen 408
Estimation of quantile treatment effects with Stata . . . . . M. Frölich and B. Melly 423
Translation from narrative text to standard codes variables with Stata. . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F. Belotti and D. Depalo 458
Speaking Stata: The limits of sample skewness and kurtosis . . . . . . . . . . . N. J. Cox 482
Notes and Comments 496
Stata tip 89: Estimating means and percentiles following multiple imputation. . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. A. Lachenbruch 496
Stata tip 90: Displaying partial results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Weiss 500
Stata tip 91: Putting unabbreviated varlists into local macros . . . . . . . . . N. J. Cox 503
Software Updates 505
The Stata Journal (2010)
10, Number 3, pp. 315330
An introduction to maximum entropy and
minimum cross-entropy estimation using Stata
Martin Wittenberg
University of Cape Town
School of Economics
Cape Town, South Africa
Martin.Wittenberg@uct.ac.za
Abstract. Maximum entropy and minimum cross-entropy estimation are applica-
ble when faced with ill-posed estimation problems. I introduce a Stata command
that estimates a probability distribution using a maximum entropy or minimum
cross-entropy criterion. I show how this command can be used to calibrate survey
data to various population totals.
Keywords: st0196, maxentropy, maximum entropy, minimum cross-entropy, survey
calibration, sample weights
1 Ill-posed problems and the maximum entropy criterion
All too many situations involve more unknowns than data points. Standard forms
of estimation are impossible when faced with such ill-posed problems (Mittelhammer,
Judge, and Miller 2000). One approach that is applicable in these cases is estimation
by maximizing an entropy measure (Golan, Judge, and Miller 1996). The purpose of
this article is to introduce the concept and to show how to apply it using the new
Stata command maxentropy. My discussion of the technique follows the treatment in
Golan, Judge, and Miller (1996). Furthermore, I show how a maximum entropy ap-
proach can be used to calibrate survey data to various population totals. This approach
is equivalent to the iterative raking procedure of Deming and Stephan (1940) or the
multiplicative method implemented in the calibration on margins (CALMAR) algorithm
(Deville and Särndal 1992; Deville, Särndal, and Sautory 1993).
The idea of maximum entropy estimation was motivated by Jaynes (1957, 621) in
terms of the problem of finding the probability distribution (p_1, p_2, \ldots, p_n) for the set
of values (x_1, x_2, \ldots, x_n), given only their expectation,

E\{f(x)\} = \sum_{i=1}^{n} p_i f(x_i)
For concreteness, we consider a die known to have E(x) = 3.5, where x = (1, 2, 3, 4, 5, 6),
and we want to determine the associated probabilities. Clearly, there are infinitely many
possible solutions, but the obvious one is p_1 = p_2 = \cdots = p_6 = 1/6. The obviousness is
based on Laplace's principle of insufficient reason, which states that two events should
be assigned equal probability unless there is a reason to think otherwise (Jaynes 1957,
622). This negative reason is not much help if, instead, we know that E(x) = 4.
Jaynes's solution was to tackle this from the point of view of Shannon's information
theory. Jaynes wanted a criterion function H(p_1, p_2, \ldots, p_n) that would summarize the
uncertainty about the distribution. This is given uniquely by the entropy measure

H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln(p_i)

where p_i \ln(p_i) is defined to be zero if p_i = 0, for some positive constant K. The
solution to Jaynes's problem is to pick the distribution (p_1, p_2, \ldots, p_n) that maximizes
the entropy, subject only to the constraints

E\{f(x)\} = \sum_i p_i f(x_i), \qquad \sum_i p_i = 1
As Golan, Judge, and Miller (1996, 8-10) show, if our knowledge of E\{f(x)\} is
based on the outcome of N (very large) trials, then the distribution function p = (p_1, p_2,
\ldots, p_n) that maximizes the entropy measure is the distribution that can give rise to
the observed outcomes in the greatest number of ways, which is consistent with what
we know. Any other distribution requires more information to justify it. Degenerate
distributions, ones where p_i = 1 and p_j = 0 for j \neq i, have entropy of zero. That is to
say, they correspond to zero uncertainty and therefore maximal information.
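For instance, with K = 1 the uniform distribution over the six faces of a die attains the maximum possible entropy, \ln 6, while a degenerate distribution attains the minimum:

H(1/6, \ldots, 1/6) = -\sum_{i=1}^{6} \frac{1}{6}\ln\frac{1}{6} = \ln 6 \approx 1.79, \qquad H(1, 0, \ldots, 0) = 0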
2 Maximum entropy and minimum cross-entropy estimation
More formally, the maximum entropy problem can be represented as

\max_p H(p) = -\sum_{i=1}^{n} p_i \ln(p_i)

such that

y_j = \sum_{i=1}^{n} X_{ji} p_i, \quad j = 1, \ldots, J \qquad (1)

\sum_{i=1}^{n} p_i = 1 \qquad (2)
The J constraints given in (1) can be thought of as moment constraints, with y_j being
the population mean of the X_j random variable. To solve this problem (equivalently, to
minimize -H(p) subject to the same constraints), we set up the Lagrangian function

L = p' \ln(p) - \lambda'(X'p - y) - \mu(p'1 - 1)
where X is the n × J data matrix,¹ λ is a vector of Lagrange multipliers, and 1 is a
column vector of ones.
The first-order conditions for an interior solution (that is, one in which the vector
p is strictly positive) are given by

\partial L/\partial p = \ln(p) + 1 - X\lambda - \mu 1 = 0 \qquad (3)

\partial L/\partial \lambda = y - X'p = 0 \qquad (4)

\partial L/\partial \mu = 1 - p'1 = 0 \qquad (5)
These equations can be solved for \hat{\lambda}, and the solution for p is given by

\hat{p} = \exp(X\hat{\lambda}) / \Omega(\hat{\lambda})

where

\Omega(\hat{\lambda}) = \sum_{i=1}^{n} \exp(x_i \hat{\lambda})

and x_i is the ith row vector of the matrix X.
The maximum entropy framework can be extended to incorporate prior information
about p. Assuming that we have the prior probability distribution q = (q_1, q_2, \ldots, q_n),
then the cross-entropy is defined as (Golan, Judge, and Miller 1996, 11)

I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = p' \ln(p) - p' \ln(q)

The cross-entropy can be thought of as a measure of the additional information required
to go from the distribution q to the distribution p. The principle of minimum cross-
entropy asserts that we should pick the distribution p that meets the moment constraints
(1) and the normalization restriction (2) while requiring the least additional information;
that is, we should pick the one that is in some sense closest to q. Formally, we minimize
I(p, q), subject to the restrictions. Maximum entropy estimation is merely a variant
of minimum cross-entropy estimation where the prior q is the uniform distribution
(1/n, 1/n, \ldots, 1/n).
1. In the Golan, Judge, and Miller (1996) book, the constraint is written as y = Xp, where X is J × n.
For the applications considered below, it is more natural to write the data matrix in the form shown
here.
The solution of this problem is given by (Golan, Judge, and Miller 1996, 29)

p_i = q_i \exp(x_i \hat{\lambda}) / \Omega(\hat{\lambda}) \qquad (6)

where

\Omega(\hat{\lambda}) = \sum_{i=1}^{n} q_i \exp(x_i \hat{\lambda}) \qquad (7)

The most efficient way to calculate the estimates is, in fact, not by numerical solution
of the first-order conditions [along the lines of (3), (4), and (5)] but by the unconstrained
maximization of the dual problem as discussed further in section 3.5.
3 The maxentropy command
3.1 Syntax
The syntax of the maxentropy command is
maxentropy [constraint] varlist [if] [in], generate(varname[, replace])
    [prior(varname) log total(#) matrix(matrix)]

The maxentropy command must identify the set of population constraints contained in
the y vector. These population constraints can be specified either as constraint or as
matrix in the matrix() option. If neither of these optional arguments is specified, it is
assumed that varlist is y and then X.
The command requires that a varname be specied in the generate() option, in
which the estimated p vector will be returned.
3.2 Description
maxentropy provides minimum cross-entropy or maximum entropy estimates of ill-
posed inverse problems, such as Jaynes's dice problem. The command can also
be used to calibrate survey datasets to external totals along the lines of the multi-
plicative method implemented in the SAS CALMAR macro (Deville and Särndal 1992;
Deville, Särndal, and Sautory 1993). This is a generalization of iterative raking as im-
plemented, for instance, in Nick Winter's survwgt command, which is available from
the Statistical Software Components archive (type net search survwgt).
3.3 Options
generate(varname[, replace]) provides the variable name in which Stata will store
the probability estimates. This must be a new variable name, unless the replace
suboption is specified, in which case the existing variable is overwritten. generate()
is required.
prior(varname) requests minimum cross-entropy estimation with the vector of prior
probabilities q given in the variable varname. If prior() is not specied, then
maximum entropy estimates are returned.
log is necessary only if the command is failing to converge. This option specifies to
display the output from the maximum likelihood subroutine that is used to calculate
the vector λ. The iteration log might provide some diagnostics on what is going
wrong.
total(#) is required if raising weights rather than probabilities are desired. The
number must be the population total to which the weights are supposed to be
summed.
matrix(matrix) passes the constraint vector contained in matrix. This must be a col-
umn vector that must have as many elements as are given in varlist. The order of
the constraints in the vector must correspond to the order of the variables given in
varlist. If no matrix is specified, then maxentropy will look for the constraints in
the first variable after the command. This variable must have the constraints listed
in the first J positions corresponding to the J variables listed in varlist.
3.4 Output
maxentropy returns output in three forms. First, it returns estimates of the coeffi-
cients. The absolute magnitude of the coefficient is an indication of how informative
the corresponding constraint is, that is, how far it moves the resulting p distribution
away from the prior q distribution in the cross-entropy case or away from the uniform
distribution in the maximum entropy case.
Second, the estimates of p are returned in the variable specified by the user. Third,
the vector of constraints y is returned in the matrix e(constraint), with the rows of
the matrix labeled according to the variable whose constraint that row represents.
Example
Consider Jaynes's die problem described earlier. Specifically, let us calculate the
probabilities if we know that the mean of the die is 4. We set the problem up by
creating the x variable, which contains the discrete distribution of outcomes, that is,
(1, 2, 3, 4, 5, 6). The y vector contains the mean 4.
. set obs 6
obs was 0, now 6
. generate x = _n
. matrix y = (4)
. maxentropy x, matrix(y) generate(p4)
Cross entropy estimates
Variable lambda
x .17462893
p values returned in p4
constraints given in matrix y
The λ value corresponding to the constraint E(x) = 4 is 0.1746289, so the constraint
is informative, that is, the resulting distribution is no longer the uniform one. The
message at the end reminds us where the rest of the output is to be obtained (that is,
in the p4 variable) and that the constraints were passed by means of a Stata matrix.
To see the p estimate itself, we can just list the variable:
. list x p4, noobs sep(10)
x p4
1 .1030653
2 .1227305
3 .146148
4 .1740337
5 .2072401
6 .2467824
The distribution is weighted toward the larger numbers. We can check that these
estimates obey the restrictions:
. generate xp4=x*p4
. quietly summarize xp4
. display r(sum)
4
Finally, we can retrieve a copy of the constraint matrix labeled with the correspond-
ing variables.
. matrix list e(constraint)
symmetric e(constraint)[1,1]
c1
x 4
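The entropy of the fitted distribution can also be computed directly from the returned probabilities; a small check using the p4 variable just created:

. generate plogp = -p4*ln(p4)
. quietly summarize plogp
. display r(sum)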
3.5 Methods and formulas
Instead of solving the constrained optimization problem given by the first-order condi-
tions [(3) to (5)] or their cross-entropy analogues, Golan, Judge, and Miller (1996, 30)
show that the solution can be found by maximizing the unconstrained dual cross-entropy
objective function
L(\lambda) = \sum_{j=1}^{J} \lambda_j y_j - \ln\{\Omega(\lambda)\} = M(\lambda) \qquad (8)
where \Omega(\lambda) is given by (7). Golan, Judge, and Miller show that this function behaves
like a maximum likelihood. In this case,

\nabla M(\lambda) = y - X'p \qquad (9)
so that the constraint is met at the point where the gradient is zero. Furthermore,

-\frac{\partial^2 M}{\partial \lambda_j^2} = \sum_{i=1}^{n} p_i x_{ji}^2 - \left(\sum_{i=1}^{n} p_i x_{ji}\right)^2 = \mathrm{var}(x_j) \qquad (10)

-\frac{\partial^2 M}{\partial \lambda_j \, \partial \lambda_k} = \sum_{i=1}^{n} p_i x_{ji} x_{ki} - \left(\sum_{i=1}^{n} p_i x_{ji}\right)\left(\sum_{i=1}^{n} p_i x_{ki}\right) = \mathrm{cov}(x_j, x_k) \qquad (11)
where the variances and covariances are taken with respect to the distribution p. The
negative of the Hessian of M is therefore guaranteed to be positive definite, which
guarantees a unique solution provided that the constraints are not inconsistent.
Golan, Judge, and Miller (1996, 25) note that the function M can be thought of
as an expected log likelihood, given the exponential family p(λ) parameterized by λ.
Along these lines, we use Stata's maximum likelihood routines to estimate λ, giving it
the dual objective function [(8)], gradient [(9)], and negative Hessian [(10) and (11)].
The routine that calculates these is contained in maxentlambda_d2.ado. Because of
the globally concave nature of the objective function, convergence should be relatively
quick, provided that there is a feasible solution in the interior of the parameter space.
The command checks for some obvious errors; for example, the population means (y_j)
must be inside the range of the X_j variables. If any mean is on the boundary of the
range, then a degenerate solution is feasible, but the corresponding Lagrange multiplier
will be infinite, so the algorithm will not converge.
Once the estimates of λ have been obtained, estimates of p are derived from (6).
3.6 Saved results
maxentropy saves the following in e():
Macros
  e(cmd)          maxentropy              e(properties)  b V
Matrices
  e(b)            coefficient estimates   e(V)           inverse of negative Hessian
  e(constraint)   constraint vector
Functions
  e(sample)       marks estimation sample
3.7 A cautionary note
The estimation routine treats λ as though it were estimated by maximum likelihood.
This is true only if we can write p as

p \propto \exp(X\lambda)

Given that assumption, we could test hypotheses on the parameters. Because the esti-
mation routine calculates the inverse of the negative of the Hessian (that is, the asymp-
totic covariance matrix of \hat{\lambda} under this parametric assumption), it would be possible to
implement such tests. For most practical applications, this parametric interpretation of
the procedure is likely to be dubious.
4 Sample applications
4.1 Jaynes's die problem
In section 3.4, I showed how to calculate the probability distribution given that y = 4.
The following code generates predictions given different values for y:
matrix y=(2)
maxentropy x, matrix(y) generate(p2)
matrix y=(3)
maxentropy x, matrix(y) generate(p3)
matrix y=(3.5)
maxentropy x, matrix(y) generate(p35)
matrix y=(5)
maxentropy x, matrix(y) generate(p5)
list p2 p3 p35 p4 p5, sep(10)
The impact of different prior information on the estimated probabilities is shown in
the following table:
. list p2 p3 p35 p4 p5, sep(10)
p2 p3 p35 p4 p5
1. .4781198 .2467824 .1666667 .1030653 .0205324
2. .254752 .2072401 .1666667 .1227305 .0385354
3. .135737 .1740337 .1666667 .146148 .0723234
4. .0723234 .146148 .1666667 .1740337 .135737
5. .0385354 .1227305 .1666667 .2072401 .2547519
6. .0205324 .1030652 .1666667 .2467824 .4781198
Note in particular that when we set y = 3.5, the command returns the uniform
discrete distribution with p_i = 1/6.
We can see the impact of adding in a second constraint by considering the same
problem given the population moments

y = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} 3.5 \\ \sigma^2 \end{pmatrix}
for different values of \sigma^2. By definition in this case, \sigma^2 = \sum_{i=1}^{6} p_i (x_i - 3.5)^2. We
can therefore create the values (x_i - 3.5)^2 and consider which probability distribution
p = (p_1, p_2, \ldots, p_6) will generate both a mean of 3.5 and a given value of \sigma^2. The code
to run this is
generate dev2=(x-3.5)^2
matrix y=(3.5 \ (2.5^2/3+1.5^2/3+0.5^2/3))
maxentropy x dev2, matrix(y) generate(pv)
matrix y=(3.5 \ 1)
maxentropy x dev2, matrix(y) generate(pv1)
matrix y=(3.5 \ 2)
maxentropy x dev2, matrix(y) generate(pv2)
matrix y=(3.5 \ 3)
maxentropy x dev2, matrix(y) generate(pv3)
matrix y=(3.5 \ 4)
maxentropy x dev2, matrix(y) generate(pv4)
matrix y=(3.5 \ 5)
maxentropy x dev2, matrix(y) generate(pv5)
matrix y=(3.5 \ 6)
maxentropy x dev2, matrix(y) generate(pv6)
with the following final result:
. list pv1 pv2 pv pv3 pv4 pv5 pv6, sep(10) noobs
pv1 pv2 pv pv3 pv4 pv5 pv6
.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601
.1316041 .1719114 .1666667 .1651027 .1358892 .0896692 .0234196
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.1316041 .1719113 .1666667 .1651026 .1358892 .0896692 .0234196
.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601
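Analogously to the check performed earlier for the mean, the second-moment constraint behind any column of this table can be verified directly; for example, for pv2, which was fit with \sigma^2 = 2:

. generate dev2pv2 = dev2*pv2
. quietly summarize dev2pv2
. display r(sum)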
The probabilities behave as we would expect: in the case where \sigma^2 = 35/12, we get
the uniform distribution. With variances smaller than this, the probability distribution
puts more emphasis on the values 3 and 4, while with higher variances the distribution
becomes bimodal with greater probability being attached to extreme values. This output
does not reveal that in all cases the \lambda_1 estimate is basically zero. The reason for this
is that with a symmetrical distribution of x_i values around the population mean, the
mean is no longer informative and all the information about the distribution of p derives
from the second constraint. If we force p_4 = p_5 = 0 so that the distribution is no longer
symmetrical, the first constraint becomes informative, as shown in this output:
. maxentropy x dev2 if x!=5&x!=4, matrix(y) generate(p5, replace)
Cross entropy estimates
Variable lambda
x .0119916
dev2 .59568007
p values returned in p5
constraints given in matrix y
. list x p5 if e(sample), noobs
x p5
1 .4578909
2 .0427728
3 .0131515
6 .4861848
This example shows how to overwrite an existing variable and demonstrates that the
command allows if and in qualifiers. It also shows how to use the e(sample) function.
4.2 Calibrating a survey
The basic point of calibration is to adjust the sampling weights so that the marginal
totals in dierent categories correspond to the population totals. Typically, the ad-
justments are made on demographic (for example, age and gender) and spatial vari-
ables. Early approaches included iterative raking procedures (Deming and Stephan
1940). These were generalized in the CALMAR routines described in Deville and Särndal
(1992). The idea of using a minimum information loss criterion for this purpose is not
original (see, for instance, Merz and Stolze [2008]), although it does not seem to have
been appreciated that the procedure leads to identical estimates as iterative raking-ratio
adjustments, if those adjustments are iterated to convergence.
The major advantage of using the cross-entropy approach rather than raking is that it
becomes straightforward to incorporate constraints that do not include marginal totals.
In many household surveys, for instance, it is plausible that mismatches between the
sample and the population arise due to differential success in sampling household types
rather than in enumerating individuals within households. Under these conditions, it
makes sense to require that all raising weights within a household be identical. I give
an example below that shows how cross-entropy estimation with such a constraint can
be feasibly implemented.
These capacities also exist within other calibration macros and commands. The
advantage of the maxentropy command is that it can do so within Stataand it is
fairly easy and quick to use.
To demonstrate these possibilities, we load example1.dta, which contains a hypo-
thetical survey with a set of prior weights. The sum of these weights by stratum and
gender is given in table 1, where we have also indicated the population totals to which
the weights should gross.
Table 1. Sum of weights from example1.dta by stratum and gender, with the gross
population totals required

                      gender
  stratum          0        1    Margin   Required
  0              100      400       500       1600
  1              300      200       500        400
  Margin         400      600      1000
  Required      1200      800                  2000
The weights can be adjusted to these totals by using the downloadable survwgt
command. To use the maxentropy command, we need to convert the desired constraints
from population totals into population means. That is straightforward because

N = \sum_{i=1}^{n} w_i \qquad (12)

N_{gender=0} = \sum_{i=1}^{n} w_i \, 1(gender = 0) \qquad (13)

where 1(gender = 0) is the indicator function. So dividing everything by N, the popu-
lation total, we get a set of constraints that look identical to those used earlier:

1 = \sum_{i=1}^{n} \frac{w_i}{N} = \sum_{i=1}^{n} p_i

\Pr(gender = 0) = \sum_{i=1}^{n} \frac{w_i}{N} \, 1(gender = 0) = \sum_{i=1}^{n} p_i \, 1(gender = 0)

We could obviously add a condition for the proportion where gender = 1, but because
of the adding-up constraint, that would be redundant. If we have k categories for a
particular variable, we can only use k - 1 constraints in our estimation.
In this particular example, the constraint vector is contained in the constraint
variable. The syntax of the command in this case is
maxentropy constraint stratum gender, generate(wt3) prior(weight) total(2000)
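The constraint variable holds the target population means (here, proportions) in its first observations, in the same order as the variables listed after it. Under the totals in table 1 it could be built with something like the following sketch (example1.dta already ships with such a variable; the values shown are the implied proportions):

. generate constraint = .
. replace constraint = 400/2000 in 1     // target proportion with stratum==1
. replace constraint = 800/2000 in 2     // target proportion with gender==1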
We did not specify a matrix, so the first variable is interpreted as the constraint
vector. We did specify a prior weight and asked Stata to convert the calculated proba-
bilities to raising weights by multiplying them by 2,000. A comparison with the raked
weights confirms them to be identical in this case.
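The calibrated totals can also be checked by summing the new weights within cells; a quick sketch using the wt3 variable just created (the sums should reproduce the required totals in table 1):

. table stratum gender, contents(sum wt3) row col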
We can check whether the constraints were correctly rendered by retrieving the
constraint matrix used in the estimation:
. matrix C=e(constraint)
. matrix list C
C[2,1]
c1
stratum .2
gender .40000001
We see that E(stratum) = 0.2 and E(gender) = 0.4. Means of dummy variables
are, of course, just population proportions; that is, the proportion in stratum = 1 is
0.2 and the proportion where gender = 1 is 0.4.
4.3 Imposing constant weights within households
In most household surveys, the household is the unit that is sampled and the individuals
are enumerated within it. Consequently, the probability of including an individual
conditional on the household being selected is 1. This suggests that the weight attached
to every individual within a household should be equal. We can impose this restriction
with a fairly simple ploy. We rewrite constraint (12) by first summing over individuals
within the household (hhsize) and then summing over households as

N = \sum_i w_{ih} = \sum_h hhsize_h \, w_h

that is,

N = \sum_h w^*_h

where w_{ih} is the weight of individual i within household h, equal to the common weight
w_h. This constraint can again be written in the form of probabilities as

1 = \sum_h \frac{w^*_h}{N}

that is,

1 = \sum_h p^*_h
Consider now any other constraint involving individual aggregates [for example, (13)]

N_x = \sum_{i=1}^{n} w_i x_i = \sum_i w_{ih} x_{ih} = \sum_h w_h \left(\sum_i x_{ih}\right)

\frac{N_x}{N} = \sum_h \frac{w_h \, hhsize_h}{N} \, \frac{\sum_i x_{ih}}{hhsize_h}

Consequently,

E(x) = \sum_h p^*_h \, m_{xh} \qquad (14)

The term m_{xh} is just the mean of the x variable within household h.
If the prior weight q_h is similarly constant within households (as it should be if it is
a design weight), then we similarly create a new variable

q^*_h = hhsize_h \, q_h
We can then write the cross-entropy objective function as

I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = \sum_i p_{ih} \ln\left(\frac{p_{ih}}{q_{ih}}\right)
        = \sum_i p_h \ln\left(\frac{p_h \, hhsize_h}{q_h \, hhsize_h}\right)
        = \sum_i p_h \ln\left(\frac{p^*_h}{q^*_h}\right)
        = \sum_h hhsize_h \, p_h \ln\left(\frac{p^*_h}{q^*_h}\right)
        = \sum_h p^*_h \ln\left(\frac{p^*_h}{q^*_h}\right)

In short, the objective function evaluated over all individuals and imposing the con-
straint p_{ih} = p_h for all i is identical to the objective function evaluated over house-
holds where the probabilities have been adjusted to p^*_h and q^*_h. We therefore run the
maxentropy command on a household-level file, with the population constraints given
by (14). Our cross-entropy estimates can then be retrieved as

p_h = \frac{p^*_h}{hhsize_h}
We can check that the weights obtained in this way do, in fact, obey all the
restrictions: they are obviously constant within household, and when added up over
the individuals, they reproduce the required totals.
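A sketch of this household-level workflow in Stata, using hypothetical variable names (hhid identifies households, x is an individual-level covariate, q is the design weight, and the population mean 0.4 is invented purely for illustration):

. bysort hhid: generate hhsize = _N
. collapse (mean) x q hhsize, by(hhid)
. generate qstar = hhsize*q
. matrix y = (.4)
. maxentropy x, matrix(y) prior(qstar) generate(pstar)
. generate p_h = pstar/hhsize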
4.4 Calibrating the South African National Income Dynamics Survey
To assess the performance of the maxentropy command on a more realistic problem, we
consider the problem of calibrating South Africa's National Income Dynamics Survey.
This was a nationally representative sample of around 7,300 households and around
30,000 individuals. From the sampling design, a set of design weights were calculated,
but application of these weights to the realized sample led to a severe undercount when
compared with the official population estimates.
The calibration was to be done to reproduce the nine provincial population counts
and 136 age × sex × race cell totals. One practical difficulty that was immediately
encountered was how to treat individuals where age, sex, or race information was miss-
ing, because this category does not exist in the national estimates. It was decided to
keep the relative weights of the missing observations constant through the calibration,
creating a 137th age, sex, and race category. From each group of dummy variables, one
category had to be omitted, creating altogether 144 (or 8 + 136) constraints.
hhcollapsed.dta contains household-level means of all these variables plus the
household design weights. The code to create cross-entropy weights that are constant
within households is given by the following:
use hhcollapsed
maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)
replace hw=hw/hhsize
matrix list e(constraint)
With 144 constraints and 7,305 observations, the command took 18 seconds to cal-
culate the new weights on a standard desktop computer.
In this context, the estimates prove informative. The output of the command is
. maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)
Cross entropy estimates
Variable lambda
P1 -.15945276
P2 .00735986
P3 .14000206
(output omitted )
IMa75 15.402056
IMa80 8.6501559
IFa_0 -7.0753612
IFa_5 2.3584972
(output omitted )
IFa75 -9.2778495
IFa80 14.142518
(output omitted )
WFa70 .05009103
WFa75 .90961156
WFa80 4.6868009
p values returned in hw
constraints given in variable constraints
The huge coefficients for old Indian males and old Indian females suggest that the
population constraints affected the weights for these categories substantially. Given the
large number of constraints, mistakes are possible. The easiest way to check that the
command has worked correctly is to add up the weights within categories and to check
that they add up to the intended totals. Listing the constraint matrix used by the
command is also a useful check. In this case, the labeling of the rows does help:
. matrix list e(constraint)
e(constraint)[144,1]
c1
P1 .10803039
P2 .13514069
P3 .02320805
P4 .05914972
P5 .20764017
P6 .07044568
P7 .21462312
P8 .07373177
AMa_0 .04486157
AMa_5 .04584822
(output omitted )
WFa75 .0012318
WFa80 .00147087
The first eight constraints are the province proportions followed by the proportions
in the age, sex, and race cells.
5 Conclusion
This article introduced the power of maximum entropy and minimum cross-entropy
estimation. The maxentropy command uses Stata's powerful maximum-likelihood esti-
mation routines to provide fast estimates of even complicated problems. I have shown
how the command can be used to calibrate a survey to a set of known population totals
while imposing restrictions like constant weights within households.
6 References
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sample fre-
quency table when the expected marginal totals are known. Annals of Mathematical
Statistics 11: 427–444.
Deville, J.-C., and C.-E. Särndal. 1992. Calibration estimators in survey sampling.
Journal of the American Statistical Association 87: 376–382.
Deville, J.-C., C.-E. Särndal, and O. Sautory. 1993. Generalized raking procedures in
survey sampling. Journal of the American Statistical Association 88: 1013–1020.
Golan, A., G. G. Judge, and D. Miller. 1996. Maximum Entropy Econometrics: Robust
Estimation with Limited Data. Chichester, UK: Wiley.
Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review 106:
620–630.
Merz, J., and H. Stolze. 2008. Representative time use data and new harmonised cali-
bration of the American Heritage Time Use Data 1965–1999. electronic International
Journal of Time Use Research 5: 90–126.
Mittelhammer, R. C., G. G. Judge, and D. J. Miller. 2000. Econometric Foundations.
Cambridge: Cambridge University Press.
About the author
Martin Wittenberg teaches core econometrics and microeconometrics to graduate students in
the Economics Department at the University of Cape Town.
The Stata Journal (2010)
10, Number 3, pp. 331338
bacon: An effective way to detect outliers in
multivariate data using Stata (and Mata)
Sylvain Weber
University of Geneva
Department of Economics
Geneva, Switzerland
sylvain.weber@unige.ch
Abstract. Identifying outliers in multivariate data is computationally intensive.
The bacon command, presented in this article, allows one to quickly identify out-
liers, even on large datasets of tens of thousands of observations. bacon constitutes
an attractive alternative to hadimvo, the only other command available in Stata
for the detection of outliers.
Keywords: st0197, bacon, hadimvo, outliers detection, multivariate outliers
1 Introduction
The literature on outliers is abundant, as proved by Barnett and Lewis's (1994) bibli-
ography of almost 1,000 articles. Despite this considerable research by the statistical
community, knowledge apparently fails to spill over, so proper methods for detecting
and handling outliers are seldom used by practitioners in other fields.
The reason is likely that algorithms implemented for the detection of outliers are
sparse. Moreover, the few algorithms available are so time-consuming that using them
may be discouraging. Until now, hadimvo was the only command in Stata available for
identifying outliers. Anyone who has tried to use hadimvo on large datasets, however,
knows it may take hours or even days to obtain a mere dummy variable indicating which
observations should be considered as outliers.
The new bacon command, presented in this article, provides a more efficient way
to detect outliers in multivariate data. It is named for the blocked adaptive computa-
tionally efficient outlier nominators (BACON) algorithm proposed by Billor, Hadi, and
Velleman (2000). bacon is a simple modification of the methodology proposed by Hadi
(1992, 1994) and implemented in hadimvo, but bacon is much less computationally in-
tensive. As a result, bacon runs many times faster than hadimvo, even though both
commands end up with similar sets of outliers. Identifying multivariate outliers thus
becomes fast and easy in Stata, even with large datasets of tens of thousands of obser-
vations.
2 The BACON algorithm
The BACON algorithm was proposed by Billor, Hadi, and Velleman (2000). The reader
who is interested in details is referred to that original article, because only a brief
presentation is provided here.
In step 1, an initial subset of m outlier-free observations has to be identified out of
a sample of n observations and over p variables. Any of several distance measures could
be used as a criterion, and the Mahalanobis distance seems especially well adapted.
It possesses the desirable property of being scale-invariant, a great advantage when
dealing with variables of different magnitudes or with different units. The Mahalanobis
distance of a p-dimensional vector x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T from a group of values with
mean \bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^T and covariance matrix S is defined as

d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^T S^{-1} (x_i - \bar{x})}, \quad i = 1, 2, \ldots, n
The initial basic subset is given by the m observations with the smallest Mahalanobis
distances from the whole sample. The subset size m is given by the product of the
number of variables p and a parameter chosen by the analyst.
Billor, Hadi, and Velleman (2000) also proposed using distances from the medians
for this first step. This second version of the algorithm is also implemented in bacon.
Distances from the medians are not scale-invariant, so they should be used carefully if
the variables analyzed are of different magnitudes.
In step 2, Mahalanobis distances from the basic subset are computed:

d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^T S_b^{-1} (x_i - \bar{x}_b)}, \quad i = 1, 2, \ldots, n \qquad (1)

In step 3, all observations with a distance smaller than some threshold (a corrected
percentile of a \chi^2 distribution) are added to the basic subset.
Steps 2 and 3 are iterated until the basic subset no longer changes. Observations
excluded from the final basic subset are nominated as outliers, whereas those inside the
final basic subset are nonoutliers.
The difference in the algorithm proposed by Hadi (1992, 1994) is that observa-
tions are added by blocks in the basic subset instead of observation by observation.
Thus some time is spared through a reduction of the number of iterations. Neverthe-
less, it is important to note that the performance of the algorithm is not altered, as
Billor, Hadi, and Velleman (2000) and section 5 of this article show.
The reduction in the number of iterations is not the only source of efficiency gain.
Another major improvement lies in the way bacon is coded. When hadimvo was im-
plemented, Mata did not exist. Now, though, Mata provides significant speed enhance-
ments to many computationally intensive tasks, like the calculation of Mahalanobis
distances. I therefore coded bacon so that it benefits from Mata's power.
3 Why Mata matters for bacon
The bacon command uses Mata, the matrix programming language available in Stata
since version 9. I explain here how Mata allows bacon to run very fast. This section
draws heavily on Baum (2008), who offers a general overview of Mata's capabilities.
The BACON algorithm requires creating matrices from data, computing the distances
using (1), and converting the new matrix containing the distances back into the data. Op-
erations that convert Stata variables into matrices (or vice versa) require at least twice
the memory needed for that set of variables, so it stands to reason that using Stata's
matrices would consume a lot of memory. On the other hand, Mata's matrices are only
views of, not copies of, data. Hence, using Mata's virtual matrices instead of Stata's
matrices in bacon spares memory that can be used to run the computations faster.
Moreover, Stata's matrices are unsuited for holding large amounts of data, their
maximal size being 11,000 × 11,000. Using Stata, it would not be possible to create
a matrix X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T containing all observations of the database if
the n were larger than 11,000. One would thus have to cut the X matrix into pieces
to compute the distances in (1), which is obviously inconvenient. Mata circumvents
the limitations of Stata's traditional matrix commands, thus allowing the creation of
virtually infinite matrices (over 2 billion rows and columns). Thanks to Mata, I am
thus able to create a single matrix X containing all observations to whatever n. I then
use the powerful element-by-element operations available to compute the distances.
Mata is indeed efficient for handling element-by-element operations, whereas Stata
ado-file code written in the matrix language with explicit subscript references is slow.
Because the distances in (1) have to be computed for each individual at each iteration
of the algorithm, this feature of Mata provides another important efficiency gain.
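To illustrate the kind of computation involved, the distances in (1) can be obtained in Mata with views on the data and element-by-element operations. This is a hedged sketch, not the actual bacon internals; basic is a hypothetical 0/1 Stata variable flagging the current basic subset:

mata:
    X  = st_data(., ("weight", "length"))           // all observations
    Xb = st_data(., ("weight", "length"), "basic")   // basic subset only
    mb = mean(Xb)                                    // subset means
    Sb = variance(Xb)                                // subset covariance matrix
    D  = X :- mb                                     // deviations, row by row
    d  = sqrt(rowsum((D*invsym(Sb)) :* D))           // Mahalanobis distances
    st_store(., st_addvar("double", "dist"), d)      // back into the dataset
end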
4 The bacon command
4.1 Syntax
The syntax of bacon is as follows:
bacon varlist [if] [in], generate(newvar1 [newvar2]) [replace
    percentile(#) version(1 | 2) c(#)]
4.2 Options
generate(newvar1 [newvar2]) is required; it identifies the new variable(s) that will be
created. Whether you specify two variables or one, however, is optional. newvar2, if
specified, will contain the distances from the final basic subset. That is, specifying
generate(out) creates a dummy variable out containing 1 if the observation is
an outlier in the BACON sense and 0 otherwise. Specifying generate(out dist)
additionally creates a variable dist containing the distances from the final basic
subset.
replace specifies that the variables newvar1 and newvar2 be replaced if they already
exist in the database. This option makes it easier to run bacon several times on
the same data. It should be used cautiously because it might definitively drop some
data.
percentile(#) determines the 1 - # percentile of the chi-squared distribution to be
used as a threshold to separate outliers from nonoutliers. A larger # identifies a
larger proportion of the sample as outliers. The default is percentile(0.15). If #
is specified greater than 1, it is interpreted as a percent; thus percentile(15) is
the same as percentile(0.15).
version(1 | 2) specifies which version of the BACON algorithm must be used to identify
the initial basic subset in multivariate data. version(1), the default, identifies the
initial subset selected based on Mahalanobis distances. version(2) identifies the ini-
tial subset selected based on distances from the medians. In the case of version(2),
varlist must not contain missing values, and you must install the moremata package
before running bacon.
c(#) is the parameter that determines the size of the initial basic subset, which is given
by the product of # and the number of variables in varlist. # must be an integer.
c(4) is used by default as proposed by Billor, Hadi, and Velleman (2000, 285).
4.3 Saved results
bacon saves the following results in r():
Scalars
  r(outlier)   number of outliers      r(iter)   number of iterations
  r(corr)      correction factor       r(chi2)   percentile of the \chi^2 distribution
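For example, the saved results can be inspected immediately after a run; a minimal illustration reusing the syntax documented above:

. bacon weight length, generate(out) percentile(0.15)
. display r(outlier) " outliers nominated in " r(iter) " iterations"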
5 bacon versus hadimvo
Let us now compare bacon and hadimvo considering two criteria: i) the set of observa-
tions identified as outliers and ii) the speed. We will see that both commands lead to
similar outcomes (providing some tuning of the cutoff parameters) but that hadimvo is
much slower. bacon thus outperforms hadimvo and should be preferred in any case.
First, let us use auto.dta to illustrate the similarity of the results obtained through
both commands:
. webuse auto
(1978 Automobile Data)
. hadimvo weight length, generate(outhadi) percentile(0.05)
(output omitted )
. bacon weight length, generate(outbacon) percentile(0.15)
(output omitted )
. tabulate outhadi outbacon
Hadi
outlier BACON outlier (p=.15)
(p=.05) 0 1 Total
0 72 0 72
1 0 2 2
Total 72 2 74
Both commands have identified the same two observations as outliers. The param-
eter (in the percentile() option) was set higher in bacon than in hadimvo. With a
parameter of 5%, bacon would not have identified any observation as an outlier. It is
the role of the researcher to choose the parameter that is best adapted for each dataset,
but the default percentile(0.15) appears to bring sensible outcomes in any case and
could always be used as a first benchmark.
With two-dimensional data, it is helpful to draw a scatterplot such as figure 1 that
allows us to see where outliers are located:
. scatter weight length, ml(outbacon) ms(i) note("0 = nonoutlier, 1 = outlier")
[Figure 1. Scatterplot locating the observations identified as outliers: Weight (lbs.) plotted against Length (in.), with markers labeled 0 = nonoutlier, 1 = outlier]
To compare the speeds of bacon and hadimvo, let us now use a larger dataset.
Containing about 28,000 observations, nlswork.dta is sufficiently large to illustrate
the point. Suppose we want to identify outliers with respect to the variables ln_wage,
age, and tenure. If we did not have bacon, we would type
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2:
At this point, your screen will remain idle. You might become worried and think
your computer crashed, but in fact hadimvo is simply going to take some long minutes
to run its many iterations. Remember, there are only 28,000 observations in this
dataset. If you are patient enough, Stata will at last show you the outcome:
. hadimvo ln_wage age tenure, generate(outhadi) percentile(0.05)
Beginning number of observations: 28101
Initially accepted: 4
Expand to (n+k+1)/2: 14052
Expand, p = .05: 28081
Outliers remaining: 20
Thanks to bacon, you now have a faster alternative. If you type
. bacon ln_wage age tenure, generate(outbacon) percentile(0.15)
Total number of observations: 28101
BACON outliers (p = 0.15): 29
Non-outliers remaining: 28072
the solution appears in only a few seconds! Again we can check that the set of identified
outliers is pretty much the same in the two cases:
. tabulate outhadi outbacon
Hadi
outlier BACON outlier (p=.15)
(p=.05) 0 1 Total
0 28,072 9 28,081
1 0 20 20
Total 28,072 29 28,101
Given the time hadimvo needs and the similarities between the outcomes, it seems
clear that bacon is preferable.
Because there is no rule for the choice of percentile(), the practitioner might
legitimately be willing to test several values and decide after several trials which set of
observations to nominate as outliers. With hadimvo, such an iterative process is almost
impracticable, unless you are particularly patient and have enough time in front of
you. With bacon, on the other hand, completing the iterative process becomes readily
feasible.
bacon has a replace option precisely to give the possibility of running the algorithm
several times without having to add a new variable at each iteration. For the user
wanting to try several percentile() values, replace will prove convenient:
. bacon ln_wage age tenure, generate(outbacon) percentile(0.1)
outbacon already defined
r(110);
. bacon ln_wage age tenure, generate(outbacon) percentile(0.1) replace
Total number of observations: 28101
BACON outliers (p = 0.10): 6
Non-outliers remaining: 28095
. bacon ln_wage age tenure, generate(outbacon) percentile(0.2) replace
Total number of observations: 28101
BACON outliers (p = 0.20): 160
Non-outliers remaining: 27941
6 Conclusion
The two big questions about outliers are "how do you find them?" and "what do you
do about them?" (Ord 1996). The bacon command presented here provides an answer
to the first of these questions. The answer to the second is beyond the scope of this
article and is left to the consideration of the researcher.
No doubt, bacon renders the process of detecting outliers in multivariate data easier.
Compared with hadimvo, the only other command devoted to this task in Stata, bacon
appears to identify a similar set of observations as outliers. In terms of speed, bacon
proves to be far faster. Hence, there is no apparent reason to use hadimvo instead of
bacon.
Even though the bacon command provides a fast and easy way to identify potential
outliers, a certain amount of judgment is always needed when deciding which cases to
nominate as outliers and what to do with those observations. Most researchers simply
discard outliers, but before you do so, keep in mind that something new and useful can
often be learned by looking at the nominated cases.
7 References
Barnett, V., and T. Lewis. 1994. Outliers in Statistical Data. 3rd ed. Chichester, UK:
Wiley.
Baum, C. F. 2008. Using Mata to work more effectively with Stata: A tutorial. UK
Stata Users Group meeting proceedings.
http://ideas.repec.org/p/boc/usug08/11.html.
Billor, N., A. S. Hadi, and P. F. Velleman. 2000. BACON: Blocked adaptive computa-
tionally efficient outlier nominators. Computational Statistics & Data Analysis 34:
279–298.
Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal
Statistical Society, Series B 54: 761–771.
———. 1994. A modification of a method for the detection of outliers in multivariate
samples. Journal of the Royal Statistical Society, Series B 56: 393–396.
Ord, K. 1996. Review of Outliers in Statistical Data, 3rd ed., by V. Barnett and T.
Lewis. International Journal of Forecasting 12: 175–176.
About the author
Sylvain Weber is working as a teaching assistant in the Department of Economics at the
University of Geneva in Switzerland. He is pursuing a PhD in the field of human capital
depreciation, wage growth over the career, and job stability.
The Stata Journal (2010)
10, Number 3, pp. 339358
Comparing the predictive powers of survival
models using Harrell's C or Somers' D
Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
r.newson@imperial.ac.uk
Abstract. Medical researchers frequently make statements that one model pre-
dicts survival better than another, and they are frequently challenged to provide
rigorous statistical justification for those statements. Stata provides the estat
concordance command to calculate the rank parameters Harrell's C and Somers' D
as measures of the ordinal predictive power of a model. However, no confidence
limits or p-values are provided to compare the predictive power of distinct models.
The somersd package, downloadable from Statistical Software Components, can
provide such confidence intervals, but they should not be taken seriously if they are
calculated in the dataset in which the model was fit. Methods are demonstrated
for fitting alternative models to a training set of data, and then measuring and
comparing their predictive powers by using out-of-sample prediction and somersd
in a test set to produce statistically sensible confidence intervals and p-values for
the differences between the predictive powers of different models.
Keywords: st0198, somersd, stcox, estat concordance, streg, predict, survival,
model validation, prediction, concordance, rank methods, Harrell's C, Somers' D
1 Introduction
Harrell's C and the equivalent parameter Somers' D were proposed as measures of
the general predictive power of a general regression model by Harrell et al. (1982) and
Harrell, Lee, and Mark (1996), who focused attention on the case of a survival model
with a possibly right-censored outcome, which was interpreted as a lifetime. In the case
of a Cox proportional hazards regression model, both parameters are output by the
Stata postestimation command estat concordance (see [ST] stcox postestimation).¹
However, because Harrell's C and Somers' D are rank parameters, they are equally valid
as measures of the predictive power of any model in which the scalar outcome Y is at
least ordinal (with or without censorship), and in which the conditional distribution
of the outcome, given the predictor variables, is governed by a scalar function of the
predictor variables and the parameters, such as the hazard ratio in a Cox regression or
the linear predictor in a generalized linear model. If the assumptions of the model are
true, then such a scalar predictive score plays the role of a balancing score as defined
by Rosenbaum and Rubin (1983).
1. As of Stata 11.1, estat concordance provides two concordance measures: Harrell's C and Gönen
and Heller's K. Harrell's C is computed by default or if harrell is specified.
Harrell's C and Somers' D are members of the Kendall family of rank parameters.
The family history can be summarized as follows: Kendall's \tau_a begat Somers' D begat
Theil-Sen percentile slopes. This family is implemented in Stata by using the somersd
package, which can be downloaded from Statistical Software Components. An overview
of the parameter family is given in Newson (2002), and the methods and formulas are
given in detail in Newson (2006a,b,c).
Parameters in this family are defined by assuming the existence of a population of
bivariate data pairs of the form (X_i, Y_i) and a sampling scheme for sampling pairs of
pairs \{(X_i, Y_i), (X_j, Y_j)\} from that population. A pair of pairs is said to be concordant
if the larger of the X values is paired with the larger of the Y values, and a pair is
said to be discordant if the larger of the X values is paired with the smaller of the
Y values. Kendall's \tau_a is the difference between the probability of concordance and
the probability of discordance. Somers' D(X|Y) is the difference between the cor-
responding conditional probabilities, assuming that the two Y values can be ordered.
Harrell's C(X|Y) is defined as \{D(X|Y) + 1\}/2 and is equal to the conditional proba-
bility of concordance plus half the conditional probability that the data pairs are neither
concordant nor discordant, assuming that the two Y values can be ordered. In the case
where Y is an outcome to be predicted by a multivariate model with a scalar predictive
score, there is an underlying population of multivariate data points (Y_i, V_{i1}, \ldots, V_{ik}),
where the V_{ih} are predictive covariates and the role of the X_i is played by the scalar
predictive score \eta(V_{i1}, \ldots, V_{ik}). In this case, the Somers' D and Harrell's C parameters
can be denoted as D\{\eta(V_1, \ldots, V_k) | Y\} and C\{\eta(V_1, \ldots, V_k) | Y\}, respectively. If the
model is a survival model, then the Y values are lifetimes, and there is the possibility
that one or both of a pair of Y values may be censored, which sometimes implies that
they cannot be ordered.
they cannot be ordered.
We often want to compare the predictive powers of alternative predictors of the same
outcome Y. Newson (2002, 2006b) argues that if there is an underlying population of
trivariate data points (W_i, X_i, Y_i) and if any positive association between the Y_i and
the X_i is caused by a positive association of both of these variables with the W_i, then
we must have the inequality D(X|Y) − D(W|Y) ≤ 0 or, equivalently,
C(X|Y) − C(W|Y) = {D(X|Y) − D(W|Y)}/2 ≤ 0. This inequality still holds if the Y variable
may be censored, but not if the W or X variable may be censored. This implies that if
we have multiple alternative positive predictors of the same outcome, such as alternative
predictive scores from alternative multivariate models, then it may be useful to calculate
confidence intervals for the differences between the Somers' D or Harrell's C parameters
of these predictors with respect to the outcome, and then to make statements regarding
which predictors may or may not be secondary to which other predictors. In Stata,
this can be done by using lincom after the somersd command, as demonstrated in
section 4.1 of Newson (2002).
Medical researchers frequently make statements that one model predicts survival
better than another. Statistical referees acting for medical journals frequently challenge
the researchers to provide rigorous statistical justification for these statements. The
Stata postestimation command estat concordance provides estimates of Harrell's C
and Somers' D but provides no confidence limits for these, nor any confidence limits or
p-values for the differences between the values of these rank parameters from different
models. This is the case for good reason: confidence-interval formulas do not protect the
user against finding a model in the same data in which its parameters are then estimated.
Used sequentially, the somersd and lincom commands provide confidence limits and p-
values for differences between the Somers' D or Harrell's C parameters between different
predictors. However, not all medical researchers know how to calculate a confidence
interval (CI) when the predictors are scalar predictive scores from models, and fewer
still know how to do so in such a way that the confidence limits can be taken seriously.
In this article, I aim to explain how medical researchers can calculate CIs and preempt
possible queries that may arise in the process.
The remainder of this article is divided into four sections. Section 2 addresses
the queries that commonly arise when users try to duplicate the results of estat
concordance using somersd. Section 3 describes the method of splitting the data into
a training set (to which models are fit) and a test set (in which their predictive powers
are measured). Section 4 describes the extension to non-Cox survival models, such as
those described in [ST] streg. Finally, section 5 briefly explains how the methods can
be extended even further.
2 The Cox model: somersd versus estat concordance
I will demonstrate the principles using the Cox proportional hazards model, which is
implemented in Stata using the stcox command (see [ST] stcox). I also use the Stanford
drug-trial dataset, which is used for the examples in [ST] stcox postestimation.
Before I raise the issue of confidence limits, we need to see how somersd can pro-
duce the same estimates as estat concordance. This is done using predict after the
survival estimation command to define the predictive score, and then using somersd to
measure the association of the predictive score with the lifetime. Users who attempt
to use somersd to duplicate the estimates of estat concordance may face confusion
caused by these three issues:
1. The predict command, used after stcox, by default produces a negative predic-
tion score, in contrast to the positive prediction score produced by using predict
after most estimation commands.
2. The default coding of a censorship status variable for stcox is different from the
coding of a censorship status variable for somersd.
3. The treatment of tied failure times by estat concordance is different from that
used by somersd.
There are solutions to all of these problems, and I will demonstrate them, enabling
users to use somersd and estat concordance as checks on one another.
Let's start the demonstration by inputting the Stanford drug-trial data, fitting a
Cox model, and calling estat concordance:
. use http://www.stata-press.com/data/r11/drugtr
(Patient Survival in Drug Trial)
. stset
-> stset studytime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, studytime]
exit on or before: failure
48 total obs.
0 exclusions
48 obs. remaining, representing
31 failures in single record/single failure data
744 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 39
. stcox drug age
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -99.911448
Iteration 1: log likelihood = -83.551879
Iteration 2: log likelihood = -83.324009
Iteration 3: log likelihood = -83.323546
Refining estimates:
Iteration 0: log likelihood = -83.323546
Cox regression -- Breslow method for ties
No. of subjects = 48 Number of obs = 48
No. of failures = 31
Time at risk = 744
LR chi2(2) = 33.18
Log likelihood = -83.323546 Prob > chi2 = 0.0000
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .1048772 .0477017 -4.96 0.000 .0430057 .2557622
age 1.120325 .0417711 3.05 0.002 1.041375 1.20526
. estat concordance
Harrell's C concordance statistic
failure _d: died
analysis time _t: studytime
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell's C = (E + T/2) / P = .8086
Somers' D = .6172
The stset command shows us that the input dataset has already been set up as
a survival-time dataset that includes one observation per drug-trial subject as well as
data on survival time and termination modes, among other things (see [ST] stset).
The Cox model contains two predictive covariates, age (subject age in years) and drug
(indicating treatment group, with a value of 0 for placebo and a value of 1 for the
drug being tested). We then see that, according to estat concordance, Harrell's C is
0.8086 and Somers' D is 0.6172. The Somers' D implies that when one of two subjects is
observed to survive another, the model predicts that the survivor is 61.72% more likely
to have a lower hazard ratio than the nonsurvivor. The Harrell's C is the probability
that the survivor has the lower hazard ratio plus half the (possibly negligible) probability
that the two subjects have equal hazard ratios, and this sum is 80.86% on a percentage
scale.
We will now see how to duplicate these estimates by using predict and somersd.
We start by defining a negative predictor of lifetime by using predict to calculate a
hazard ratio. We then derive an inverse hazard ratio, which we expect to be a positive
predictor of lifetime:
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr
This strategy addresses the first of the three sources of confusion mentioned before.
Addressing the second source of confusion, we need to define a censorship indicator
for input to the somersd command. The somersd command has a cenind() option
that requires a list of censorship indicators. These censorship indicators are allocated
one-to-one to the corresponding variables of the variable list input to somersd and must
be either variable names or zeros (implying a censorship indicator variable whose values
are all zero). Censorship indicator variables for somersd are positive in observations
where the corresponding input variable value is right-censored (or known to be equal to
or greater than its stated value), are negative in observations where the corresponding
input variable value is left-censored (or known to be equal to or less than its stated
value), and are zero in observations where the corresponding input variable value is
uncensored (or known to be equal to its stated value). If the list of censorship indicators
is shorter than the input variable list, then the list of censorship indicators is extended
on the right with zeros, implying that the variables without censorship indicators are
uncensored.
This coding scheme is not the same as that for the censorship indicator variable _d
that is created by the stset command, which is 1 in observations where the correspond-
ing lifetime is uncensored and is 0 in observations where the corresponding lifetime is
right-censored.
To convert an stset censorship indicator variable to a somersd censorship indicator
variable, we use the command
. generate censind=1-_d if _st==1
This command creates a new variable, censind, which assumes the following values:
missing in observations excluded from the survival sample, as indicated by the variable
_st created by stset; 1 in observations with right-censored lifetimes (where _d is 0);
and 0 in observations with uncensored lifetimes (where _d is 1).
We can now use somersd to calculate Harrell's C and Somers' D, using the transf(c)
option for Harrell's C and the transf(z) option (indicating the normalizing and variance-
stabilizing Fisher's z or hyperbolic arctangent transformation) for Somers' D:
. somersd _t invhr if _st==1, cenind(censind) tdist transf(c)
Somers' D with variable: _t
Transformation: Harrell's c
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for Harrell's c
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr .8106332 .0423076 19.16 0.000 .7255213 .8957451
. somersd _t invhr if _st==1, cenind(censind) tdist transf(z)
Somers' D with variable: _t
Transformation: Fisher's z
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for transformed Somers' D
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr .7270649 .1378034 5.28 0.000 .4498402 1.00429
Asymmetric 95% CI for untransformed Somers' D
Somers_D Minimum Maximum
invhr .62126643 .42176765 .76338983
In both cases, we use the survival-time variable _t, the survival sample indicator
_st (created by stset), and the inverse hazard rate invhr (created using predict)
to estimate rank parameters of the inverse hazard ratio with respect to survival time
(censored by censorship status). In the case of Harrell's C, the estimated parameter
is on a scale from 0 to 1 and is expected to be at least 0.5 for a positive predictor of
lifetime, such as an inverse hazard ratio. In the case of Somers' D, the untransformed
parameter is on a scale from −1 to 1 and is expected to be at least 0 for a positive
predictor of lifetime.
However, we now encounter the third source of confusion mentioned before. If we
compare the estimates here to those produced earlier by estat concordance, we find
that the estimates for Harrell's C and Somers' D are similar but not exactly the same.
The estimates are 0.8106 and 0.6213, respectively, when computed by somersd, and
0.8086 and 0.6172, respectively, when computed by estat concordance. The reason
for this discrepancy is that somersd and estat concordance have different policies
for comparing two lifetimes that terminate simultaneously when one lifetime is right-
censored and the other is uncensored. The estat concordance policy assumes that
the owner of the right-censored lifetime survived the owner of the uncensored lifetime,
whereas the somersd policy assumes that neither of the two owners can be said to have
survived the other. In the case of a drug trial, one subject might be known to have
died in a certain month, whereas another might be known to have left the country in
the same month and has therefore become lost to follow-up. The estat concordance
policy assumes that the second subject must have survived the first, which might be
probable, given that this second subject seems to have been in a fit state to travel out
of the country. The somersd policy, more cautiously, allows the possibility that the
second subject may have left the country early in the month and died unexpectedly of
a venous thromboembolism on the outbound plane, whereas the first subject may have
died under observation of the trial organizers later in the same month.
Whatever the merits of the two policies, we might still like to show that somersd
and estat concordance can be made to duplicate one another's estimates. This can
easily be done if lifetimes are expressed as whole numbers of time units, as they are
in the Stanford drug trial data, where lifetimes are expressed in months. In this case,
we can add half a unit to right-censored lifetimes only. As a result, right-censored
lifetimes become greater than uncensored lifetimes terminating within the same time
unit without affecting any other orderings of lifetimes.
In our example, we do this by generating a new lifetime variable, studytime2, that
is equal to the modified survival time. We then use stset to reset the various survival-
time variables and characteristics so that the modified survival time is now used. This
step is done after using the assert command to check that the old studytime variable
is indeed integer-valued; see [D] assert and [D] functions. We then proceed as in the
previous example:
. use http://www.stata-press.com/data/r11/drugtr, clear
(Patient Survival in Drug Trial)
. assert studytime==int(studytime)
. generate studytime2=studytime+0.5*(died==0)
. stset studytime2, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, studytime2]
exit on or before: failure
48 total obs.
0 exclusions
48 obs. remaining, representing
31 failures in single record/single failure data
752.5 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 39.5
. stcox drug age
failure _d: died
analysis time _t: studytime2
Iteration 0: log likelihood = -99.911448
Iteration 1: log likelihood = -83.551879
Iteration 2: log likelihood = -83.324009
Iteration 3: log likelihood = -83.323546
Refining estimates:
Iteration 0: log likelihood = -83.323546
Cox regression -- Breslow method for ties
No. of subjects = 48 Number of obs = 48
No. of failures = 31
Time at risk = 752.5
LR chi2(2) = 33.18
Log likelihood = -83.323546 Prob > chi2 = 0.0000
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .1048772 .0477017 -4.96 0.000 .0430057 .2557622
age 1.120325 .0417711 3.05 0.002 1.041375 1.20526
. estat concordance
Harrell's C concordance statistic
failure _d: died
analysis time _t: studytime2
Number of subjects (N) = 48
Number of comparison pairs (P) = 849
Number of orderings as expected (E) = 679
Number of tied predictions (T) = 15
Harrell's C = (E + T/2) / P = .8086
Somers' D = .6172
. predict hr
(option hr assumed; relative hazard)
. generate invhr=1/hr
. generate censind=1-_d if _st==1
. somersd _t invhr if _st==1, cenind(censind) tdist transf(c)
Somers' D with variable: _t
Transformation: Harrell's c
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for Harrell's c
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr .8085984 .0425074 19.02 0.000 .7230845 .8941122
. somersd _t invhr if _st==1, cenind(censind) tdist transf(z)
Somers' D with variable: _t
Transformation: Fisher's z
Valid observations: 48
Degrees of freedom: 47
Symmetric 95% CI for transformed Somers' D
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr .7204641 .1373271 5.25 0.000 .4441976 .9967306
Asymmetric 95% CI for untransformed Somers' D
Somers_D Minimum Maximum
invhr .6171967 .41711782 .76021766
This time, the model fit produces the same output as before, and the command
estat concordance produces the same estimates as it did before of 0.8086 and 0.6172
for Harrell's C and Somers' D, respectively. But now the same estimates of 0.8086 and
0.6172 are also produced by somersd, at least after rounding to four decimal places.
It should be stressed that Harrell's C and Somers' D, computed as above either
by somersd or by estat concordance, are valid measures of the predictive power of a
survival model only if there are no time-dependent covariates or lifetimes with delayed
entries. However, if somersd (instead of estat concordance) is used, then sensible
estimates can still be produced with weighted data, so long as those weights are explicitly
supplied to somersd.
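For example (a minimal sketch, not taken from the article), if the dataset contained a
sampling-weight variable, here hypothetically named sweight, the weighted Harrell's C
could be obtained by supplying the weights directly to somersd:
. somersd _t invhr [pweight=sweight] if _st==1, cenind(censind) tdist transf(c)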
3 Comparing predictive powers with training and test
sets
Another caution about the results of the previous section is that the confidence intervals
generated by somersd should not really be taken seriously. This is because, in general,
confidence intervals do not protect the user against the consequences of finding a model
in a dataset and then estimating its parameters in the same dataset. In the case of
Harrell's C and Somers' D of inverse hazard ratios with respect to lifetime, we would
expect this incorrect practice to lead to overly optimistic estimates of predictive power
because we are measuring the predictive power of a model that is optimized for the
dataset in which the predictive power is measured.
We really should be finding models in a training set of data and testing the models'
predictive powers, both absolute and relative to each other, in a test set of data that
is independent of the training set. If we have only one set of data, we might divide
its primary sampling units (randomly or semirandomly) into two subsets, and use the
first subset as the training set and the second subset as the test set. Sections 3.1
and 3.2 below demonstrate this practice by splitting the Stanford drug-trial data into
a training set and a test set of similar sizes, using random subsets and semirandom
stratified subsets, respectively. We will use the somersd policy, rather than the estat
concordance policy, regarding tied censored and noncensored lifetimes.
3.1 Completely random training and test sets
We will first demonstrate the relatively simple practice of splitting the sampling units,
completely at random, into a training set and a test set. We will fit three models to the
training set: model 1, containing the variables drug and age; model 2, containing drug
only; and model 3, containing age only. Next we will use out-of-sample prediction and
somersd to estimate the predictive powers of these three models in the test set. We
will then use lincom to compare their predictive powers, in the manner of section 5.2
of Newson (2006b).
We start by inputting the data and then splitting them, completely at random, into
a training set and a test set. We use the runiform() function to create a uniformly
distributed pseudorandom variable, sort to sort the dataset by this variable, and the
mod() function to allocate alternate observations to the training and test sets (see
[D] sort and [D] functions). We then re-sort the data back to their old order using the
generated variable oldord.
. use http://www.stata-press.com/data/r11/drugtr, clear
(Patient Survival in Drug Trial)
. set seed 987654321
. generate ranord=runiform()
. generate long oldord=_n
. sort ranord, stable
. generate testset=mod(_n,2)
. sort oldord
. tabulate testset, m
testset Freq. Percent Cum.
0 24 50.00 50.00
1 24 50.00 100.00
Total 48 100.00
We see that there are 24 patient lifetimes in the training set (where testset==0)
and 24 in the test set (where testset==1). We then fit the three Cox models to the
training set and create the inverse hazard-rate variables invhr1, invhr2, and invhr3
for models 1, 2 and 3, respectively:
. stcox drug age if testset==0
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -30.207704
Iteration 2: log likelihood = -30.075862
Iteration 3: log likelihood = -30.075741
Refining estimates:
Iteration 0: log likelihood = -30.075741
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(2) = 13.65
Log likelihood = -30.075741 Prob > chi2 = 0.0011
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .1302894 .085747 -3.10 0.002 .0358683 .473269
age 1.139011 .0678588 2.18 0.029 1.013482 1.280089
. predict hr1
(option hr assumed; relative hazard)
. generate invhr1=1/hr1
. stcox drug if testset==0
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -32.692209
Iteration 2: log likelihood = -32.647379
Iteration 3: log likelihood = -32.647309
Refining estimates:
Iteration 0: log likelihood = -32.647309
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(1) = 8.51
Log likelihood = -32.647309 Prob > chi2 = 0.0035
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .1843768 .112761 -2.76 0.006 .0556069 .611341
. predict hr2
(option hr assumed; relative hazard)
. generate invhr2=1/hr2
. stcox age if testset==0
failure _d: died
analysis time _t: studytime
Iteration 0: log likelihood = -36.900079
Iteration 1: log likelihood = -35.587135
Iteration 2: log likelihood = -35.58462
Refining estimates:
Iteration 0: log likelihood = -35.58462
Cox regression -- Breslow method for ties
No. of subjects = 24 Number of obs = 24
No. of failures = 14
Time at risk = 370
LR chi2(1) = 2.63
Log likelihood = -35.58462 Prob > chi2 = 0.1048
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
age 1.082178 .0526849 1.62 0.105 .9836912 1.190526
. predict hr3
(option hr assumed; relative hazard)
. generate invhr3=1/hr3
The variables invhr1, invhr2, and invhr3 are defined for all observations, both in
the training set and in the test set. We then define the censorship indicator, as before,
and estimate the Harrell's C indexes in the test set for all three models fit to the training
set:
. generate censind=1-_d if _st==1
. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist
> transf(c)
Somers' D with variable: _t
Transformation: Harrell's c
Valid observations: 24
Degrees of freedom: 23
Symmetric 95% CI for Harrell's c
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr1 .8819444 .0490633 17.98 0.000 .7804493 .9834396
invhr2 .7916667 .0330999 23.92 0.000 .7231944 .860139
invhr3 .6365741 .0831046 7.66 0.000 .4646592 .808489
We see that Harrell's C of inverse hazard ratio with respect to lifetime is 0.8819 for
model 1 (using both drug treatment and age), 0.7917 for model 2 (using drug treatment
only), and 0.6366 for model 3 (using age only). All of these estimates have confidence
limits, which are probably less unreliable than the ones we saw in the previous section.
However, the sample Harrell's C is likely to have a skewed distribution in the
presence of such strong positive associations, for the same reasons as Kendall's τ_a (see
Daniels and Kendall [1947]). Differences between Harrell's C indexes are likely to have
a less-skewed sampling distribution and are also what we probably really wanted to
know. We estimate these differences with lincom, as follows:
. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .0902778 .0350965 2.57 0.017 .0176751 .1628804
. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .2453704 .0736766 3.33 0.003 .0929586 .3977821
. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .1550926 .0823647 1.88 0.072 -.0152917 .3254769
Model 1 seems to have a slightly higher predictive power than model 2 or (especially)
model 3, while the difference between model 2 and model 3 is slightly less convincing.
We can also do the same comparison using Somers' D rather than Harrell's C, by using
the normalizing and variance-stabilizing z transform, recommended by Edwardes (1995)
and implemented using the somersd option transf(z). In that case, the differences
between the predictive powers of the different models will be expressed in z units (not
shown).
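The command lines for that comparison are sketched below (output omitted, as in the
article); they simply substitute transf(z) for transf(c) and reuse the same lincom
contrasts:
. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist
> transf(z)
. lincom invhr1-invhr2
. lincom invhr1-invhr3
. lincom invhr2-invhr3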
3.2 Stratified semirandom training and test sets
Completely random training and test sets may have the disadvantage that, by chance,
important predictor variables may have different sample distributions in the training
and test sets, making both the training set and the test set less representative of the
sample as a whole and of the total population from which the training and test sets were
sampled. We might feel safer if we chose the training and test sets semirandomly, with
the constraint that the two sets have similar distributions of key predictor variables in
the various models.
In our case, we might want to ensure that both the training set and the test set
contain their fair share of drug-treated older subjects, drug-treated younger subjects,
placebo-treated older subjects, and placebo-treated younger subjects. To ensure this,
we might start by defining sampling strata that are combinations of treatment status
and age group, and split each of these strata as evenly as possible between the training
set and the test set. Again, this requires the dataset to be sorted, and we will afterward
sort it back to its original order. We sort as follows, using the xtile command to define
age groups (see [D] pctile):
. use http://www.stata-press.com/data/r11/drugtr, clear
(Patient Survival in Drug Trial)
. set seed 987654321
. generate ranord=runiform()
. generate long oldord=_n
. xtile agegp=age, nquantiles(2)
. tabulate drug agegp, m
Drug type 2 quantiles of age
(0=placebo) 1 2 Total
0 11 9 20
1 16 12 28
Total 27 21 48
. sort drug agegp ranord, stable
. by drug agegp: generate testset=mod(_n,2)
. sort oldord
. table testset drug agegp, row col scol
2 quantiles of age and Drug type (0=placebo)
1 2 Total
testset 0 1 Total 0 1 Total 0 1 Total
0 5 8 13 4 6 10 9 14 23
1 6 8 14 5 6 11 11 14 25
Total 11 16 27 9 12 21 20 28 48
This time, the training set is slightly smaller than the test set because of odd total
numbers of subjects in sampling strata. We then carry out the model fitting in the
training set and the calculation of inverse hazard ratios in both sets using the same
command sequence as with the completely random training and test sets, producing
mostly similar results, which are not shown. Finally, we estimate the Harrell's C indexes
in the test set:
. generate censind=1-_d if _st==1
. somersd _t invhr1 invhr2 invhr3 if _st==1 & testset==1, cenind(censind) tdist
> transf(c)
Somers' D with variable: _t
Transformation: Harrell's c
Valid observations: 25
Degrees of freedom: 24
Symmetric 95% CI for Harrell's c
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
invhr1 .7911392 .0674598 11.73 0.000 .6519091 .9303694
invhr2 .7257384 .049801 14.57 0.000 .6229542 .8285226
invhr3 .5780591 .0972101 5.95 0.000 .3774274 .7786908
The C estimates for the three models are not dissimilar to the previous ones with
completely random training and test sets. Their pairwise differences are as follows:
. lincom invhr1-invhr2
( 1) invhr1 - invhr2 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .0654008 .0491405 1.33 0.196 -.0360202 .1668219
. lincom invhr1-invhr3
( 1) invhr1 - invhr3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .2130802 .0763467 2.79 0.010 .0555084 .3706519
. lincom invhr2-invhr3
( 1) invhr2 - invhr3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .1476793 .1080388 1.37 0.184 -.0753017 .3706603
Model 1 (with drug treatment and age) still seems to predict better than model 3
(with age alone). This conclusion is similar if we compare the z-transformed Somers' D
values, which are not shown.
4 Extensions to non-Cox survival models
Measuring predictive power using Harrell's C and Somers' D is not restricted to Cox
models, but can be applied to any model with a positive or negative ordinal predictor.
The streg command (see [ST] streg) fits a wide range of survival models, each of
which has a wide choice of predictive output variables, which can be computed using
predict (see [ST] streg postestimation). These output variables may predict survival
times positively or negatively on an ordinal scale and may include median survival times,
mean survival times, median log survival times, mean log survival times, hazards, hazard
ratios, or linear predictors.
We will briefly demonstrate the principles involved by fitting Gompertz models to
the survival dataset that we used in previous sections. The Gompertz model assumes an
exponentially increasing (or decreasing) hazard rate, and the linear predictor is the log
of the zero-time baseline hazard rate, whereas the rate of increase (or decrease) in hazard
rate, after time zero, is a nuisance parameter. Therefore, if the Gompertz model is true,
then so is the Cox model. However, the argument of Fisher (1935) presumably implies
that if the Gompertz model is true, then we can be no less efficient, asymptotically, by
fitting a Gompertz model instead of a Cox model. We will use the predicted median
lifetime as the positive predictor, whose predictive power will be assessed using somersd.
We start by inputting the cancer trial dataset and defining the stratified, semirandom
training and test sets, exactly as we did in section 3.2. We then fit to the training
set Gompertz models 1, 2, and 3, containing, respectively, both drug treatment and
age, drug treatment only, and age only. After fitting each of the three models, we
use predict to compute the predicted median survival time for the whole sample,
deriving the alternative positive lifetime predictors medsurv1, medsurv2, and medsurv3
for models 1, 2, and 3, respectively:
. streg drug age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(2) = 20.62
Log likelihood = -14.076214 Prob > chi2 = 0.0000
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .0948331 .0594575 -3.76 0.000 .0277512 .3240694
age 1.172588 .0616365 3.03 0.002 1.057798 1.299836
/gamma .1553139 .0430892 3.60 0.000 .0708605 .2397672
. predict medsurv1
(option median time assumed; predicted median time)
. streg drug if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(1) = 11.02
Log likelihood = -18.873214 Prob > chi2 = 0.0009
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
drug .153411 .0877048 -3.28 0.001 .0500295 .4704213
/gamma .1063648 .0361612 2.94 0.003 .0354901 .1772394
. predict medsurv2
(option median time assumed; predicted median time)
. streg age if testset==0, distribution(gompertz) nolog
failure _d: died
analysis time _t: studytime
Gompertz regression -- log relative-hazard form
No. of subjects = 23 Number of obs = 23
No. of failures = 15
Time at risk = 338
LR chi2(1) = 5.56
Log likelihood = -21.606438 Prob > chi2 = 0.0184
_t Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
age 1.117255 .0516156 2.40 0.016 1.020536 1.223142
/gamma .088458 .0341184 2.59 0.010 .0215871 .1553288
. predict medsurv3
(option median time assumed; predicted median time)
Unsurprisingly, the fitted parameters are not dissimilar to the corresponding param-
eters for the Cox regression. We then compute the censorship indicator censind, and
then the Harrell's C indexes, for the test set:
. generate censind=1-_d if _st==1
. somersd _t medsurv1 medsurv2 medsurv3 if _st==1 & testset==1, cenind(censind)
> tdist transf(c)
Somers' D with variable: _t
Transformation: Harrell's c
Valid observations: 25
Degrees of freedom: 24
Symmetric 95% CI for Harrell's c
Jackknife
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
medsurv1 .7911392 .0674598 11.73 0.000 .6519091 .9303694
medsurv2 .7257384 .049801 14.57 0.000 .6229542 .8285226
medsurv3 .5780591 .0972101 5.95 0.000 .3774274 .7786908
We then compare the Harrell's C parameters for the alternative median survival
functions, using lincom, just as before:
. lincom medsurv1-medsurv2
( 1) medsurv1 - medsurv2 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .0654008 .0491405 1.33 0.196 -.0360202 .1668219
. lincom medsurv1-medsurv3
( 1) medsurv1 - medsurv3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .2130802 .0763467 2.79 0.010 .0555084 .3706519
. lincom medsurv2-medsurv3
( 1) medsurv2 - medsurv3 = 0
_t Coef. Std. Err. t P>|t| [95% Conf. Interval]
(1) .1476793 .1080388 1.37 0.184 -.0753017 .3706603
Unsurprisingly, the conclusions for the Gompertz model are essentially the same as
those for the Cox model.
5 Further extensions
The use of Harrell's C and Somers' D in test sets to compare the power of models
fit to training sets can be extended further to nonsurvival regression models. In this
case, life is even simpler because we do not have to define a censorship indicator such
as censind for input to somersd. The predictive score is still computed using out-of-
sample prediction and can be either the fitted regression value or the linear predictor
(if one exists in the model).
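As a minimal sketch (not taken from the article) of that workflow for an uncensored
continuous outcome, assume hypothetical variables y, x1, and x2 and a training/test
indicator testset defined as in section 3.1; no cenind() option is then needed:
. regress y x1 x2 if testset==0
. predict score1, xb
. regress y x1 if testset==0
. predict score2, xb
. somersd y score1 score2 if testset==1, tdist transf(c)
. lincom score1-score2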
The methods presented so far have the limitation that the Harrell's C and Somers' D
parameters that we calculated estimate only the ordinal predictive power (in the pop-
ulation from which the training and test sets were sampled) of the precise model that
we fit to the training set. We might prefer to estimate the mean predictive power that
we can expect (in the whole universe of possible training and test sets) using the same
set of alternative models. Bootstrap-like methods for doing this, involving repeated
splitting of the same sample into training and test sets, are described in Harrell et al.
(1982) and Harrell, Lee, and Mark (1996).
Another limitation of the methods presented here, as mentioned at the end of sec-
tion 2, is that they should not usually be used with models with time-dependent co-
variates. This is because the predicted variable input to somersd, which the alternative
predictive scores are competing to predict, is the length of a lifetime rather than an
event of survival or nonsurvival through a minimal time interval, such as a day. A
predictor variable for such a lifetime must therefore stay constant, at least through that
lifetime, which rules out functions of continuously varying time-dependent covariates.
In Stata, survival-time datasets may have multiple observations for each subject
with a lifetime, representing multiple sublifetimes. Discretely varying time-dependent
covariates, which remain constant through a sublifetime, can also be included in such
datasets. somersd can therefore be used when these conditions are met: the model
is a Cox regression, the time-dependent covariates vary only discretely, the multiple
sublifetimes are the times spent by a subject in an age group, and each subject becomes
at risk at the start of each age group to which she or he survives. If the subject
identifier variable is named subid, and the age group for each sublifetime is represented
by a discrete variable agegp, then the user may use somersd with cluster(subid)
funtype(bcluster) wstrata(agegp) to calculate Somers' D or Harrell's C estimates
restricted to comparisons between sublifetimes of different subjects in the same age
group. See Newson (2006b) for details of the options for somersd, and see [ST] stset
for details on survival-time datasets.
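For illustration only (a minimal sketch, not taken from the article), such a call might
look as follows, where score is a hypothetical predictive score defined on each sublifetime
and censind is a censorship indicator constructed as in section 2:
. somersd _t score if _st==1, cenind(censind) cluster(subid) funtype(bcluster)
> wstrata(agegp) tdist transf(c)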
If the user has access to sufficient data-storage space, then the age groups can be
defined finely (as subject-years or even subject-days), and the discretely time-dependent
covariates might therefore be very nearly continuously time-dependent. Any training
sets or test sets in this case should, of course, be sets of subjects rather than sets of
lifetimes.
6 Acknowledgments
I would like to thank Samia Mora, MD, of Partners HealthCare, for sending me the
query that prompted me to write this article. I also thank the many other Stata users
who have also contacted me over the past few years with essentially similar queries on
how to use somersd to compare the predictive powers of survival models.
7 References
Daniels, H. E., and M. G. Kendall. 1947. The significance of rank correlation where
parental correlation exists. Biometrika 34: 197–208.
Edwardes, M. D. 1995. A confidence interval for Pr(X < Y) − Pr(X > Y) estimated
from simple cluster samples. Biometrics 51: 571–578.
Fisher, R. A. 1935. The logic of inductive inference. Journal of the Royal Statistical
Society 98: 39–82.
Harrell, F. E., Jr., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982.
Evaluating the yield of medical tests. Journal of the American Medical Association
247: 2543–2546.
Harrell, F. E., Jr., K. L. Lee, and D. B. Mark. 1996. Multivariable prognostic models:
Issues in developing models, evaluating assumptions and adequacy, and measuring
and reducing errors. Statistics in Medicine 15: 361–387.
Newson, R. 2002. Parameters behind nonparametric statistics: Kendall's tau,
Somers' D and median differences. Stata Journal 2: 45–64.
———. 2006a. Confidence intervals for rank statistics: Percentile slopes, differences,
and ratios. Stata Journal 6: 497–520.
———. 2006b. Confidence intervals for rank statistics: Somers' D and extensions. Stata
Journal 6: 309–334.
———. 2006c. Efficient calculation of jackknife confidence intervals for rank statistics.
Journal of Statistical Software 15: 1–10.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
About the author
Roger Newson is a lecturer in medical statistics at Imperial College London, London, UK,
working principally in asthma research. He wrote the somersd and parmest Stata packages.
The Stata Journal (2010)
10, Number 3, pp. 359–368
Using Stata with PHASE and Haploview:
Commands for importing and exporting data
J. Charles Huber Jr.
Department of Epidemiology and Biostatistics
Texas A&M Health Science Center School of Rural Public Health
College Station, TX
jchuber@srph.tamhsc.edu
Abstract. Modern genetics studies require the use of many specialty software
programs for various aspects of the statistical analysis. PHASE is a program often
used to reconstruct haplotypes from genotype data, and Haploview is a program
often used to visualize and analyze single nucleotide polymorphism data. Three
new commands are described for performing these three steps: 1) exporting geno-
type data stored in Stata to PHASE, 2) importing the resulting inferred haplotypes
back into Stata, and 3) exporting the haplotype/single nucleotide polymorphism
data from Stata to Haploview.
Keywords: st0199, phaseout, phasein, haploviewout, genetics, haplotypes, SNPs,
PHASE, haploview
1 Introduction
For a variety of reasons, including favorable power for detecting small effects and
the low cost of genotyping, association studies based on single nucleotide polymor-
phism (SNP, pronounced snip) markers have become common in genetic epidemiology
(Cordell and Clayton 2005). SNP markers are positions along a chromosome that can
have four forms called alleles: adenine, cytosine, guanine, and thymine, which are de-
noted A, C, G, and T, respectively. Humans are diploid organisms, meaning that we
have two copies of each of our chromosomes; thus each SNP is composed of a pair of
alleles called a genotype.
For example, a SNP might have an adenine (A) molecule on one chromosome paired
with a cytosine (C) molecule on the other chromosome. This is often described as an
A/C genotype. When two SNP markers are physically close to one another, a pair of
alleles found on the same chromosome forms a haplotype. For example, a person might
have an A/C genotype for SNP1 and a G/T genotype for SNP2. If the A allele from SNP1
and the G allele from SNP2 are physically located on the same chromosome, they are
said to form an AG haplotype. Similarly, the C allele from SNP1 and the T allele from
SNP2 would form a CT haplotype.
© 2010 StataCorp LP st0199
It has been shown that association studies based on haplotypes are often more pow-
erful than similar studies based on individual SNPs (Akey, Jin, and Xiong 2001). Unfor-
tunately, haplotypes are not observed directly using typical low-cost, high-throughput
laboratory techniques. However, haplotypes can be inferred statistically based on the
observed genotypes.
David G. Clayton of the University of Cambridge has written a useful command for
Stata (snp2hap) that infers haplotypes for pairs of SNPs. In theory, this program could
be used iteratively to infer haplotypes across many SNPs. However, several sophisticated
algorithms have been developed for statistically inferring haplotypes from many SNP
genotypes simultaneously. These algorithms and the software that implement them have
been reviewed and compared elsewhere (Marchini et al. 2006; Stephens and Donnelly
2003; The International HapMap Consortium 2005). In most comparisons, the algo-
rithm used in the PHASE program (Stephens, Smith, and Donnelly 2001) was found to
be the most accurate and is arguably the most frequently used.
Rather than attempt the daunting task of creating a Stata command to imple-
ment the algorithm used in PHASE, a Stata command (phaseout) was developed for
exporting genotype data stored in Stata to an ASCII file formatted as a PHASE input
file. A second program (phasein) was developed to import the inferred haplotype data
back into Stata for subsequent association analyses with programs such as haplologit
(Marchenko et al. 2008). These commands use a group of Stata's low-level file commands
including file open, file write, file read, and file close.
Once the haplotypes have been inferred for a set of genotypes, one would often like
to know certain attributes of the haplotypes. For example, the alleles of some pairs of
SNPs along a haplotype may tend to be transmitted together from parent to offspring
more frequently than alleles of other pairs of SNPs. This phenomenon, known as linkage
disequilibrium (Devlin and Risch 1995), is often quantified by the r² or D′ statistics.
Similarly, some contiguous groups of SNPs, often called haplotype blocks, may exhibit
high levels of pairwise linkage disequilibrium (Gabriel et al. 2002; Goldstein 2001). High
levels of linkage disequilibrium between two SNPs indicate that much of their statistical
information is redundant, so both SNPs are not necessary for association analyses. One
of the SNPs, called a tagSNP (Zhang et al. 2004), can be selected using one of several
algorithms. A tagSNP can be used in place of the group of redundant SNPs. Typically,
there are several tagSNPs in a group of contiguous SNPs found on a chromosome.
Haploview (Barrett et al. 2005) is a popular software package used for calculating
and visualizing the linkage disequilibrium statistics r² and D′, as well as for identifying
haplotype blocks and tagSNPs. The new Stata haploviewout program exports haplotype
data from Stata to a pair of ASCII files formatted as Haploview input files: a haps
format data file and a haps format locus information file.
The dataset used for the following examples was downloaded from the SeattleSNPs
website (SeattleSNPs 2009) and was modified to include missing data. Genotypes for
47 individuals of African and European descent include 22 SNPs from the vascular
endothelial growth factor (VEGF) gene located on chromosome six.
2 The phaseout command
Genotype data stored in Stata are often formatted in a way that is similar to the
following example. In this example, the variable id contains individual identification
numbers, and the variables rs1413711, rs3024987, and rs3024989 contain data on
three SNPs. The genotype X/X indicates that the genotype is missing. The following
example uses fictitious data:
. list id rs1413711 rs3024987 rs3024989 in 1/2
id rs1413711 rs3024987 rs3024989
1. D001 C/C C/T T/T
2. D002 C/T X/X T/T
The input file for PHASE requires the data to be formatted in an ASCII file that
contains header information about the number of samples and the number and types of
markers (SNP or multiallelic), as well as the actual data:
47 (There are 47 samples in the entire file.)
3 (There are three markers in the file.)
P 674 836 1955 (Positions are listed.)
SSS (All three markers are biallelic SNPs.)
D001 (The data begin with the first ID.)
C C T (The genotype data are stored in two rows.)
C T T (These are not haplotypes yet.)
D002 (The data begin with the second ID.)
C ? T (The missing SNP data are
T ? T stored as question marks.)
The phaseout command calculates the header information, converts the ID and
genotype data to rows, and writes this data to the ASCII file. The types of markers
(SNPs or multiallelic markers) are automatically determined by tabulating the genotypes
and by examining the length of the genotype in the first record. If a marker has three
or fewer genotypes (for example, C/C, C/T, T/T) and the length of the genotype in the
first record is fewer than five alleles, the marker is treated as a SNP. All other markers
are treated as multiallelic.
2.1 Syntax
phaseout SNPlist, idvariable(varname) filename(filename) [missing(string)
separator(string) positions(string)]
SNPlist is a list of variables containing SNP genotypes.
2.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the ASCII file that will be created. It is
conventional, though not necessary, to name PHASE input files with the extension .inp.
missing(string) may be used to provide a list of genotypes that indicate missing data.
For example, missing data might be included in the dataset as X/X for SNPs and
as 999/999 for multiallelic markers. Multiple missing values may be specified by
placing a space between them (for example, missing("X/X 9/9 999/999")). PHASE
requires missing SNP alleles to be coded as ? and missing multiallelic alleles to
be coded as -1. It is not necessary to preprocess your data because phaseout
will automatically convert each genotype contained in the missing() list to its
appropriate PHASE missing value.
separator(string) specifies the separator to use when storing genotype data. Genotype
data are often stored with a separator between the two alleles. For data stored in
the format C/G, the separator() option would look like separator("/"). If SNP
data are stored without a separator (for example, CG) then the separator() option
is unnecessary, and phaseout will assume that the left character is allele 1 and the
right character is allele 2.
positions(string) provides a list of the marker positions for use by PHASE when inferring
haplotypes from the genotype data. If the positions() option is not specified,
PHASE will assume that the markers are equally spaced.
2.3 Output files
phaseout saves two ASCII files for subsequent use by the commands phasein and
haploviewout:
MarkerList.txt contains a space-delimited list of marker names.
PositionList.txt contains a space-delimited list of marker positions.
2.4 Examples
Markers and positions may be specified in the command itself:
. phaseout rs1413711 rs3024987 rs3024989, idvariable("id") filename("VEGF.inp")
> missing("X/X 9/9") positions("674 836 1955") separator("/")
phaseout may use markers and positions saved in local macros:
. local SNPList "rs1413711 rs3024987 rs3024989 rs833068 rs3024990"
. local PositionsList "674 836 1955 2523 3031"
. phaseout `SNPList', idvariable("id") filename("VEGF.inp") missing("X/X 9/9")
> positions(`PositionsList') separator("/")
3 The phasein command
PHASE saves the inferred haplotypes for each pair of chromosomes in a file with the
extension .out, and because there is a great deal of other information saved in the file,
the phasein command uses the keywords BEGIN BESTPAIRS1 and END BESTPAIRS1 to
identify the part of the file that contains the haplotypes:
BEGIN BESTPAIRS1
0 D001
C C T
C T T
......
......
0 E023
C C T
C C T
END BESTPAIRS1.
The data are imported into Stata in long format with one row per chromosome
(two rows per ID). The haplotypes are imported into a variable named haplotype, and
each of the markers that make up the haplotype are saved in an individual variable.
If the markers() option is specified, the marker variables will be renamed using their
original names.
. list id haplotype rs1413711 rs3024987 rs3024989 in 1/2
id haplotype rs1413711 rs3024987 rs3024989
1. D001 CCT C C T
2. D001 CTT C T T
If the positions() option is used, the positions will be placed in the variable label
of each marker variable:
. describe
Contains data from VEGF_Haplotypes.dta
obs: 94
vars: 5 8 Jul 2010 13:09
size: 1,692 (99.9% of memory free)
storage display value
variable name type format label variable label
id str4 %9s
haplotype str3 %9s
rs1413711 str1 %9s position=674
rs3024987 str1 %9s position=836
rs3024989 str1 %9s position=1955
Sorted by:
3.1 Syntax
phasein PhaseOutputFile [, markers(filename) positions(filename)]
PhaseOutputFile is the name of the PHASE output file that contains the inferred
haplotypes. It will have the file extension .out.
3.2 Options
markers(filename) allows the user to specify an ASCII file that contains the names of
the markers included in the haplotype. If the original genotype data were exported
to PHASE using the phaseout command, the marker names will be automatically
saved to a file named MarkerList.txt. If that is the case, then the option would
be markers("MarkerList.txt"). Alternatively, the user can save a space-delimited
list of marker names in an ASCII file and use the markers("filename.txt") option.
positions(filename) allows the user to specify an ASCII file that contains the positions
of the markers. If the original genotype data were exported to PHASE using
the phaseout command, the marker positions will be automatically saved to
a file named PositionList.txt. If that is the case, then the option would be
positions("PositionList.txt"). Alternatively, the user can save a space-delimited
list of marker positions in an ASCII file and use the positions("filename.txt")
option.
3.3 Examples
Using the default files created by phaseout:
. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")
Using the files created by the user:
. phasein VEGF.out, markers("UserMarkerList.txt") positions("UserPositionList.txt")
4 The haploviewout command
The haploviewout command exports haplotype data from Stata to a pair of files. The
file Filename_DataInput.txt contains the marker data for each individual, with the
alleles recoded as follows: missing = 0, A = 1, C = 2, G = 3, and T = 4.
D001 D001 2 2 4
D001 D001 2 4 4
D002 D002 2 2 4
D002 D002 4 2 4
The file Filename_MarkerInput.txt contains the marker names and positions in
two columns:
rs1413711 674
rs3024987 836
rs3024989 1955
4.1 Syntax
haploviewout SNPlist, idvariable(varname) filename(filename) [positions(string)
familyid(variable) poslabel]
SNPlist is a list of SNP variables in long format (that is, one row per chromosome).
If your data are in wide format, you can convert them to long format by using the
reshape command.
Haploview will not accept multiallelic markers.
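As a rough sketch (not taken from the command's documentation), assuming a hypothetical
wide layout in which each SNP is stored as two allele variables suffixed by chromosome
number (for example, rs1413711_1 and rs1413711_2), the conversion might look like this:
. reshape long rs1413711_ rs3024987_ rs3024989_, i(id) j(chromosome)
The resulting long-format variables keep the trailing underscore, so each would then be
renamed (for example, rename rs1413711_ rs1413711) to match the names supplied in
SNPlist.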
4.2 Options
idvariable(varname) is required to specify the variable that contains the individual
identifiers.
filename(filename) is required to name the two ASCII files that will be created. Those
files will have the extensions _DataInput.txt and _MarkerInput.txt appended
to filename. For example, the filename("VEGF") option will create a file named
VEGF_DataInput.txt and a file named VEGF_MarkerInput.txt. To open the files in
Haploview, select File > Open new data and click on the tab labeled Haps Format.
Click on the Browse button next to the box labeled Data File and select the
file VEGF_DataInput.txt. Next click on the Browse button next to the box labeled
Locus Information File and select the file VEGF_MarkerInput.txt.
positions(string) allows the user to specify a space-delimited list of the marker posi-
tions.
familyid(variable) allows the user to specify the variable that contains family identi-
fiers if relatives are included in the dataset. If familyid() is omitted, the
idvariable() will be automatically substituted for the familyid().
poslabel will automatically extract the SNP positions from the variable label of each
SNP if the haplotype data were created using the commands phaseout and phasein.
The positions for each marker are stored in the variable label of each SNP.
4.3 Examples
Using the default files created by phaseout:
. phaseout rs1413711 rs3024987 rs3024989, idvariable("id") filename("VEGF.inp")
> missing("X/X 9/9") positions("674 836 1955") separator("/")
. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")
. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")
> poslabel
Using the files created by the user:
. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")
> positions("674 836 1955")
5 Discussion
Many young and rapidly evolving fields of inquiry, including genetic association studies,
use a variety of boutique software packages. While it would be very convenient to have
Stata commands that accomplish the same tasks, the time and programming expertise
required would make this impractical. However, a suite of commands that
allows easy exporting and importing of data from Stata to other specialized software
seems to be an efficient way for Stata users to accomplish specialized analytical tasks.
6 Acknowledgments
This work was supported in part by grant 1 R01 DK073618-02 from the National In-
stitute of Diabetes and Digestive and Kidney Diseases and by grant 2006-35205-16715
from the United States Department of Agriculture. The author would like to thank
Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M
College of Veterinary Medicine and Roger Newson of the Imperial College London for
their very useful feedback.
7 References
Akey, J., L. Jin, and M. Xiong. 2001. Haplotypes vs single marker linkage disequilibrium
tests: What do we gain? European Journal of Human Genetics 9: 291–300.
Barrett, J. C., B. Fry, J. Maller, and M. J. Daly. 2005. Haploview: Analysis and
visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
Cordell, H. J., and D. G. Clayton. 2005. Genetic association studies. Lancet 366:
1121–1131.
Devlin, B., and N. Risch. 1995. A comparison of linkage disequilibrium measures for
fine-scale mapping. Genomics 29: 311–322.
Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel,
J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi,
A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler. 2002.
The structure of haplotype blocks in the human genome. Science 296: 2225–2229.
Goldstein, D. B. 2001. Islands of linkage disequilibrium. Nature Genetics 29: 109–111.
Marchenko, Y. V., R. J. Carroll, D. Y. Lin, C. I. Amos, and R. G. Gutierrez. 2008.
Semiparametric analysis of case–control genetic data in the presence of environmental
factors. Stata Journal 8: 305–333.
Marchini, J., D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin,
Z. S. Qin, H. M. Munro, G. R. Abecasis, P. Donnelly, and The International HapMap
Consortium. 2006. A comparison of phasing algorithms for trios and unrelated
individuals. American Journal of Human Genetics 78: 437–450.
SeattleSNPs. 2009. NHLBI Program for Genomic Applications.
http://pga.gs.washington.edu.
Stephens, M., and P. Donnelly. 2003. A comparison of Bayesian methods for haplotype
reconstruction from population genotype data. American Journal of Human Genetics
73: 1162–1169.
Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype
reconstruction from population data. American Journal of Human Genetics 68:
978–989.
The International HapMap Consortium. 2005. A haplotype map of the human genome.
Nature 437: 1299–1320.
Zhang, K., Z. S. Qin, J. S. Liu, T. Chen, M. S. Waterman, and F. Sun. 2004. Haplotype
block partitioning and tag SNP selection using genotype data and their applications
to association studies. Genome Research 14: 908–916.
About the author
Chuck Huber is an assistant professor of biostatistics at the Texas A&M Health Science Center
School of Rural Public Health in the Department of Epidemiology and Biostatistics. He works
on projects in a variety of topical areas, but his primary area of interest is statistical genetics.
The Stata Journal (2010)
10, Number 3, pp. 369–385
simsum: Analyses of simulation studies
including Monte Carlo error
Ian R. White
MRC Biostatistics Unit
Institute of Public Health
Cambridge, UK
ian.white@mrc-bsu.cam.ac.uk
Abstract. A new Stata command, simsum, analyzes data from simulation studies.
The data may comprise point estimates and standard errors from several analysis
methods, possibly resulting from several different simulation settings. simsum can
report bias, coverage, power, empirical standard error, relative precision, average
model-based standard error, and the relative error of the standard error. Monte
Carlo errors are available for all of these estimated quantities.
Keywords: st0200, simsum, simulation, Monte Carlo error, normal approximation,
sandwich variance
1 Introduction
Simulation studies are an important tool for statistical research (Burton et al. 2006), but
they are often poorly reported. In particular, to understand the role of chance in results
of simulation studies, it is important to estimate the Monte Carlo (MC) error, defined
as the standard deviation of an estimated quantity over repeated simulation studies.
However, this error is often not reported: Koehler, Brown, and Haneuse (2009) found
that of 323 articles reporting the results of a simulation study in Biometrics, Biometrika,
and the Journal of the American Statistical Association in 2007, only 8 articles reported
the MC error.
This article describes a new Stata command, simsum, that facilitates analyses of
simulated data. simsum analyzes simulation studies in which each simulated dataset
yields point estimates by one or more analysis methods. Bias, empirical standard error
(SE), and precision relative to a reference method can be computed for each method. If,
in addition, model-based SEs are available, then simsum can compute the average model-
based SE, the relative error in the model-based SE, the coverage of nominal confidence
intervals, and the power to reject a null hypothesis. MC errors are available for all
estimated quantities.
© 2010 StataCorp LP st0200
2 The simsum command
2.1 Syntax
simsum accepts data in wide or long format.
In wide format, data contain one record per simulated dataset, with results from
multiple analysis methods stored as different variables. The appropriate syntax is

    simsum estvarlist [if] [in] [, true(expression) options]

where estvarlist is a varlist containing point estimates from one or more analysis methods.

In long format, data contain one record per analysis method per simulated dataset,
and the appropriate syntax is

    simsum estvarname [if] [in] [, true(expression) methodvar(varname) id(varlist) options]

where estvarname is a variable containing the point estimates, methodvar(varname)
identifies the method, and id(varlist) identifies the simulated dataset.
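For illustration only, a wide-format call with two methods whose point estimates and SEs are held in hypothetical variables est_a, est_b, se_a, and se_b (with a hypothetical true value of 0) might look like this:

. simsum est_a est_b, true(0) se(se_a se_b) mcse

The same results stacked in long format, with hypothetical variables est, se, method, and dsid, might instead be analyzed as

. simsum est, true(0) se(se) methodvar(method) id(dsid) mcse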
2.2 Options
Main options
true(expression) gives the true value of the parameter. This option is required for
calculations of bias and coverage.
methodvar(varname) specifies that the data are in long format and that each record
represents one analysis of one simulated dataset using the method identified by
varname. The id() option is required with methodvar(). If methodvar() is not
specified, the data must be in wide format, and each record represents all analyses
of one simulated dataset.
id(varlist) uniquely identifies the dataset used for each record, within levels of any
by-variables. This is a required option in the long format. The methodvar() option
is required with id().
se(varlist) lists the names of the variables containing the SEs of the point estimates.
For data in long format, this is a single variable.
seprefix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given prefix to the names of the variables
containing the point estimates. seprefix() may be combined with sesuffix(string)
but not with se(varlist).
sesuffix(string) specifies that the names of the variables containing the SEs of the
point estimates be formed by adding the given suffix to the names of the variables
containing the point estimates. sesuffix() may be combined with seprefix(string)
but not with se(varlist).
Data-checking options
graph requests a descriptive graph of SEs against point estimates.
nomemcheck turns off checking that adequate memory is free. This check aims to avoid
spending calculation time when simsum is likely to fail because of lack of memory.
max(#) specifies the maximum acceptable absolute value of the point estimates, standardized
to mean 0 and standard deviation 1. The default is max(10).
semax(#) specifies the maximum acceptable value of the SE as a multiple of the mean
SE. The default is semax(100).
dropbig specifies that point estimates or SEs beyond the maximum acceptable values
be dropped; otherwise, the command halts with an error. Missing values are always
dropped.
nolistbig suppresses listing of point estimates and SEs that lie outside the acceptable
limits.
listmiss lists observations with missing point estimates or SEs.
Calculation options
level(#) specifies the confidence level for coverages and powers. The default is
level(95) or as set by set level; see [R] level.
by(varlist) summarizes the results by varlist.
mcse reports MC errors for all summaries.
robust requests robust MC errors (see section 4) for the statistics empse, relprec, and
relerror. The default is MC errors based on an assumption of normally distributed
point estimates. robust is only useful if mcse is also specified.
modelsemethod(rmse | mean) specifies whether the model SE should be summarized as
the root mean squared value (modelsemethod(rmse), the default) or as the arithmetic
mean (modelsemethod(mean)).
ref(string) specifies the reference method against which relative precisions will be calculated.
With data in wide format, string must be a variable name. With data in
long format, string must be a value of the method variable; if the value is labeled,
the label must be used.
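As a further hedged sketch of these calculation options (the variable names est_a, est_b, se_a, se_b, and scenario are again hypothetical), relative precision against a reference method, computed separately within levels of a grouping variable, could be requested as

. simsum est_a est_b, true(0) se(se_a se_b) ref(est_a) by(scenario) mcse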
Options specifying degrees of freedom
The number of degrees of freedom is used in calculating coverages and powers.
df(string) specifies the degrees of freedom. It may contain a number (to apply to all
methods), a variable name, or a list of variables containing the degrees of freedom
for each method.
dfprefix(string) specifies that the names of the variables containing the degrees of
freedom be formed by adding the given prefix to the names of the variables containing
the point estimates. dfprefix() may be combined with dfsuffix(string) but not
with df(string).
dfsuffix(string) specifies that the names of the variables containing the degrees of
freedom be formed by adding the given suffix to the names of the variables containing
the point estimates. dfsuffix() may be combined with dfprefix(string) but not
with df(string).
Statistic options
If none of the following options are specified, then all available statistics are computed.
bsims reports the number of simulations with nonmissing point estimates.
sesims reports the number of simulations with nonmissing SEs.
bias estimates the bias in the point estimates.
empse estimates the empirical SE, defined as the standard deviation of the point estimates.
relprec estimates the relative precision, defined as the inverse squared ratio of the
empirical SE of this method to the empirical SE of the reference method. This
calculation is slow; omitting it can reduce run time by up to 90%.
modelse estimates the model-based SE. See modelsemethod() above.
relerror estimates the proportional error in the model-based SE, using the empirical
SE as the gold standard.
cover estimates the coverage of nominal confidence intervals at the specified level.
power estimates at the specified level the power to reject the null hypothesis that the
true parameter is zero.
Output options
clear loads the summary data into memory.
saving(filename) saves the summary data into filename.
nolist suppresses listing of the results and is allowed only when clear or saving() is
specified.
listsep lists results using one table per statistic, giving output that is narrower and
better formatted. The default is to list the results as a single table.
format(string) specifies the format for printing results and saving summary data. If
listsep is also specified, then up to three formats may be specified: 1) for results
on the scale of the original estimates (bias, empse, and modelse), 2) for percentages
(relprec, relerror, cover, and power), and 3) for integers (bsims and sesims).
The default is the existing format of the (first) estimate variable for 1 and 2 and
%7.0f for 3.
sepby(varlist) invokes this list option when printing results.
abbreviate(#) invokes this list option when printing results.
gen(string) specifies the prefix for new variables identifying the different statistics in
the output dataset. gen() is only useful with clear or saving(). The default is
gen(stat) so that the new identifiers are, for example, statnum and statcode.
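As a brief hedged sketch of the output options (the filename is hypothetical), the summary statistics could be written to disk for later reformatting, with the on-screen listing suppressed:

. simsum est_a est_b, true(0) se(se_a se_b) saving(simsum_results) nolist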
3 Example
This example is based on, but distinct from, a simulation study comparing different
ways to handle missing covariates when fitting a Cox model (White and Royston 2009).
One thousand datasets were simulated, each containing normally distributed covariates
x and z and a time-to-event outcome. Both covariates had 20% of their values deleted
independently of all other variables so the data became missing completely at random
(Little and Rubin 2002). Each simulated dataset was analyzed in three ways. A Cox
model was fit to the complete cases (CC). Then two methods of multiple imputation using
chained equations (van Buuren, Boshuizen, and Knook 1999), implemented in Stata as
ice (Royston 2004, 2009), were used. The MI LOGT method multiply imputes the missing
values of x and z with the outcome included as log(t) and d, where t is the survival time
and d is the event indicator. The MI T method is the same except that log(t) is replaced
by t in the imputation model. The results are stored in long format, with variable
dataset identifying the simulated dataset number, string variable method identifying
the method used, variable b holding the point estimate, and variable se holding the SE.
The data start like this:
dataset method b se
1. 1 CC .7067682 .14651
2. 1 MI_T .6841882 .1255043
3. 1 MI_LOGT .7124795 .1410814
4. 2 CC .3485008 .1599879
5. 2 MI_T .4060082 .1409831
6. 2 MI_LOGT .4287003 .1358589
7. 3 CC .6495075 .1521568
8. 3 MI_T .5028701 .130078
9. 3 MI_LOGT .5604051 .1168512
They are then summarized thus:
. summarize
Variable Obs Mean Std. Dev. Min Max
dataset 3000 500.5 288.7231 1 1000
method 0
b 3000 .5054995 .1396257 -.1483829 1.004529
se 3000 .1375334 .0183683 .0907097 .2281933
simsum produces the following output:
. simsum b, se(se) methodvar(method) id(dataset) true(0.5) mcse
> format(%6.3f %6.1f %6.0f) listsep
Reshaping data to wide format ...
Starting to process results ...
Non-missing point estimates
CC MI_LOGT MI_T
1000 1000 1000
Non-missing standard errors
CC MI_LOGT MI_T
1000 1000 1000
Bias in point estimate
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
0.017 0.005 0.001 0.004 -0.001 0.004
Empirical standard error
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
0.151 0.003 0.132 0.003 0.134 0.003
% gain in precision relative to method CC
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
. . 31.0 3.9 26.4 3.8
RMS model-based standard error
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
0.147 0.001 0.135 0.001 0.134 0.001
Relative % error in standard error
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
-2.7 2.2 2.2 2.3 -0.4 2.3
Coverage of nominal 95% confidence interval
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
94.3 0.7 94.9 0.7 94.3 0.7
Power of 5% level test
CC (MCse) MI_LOGT (MCse) MI_T (MCse)
94.6 0.7 96.9 0.5 96.3 0.6
Some points of interest include the following:
Table 3: CC has small-sample bias away from the null.
Tables 4 and 5: CC is inefficient compared with MI LOGT and MI T.
Comparing tables 4 and 6 shows that model-based SEs are close to the empirical
values. This is shown more directly in table 7.
Table 8: Coverage of nominal 95% confidence intervals also seems fine, which is
not surprising in view of the lack of bias and good model-based SEs.
Table 9: CC lacks power compared with MI LOGT and MI T, which is not surprising
in view of its inefficiency.
If different formatting of the results is required, the results can be loaded into memory
using the clear option and can then be manipulated.
4 Formulas
Assume that the true parameter is $\theta$ and that the $i$th simulated dataset ($i = 1, \dots, n$)
yields a point estimate $\hat\theta_i$ with SE $s_i$. Define

    \bar\theta = \frac{1}{n} \sum_i \hat\theta_i
    \qquad
    V_{\hat\theta} = \frac{1}{n-1} \sum_i \left( \hat\theta_i - \bar\theta \right)^2
    \qquad
    \bar{s}^2 = \frac{1}{n} \sum_i s_i^2
    \qquad
    V_{s^2} = \frac{1}{n-1} \sum_i \left( s_i^2 - \bar{s}^2 \right)^2

Performance of $\hat\theta$: Bias and empse

Bias is defined as $E(\hat\theta_i - \theta)$ and estimated by

    \text{estimated bias} = \bar\theta - \theta,
    \qquad
    \text{MC error} = \sqrt{ V_{\hat\theta} / n }    (1)

Precision is measured by the empirical standard deviation $\mathrm{SD}(\hat\theta_i)$ and is estimated by

    \text{empirical standard deviation} = \sqrt{ V_{\hat\theta} },
    \qquad
    \text{MC error} = \sqrt{ V_{\hat\theta} / \{ 2(n-1) \} }

assuming $\hat\theta$ is normally distributed, as then $(n-1) V_{\hat\theta} / \mathrm{var}(\hat\theta) \sim \chi^2_{n-1}$.
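These two quantities are easy to verify by hand. The sketch below assumes the long-format dataset of section 3 is in memory (point estimates in b, methods in method, true value 0.5); it is a check on the formulas rather than part of simsum itself.

. * bias and its MC error for the CC method, computed directly
. quietly summarize b if method == "CC"
. display "bias = " r(mean) - 0.5 "   MC error = " r(sd)/sqrt(r(N))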
Estimation method comparison: relprec

In a small change of notation, consider two estimators $\hat\theta_1$ and $\hat\theta_2$ with values $\hat\theta_{1i}$ and
$\hat\theta_{2i}$ in the $i$th simulated dataset. The relative gain in precision for $\hat\theta_2$ compared with
$\hat\theta_1$ is

    \text{relative gain in precision} = V_{\hat\theta_1} / V_{\hat\theta_2},
    \qquad
    \text{MC error} \approx 2 \, \frac{V_{\hat\theta_1}}{V_{\hat\theta_2}} \sqrt{ \frac{1 - \rho_{12}^2}{n - 1} }

where $\rho_{12}$ is the correlation of $\hat\theta_1$ with $\hat\theta_2$.

The MC error expression can be proved by observing the following: 1) $\mathrm{var}\{\log V_{\hat\theta_1}\} =
\mathrm{var}\{\log V_{\hat\theta_2}\} = 2/(n-1)$; 2) $\mathrm{var}\{\log (V_{\hat\theta_1} / V_{\hat\theta_2})\} = 4(1 - \rho_V)/(n-1)$, where
$\rho_V = \mathrm{corr}(V_{\hat\theta_1}, V_{\hat\theta_2})$; and 3) $\rho_V = \rho_{12}^2$. Result 3 may be derived by observing that
$V_{\hat\theta} \approx (1/n) \sum_i ( \hat\theta_i - \bar\theta )^2$, so that under a bivariate normal assumption for $(\hat\theta_1, \hat\theta_2)$,

    n \, \mathrm{cov}( V_{\hat\theta_1}, V_{\hat\theta_2} )
      \approx \mathrm{cov}\{ ( \hat\theta_1 - \bar\theta_1 )^2, ( \hat\theta_2 - \bar\theta_2 )^2 \}
      = \mathrm{cov}[ ( \hat\theta_1 - \bar\theta_1 )^2, E\{ ( \hat\theta_2 - \bar\theta_2 )^2 \mid \hat\theta_1 \} ]
      = \mathrm{cov}\{ ( \hat\theta_1 - \bar\theta_1 )^2, \rho_{12}^2 ( V_{\hat\theta_2} / V_{\hat\theta_1} ) ( \hat\theta_1 - \bar\theta_1 )^2 \}
      = 2 \rho_{12}^2 V_{\hat\theta_1} V_{\hat\theta_2}

where the third step follows because $\hat\theta_2 - \bar\theta_2$ given $\hat\theta_1$ is normal with mean
$\rho_{12} \sqrt{ V_{\hat\theta_2} / V_{\hat\theta_1} } \, ( \hat\theta_1 - \bar\theta_1 )$ and constant variance.
Performance of model-based SE $s_i$: modelse and relerror

The average model-based SE is (by default) computed on the variance scale, because
standard theory yields unbiased estimates of the variance, not of the standard deviation:

    \text{average model-based SE } \bar{s} = \sqrt{ \bar{s}^2 },
    \qquad
    \text{MC error} \approx \sqrt{ V_{s^2} / ( 4 n \bar{s}^2 ) }

using the Taylor series approximation $\mathrm{var}(X) \approx \mathrm{var}(X^2) / \{ 4 E(X)^2 \}$.

We can now compute the relative error in the model-based SE as

    \text{relative error} = \bar{s} / \sqrt{ V_{\hat\theta} } - 1    (2)

    \text{MC error} \approx \left( \bar{s} / \sqrt{ V_{\hat\theta} } \right) \sqrt{ V_{s^2} / ( 4 n \bar{s}^4 ) + 1 / \{ 2(n-1) \} }    (3)

assuming that $\bar{s}$ and $V_{\hat\theta}$ are approximately uncorrelated and using a further Taylor
approximation.

However, if the modelsemethod(mean) option is used, the formulas are

    \text{average model-based SE } \bar{s} = \frac{1}{n} \sum_i s_i,
    \qquad
    \text{MC error} = \sqrt{ \frac{1}{n} \sum_i ( s_i - \bar{s} )^2 }

with consequent adjustments to equations (2) and (3).
Joint performance of $\hat\theta_i$ and $s_i$: Cover and power

Let $z_{\alpha/2}$ be the critical value from the normal distribution, or (if the number of degrees
of freedom has been specified) the critical value from the appropriate $t$ distribution.
The coverage of a nominal $100(1-\alpha)\%$ confidence interval is

    \text{coverage } C = \frac{1}{n} \sum_i 1\left\{ | \hat\theta_i - \theta | < z_{\alpha/2} \, s_i \right\},
    \qquad
    \text{MC error} = \sqrt{ C(1-C)/n }

where $1(\cdot)$ is the indicator function. The power of a significance test at the $\alpha$ level is

    \text{power } P = \frac{1}{n} \sum_i 1\left\{ | \hat\theta_i | \geq z_{\alpha/2} \, s_i \right\},
    \qquad
    \text{MC error} = \sqrt{ P(1-P)/n }
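These proportions can be checked directly in Stata; the sketch below again assumes the section 3 variables b and se with true value 0.5 and uses the normal critical value for a 95% interval (an illustration, not simsum code).

. * empirical coverage and its MC error for the CC method
. generate byte covered = abs(b - 0.5) < invnormal(0.975)*se if method == "CC"
. quietly summarize covered
. display "coverage = " r(mean) "   MC error = " sqrt(r(mean)*(1 - r(mean))/r(N))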
Robust MC errors
Several of the MC errors presented above require a normality assumption. Alternative
approximations can be derived using an estimating-equations method. The empirical
standard deviation, $\sqrt{ V_{\hat\theta} }$, can be written as the solution $\hat\phi$ of the equation

    \sum_i \left\{ \frac{n}{n-1} \left( \hat\theta_i - \bar\theta \right)^2 - \phi^2 \right\} = 0

The relative precision of $\hat\theta_2$ compared with $\hat\theta_1$ can be written as the solution $\hat\phi$ of

    \sum_i \left\{ \left( \hat\theta_{1i} - \bar\theta_1 \right)^2 - ( \phi + 1 ) \left( \hat\theta_{2i} - \bar\theta_2 \right)^2 \right\} = 0

The relative error in the model-based SE can be written as the solution $\hat\phi$ of

    \sum_i \left\{ s_i^2 - ( \phi + 1 )^2 \left( \hat\theta_i - \bar\theta \right)^2 \right\} = 0

provided that the modelsemethod(rmse) method is used. (If modelsemethod(mean) is
specified, it is ignored in computing robust MC errors.) Ignoring the uncertainty in the
sample means $\bar\theta$, $\bar\theta_1$, and $\bar\theta_2$, each estimating equation is of the form

    \sum_i \left\{ T_i - f( \hat\phi ) B_i \right\} = 0

so the sandwich variance (White 1982) is given by

    \mathrm{var}\{ f( \hat\phi ) \} \approx \frac{ \sum_i \left\{ T_i - f( \hat\phi ) B_i \right\}^2 }{ \left( \sum_i B_i \right)^2 }

and using the delta method,

    \mathrm{var}( \hat\phi ) \approx \mathrm{var}\{ f( \hat\phi ) \} \, / \, f'( \hat\phi )^2

Finally, as an attempt to allow for uncertainty in the sample means, we multiply the
sandwich variance by $n/(n-1)$. A rationale is that this agrees exactly with (1) if the
method is applied to the MC error of the bias. However, most simulation studies are
large enough that this correction is unimportant.
5 Evaluations
Most of the formulas used by simsum to compute MC errors involve approximations, so
I evaluated them in two simulation studies.
5.1 Multiple imputation, revisited
First, I repeated 250 times the simulation study described in section 3. The data have
the same format as before, with a new variable, simno, identifying the 250 different
simulation studies. I ran simsum twice. In the first run, each quantity and its MC error
was computed in each simulation study:
. simsum b, true(0.5) methodvar(method) id(dataset) se(se) mcse by(simno)
> bias empse relprec modelse relerror cover power nolist clear
Reshaping data to wide format ...
Starting to process results ...
Results are now in memory.
The data are now held in memory, with one record for each statistic for each of the
250 simulation studies. The statistics are identified by the values of a newly created
numerical variable statnum, and the different simulation studies are still identified by
simno. The variables bCC, bMI_LOGT, and bMI_T contain the analysis results for the three
methods. MC errors are in variables suffixed with _mcse. In the second run, these values
are treated as ordinary output from a simulation study, and the average calculated MC
error is compared with the empirical MC error.
. simsum bCC bMI_LOGT bMI_T, sesuffix(_mcse) by(statnum) mcse gen(newstat)
> empse modelse relerror nolist clear
Warning: found 250 observations with missing values
Starting to process results ...
Results are now in memory.
The 250 observations with missing values refer to the relative precisions, which are
missing for the reference method (CC). Average calculated MC errors for each statistic are
compared in table 1 with empirical MC errors. The calculated MC errors are naturally
similar to those reported in the single simulation study above (some values have been
multiplied by 1,000 for convenience). Empirical MC errors are close to the model-based
values. The only exception is for coverage, where the model-based MC errors appear
rather small for methods CC and MI LOGT. This is likely to be a chance finding, because
there is no doubt about the accuracy of the model-based MC formula for this statistic.
Table 1. Simulation study comparing three ways to handle incomplete covariates in a Cox model:
Comparison of average calculated MC error (Calc) with empirical MC error (Emp) for various statistics

                      CC method                  MI_LOGT method             MI_T method
Statistic¹            Emp    Calc   % error²     Emp    Calc   % error²     Emp    Calc   % error²
Bias ×1000            4.79   4.74   -1.1 (4.4)   4.23   4.17   -1.3 (4.4)   4.22   4.18   -1.0 (4.4)
EmpSE ×1000           3.37   3.36   -0.3 (4.5)   3.11   2.95   -5.2 (4.3)   3.11   2.96   -4.9 (4.3)
RelPrec               .      .      .            4.21   3.97   -5.7 (4.2)   4.13   3.97   -4.1 (4.3)
ModSE ×1000           0.52   0.50   -3.1 (4.3)   0.59   0.59    0.3 (4.5)   0.60   0.59   -3.1 (4.4)
RelErr                2.16   2.22    2.9 (4.6)   2.40   2.34   -2.6 (4.4)   2.43   2.33   -4.2 (4.3)
Cover                 0.62   0.70   13.5 (5.1)   0.61   0.68   11.3 (5.0)   0.67   0.68    1.8 (4.6)
Power                 0.74   0.73   -1.4 (4.4)   0.59   0.59   -0.4 (4.5)   0.60   0.59   -2.4 (4.4)

1. Statistics are abbreviated as follows: Bias, bias in point estimate; EmpSE, empirical SE; RelPrec,
% gain in precision relative to method CC; ModSE, RMS model-based SE; RelErr, relative % error in SE;
Cover, coverage of nominal 95% confidence interval; Power, power of 5% level test.
2. Relative % error in average calculated SE, with its MC error in parentheses.
5.2 Nonnormal joint distributions
In a second evaluation, I simulated 100,000 datasets of size $n = 100$ from the model
$X \sim N(0, 1)$, $Y \sim \mathrm{Bernoulli}(0.5)$. I then estimated the parameter $\beta$ in the logistic regression
model

    \mathrm{logit}\, P(Y = 1 \mid X) = \alpha + \beta X    (4)

in two ways: 1) $\hat\beta_{LR}$ was the maximum likelihood estimate from fitting the logistic
regression model (4), and 2) $\hat\beta_{LDA}$ was the estimate from linear discriminant analysis
(LDA), fitting the linear regression model $X \mid Y \sim N(\gamma + \delta Y, \sigma^2)$ and taking
$\hat\beta_{LDA} = \hat\delta / \hat\sigma^2$.

The 100,000 datasets were divided into 100 simulation studies each of 1,000 simulated
datasets. The quantities described above and their SEs were calculated for each
simulation study, except that power for testing $\beta = 0$ was not computed because this
null hypothesis was true. Finally, the empirical MC error of each quantity across simulation
studies was compared with the average MC error estimated within each simulation
study.

Results are shown in table 2. The calculated MC error is adequate for all quantities
except for the relative precision of LDA compared with logistic regression, for which the
calculated SE is some three times too small. This appears to be due to the nonnormal
joint distribution of the parameter estimates shown in figure 1. The robust MC errors
perform well in all cases.
Table 2. Simulation study comparing LDA with logistic regression: Comparison of
empirical with average calculated MC errors for various statistics

                                            MC error
Quantity              Method     Mean     Empirical   Calculated   Calculated
                                                      (normal)     (robust)
Bias ×1000            Logistic    0.41      6.79       6.71          .
                      LDA         0.41      6.66       6.57          .
Empirical SE ×1000    Logistic  212.00      4.78       4.74         5.07
                      LDA       207.86      4.69       4.65         4.97
% gain in precision   Logistic       .         .          .            .
                      LDA        4.027     0.124      0.048        0.131
Model SE ×1000        Logistic  207.32      0.51       0.51          .
                      LDA       203.12      0.48       0.47          .
% error in model SE   Logistic    2.16      2.13       2.20         2.26
                      LDA         2.23      2.18       2.20         2.30
% coverage            Logistic   95.36      0.60       0.66          .
                      LDA        94.70      0.64       0.71          .

Figure 1. Scatterplot of the difference $\hat\beta_{LDA} - \hat\beta_{LR}$ against the average
$(\hat\beta_{LDA} + \hat\beta_{LR})/2$ in 2,000 simulated datasets
6 Discussion
I hope that simsum will help statisticians improve the reporting of their simulation
studies. In particular, I hope simsum will help them think about and report MC errors.
If MC errors are too large to enable the desired conclusions to be drawn, then it is
usually straightforward to increase the sample size, a luxury rarely available in applied
research.
For three statistics (empirical SE, and relative precision and relative error in model-
based SE), I have proposed two approximate MC error methods, one based on a normality
assumption and one based on a sandwich estimator. The MC error should only be taken
as a guide, so errors of some 10–20% in calculating the MC error are of little importance.
In most cases, both MC error methods performed adequately. However, the normality-
based MC error was about three times too small when evaluating the relative precision of
two estimators with a highly nonnormal joint distribution (figure 1). It is good practice
to examine the marginal and joint distributions of parameter estimates in simulation
studies, and this practice should be used to guide the choice of MC error method.
Other methods are available for estimating MC errors. Koehler, Brown, and Haneuse
(2009) proposed more computationally intensive techniques that are available for im-
plementation in R. Other software (Doornik and Hendry 2009) is available with an
econometric focus.
7 Acknowledgment
This work was supported by MRC grant U.1052.00.006.
8 References
Burton, A., D. G. Altman, P. Royston, and R. L. Holder. 2006. The design of simulation
studies in medical statistics. Statistics in Medicine 25: 4279–4292.
Doornik, J. A., and D. F. Hendry. 2009. Interactive Monte Carlo Experimentation in
Econometrics Using PcNaive 5. London: Timberlake Consultants Press.
Koehler, E., E. Brown, and S. J.-P. A. Haneuse. 2009. On the assessment of Monte Carlo
error in simulation-based statistical analyses. American Statistician 63: 155–162.
Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd
ed. Hoboken, NJ: Wiley.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.
———. 2009. Multiple imputation of missing values: Further update of ice, with an
emphasis on categorical variables. Stata Journal 9: 466–477.
van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing
blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica
50: 1–25.
White, I. R., and P. Royston. 2009. Imputing missing covariate values for the Cox
model. Statistics in Medicine 28: 1982–1998.
About the author
Ian R. White is a program leader at the MRC Biostatistics Unit in Cambridge, United Kingdom.
His research interests focus on handling missing data, noncompliance, and measurement error
in the analysis of clinical trials, observational studies, and meta-analysis. He frequently uses
simulation studies.
The Stata Journal (2010)
10, Number 3, pp. 386–394
Projection of power and events in clinical trials
with a time-to-event outcome
Patrick Royston
Hub for Trials Methodology Research
MRC Clinical Trials Unit and University College London
London, UK
pr@ctu.mrc.ac.uk
Friederike M.-S. Barthel
Oncology Research & Development
GlaxoSmithKline
Uxbridge, UK
FriederikeB@ctu.mrc.ac.uk
Abstract. In 2005, Barthel, Royston, and Babiker presented a menu-driven Stata
program under the generic name of ART (assessment of resources for trials) to
calculate sample size and power for complex clinical trial designs with a time-to-
event or binary outcome. In this article, we describe a Stata tool called ARTPEP,
which is intended to project the power and events of a trial with a time-to-event
outcome into the future given patient accrual figures so far and assumptions about
event rates and other defining parameters. ARTPEP has been designed to work
closely with the ART program and has an associated dialog box. We illustrate the
use of ARTPEP with data from a phase III trial in esophageal cancer.
Keywords: st0013_2, artpep, artbin, artsurv, artmenu, randomized controlled trial,
time-to-event outcome, power, number of events, projection, ARTPEP, ART
1 Introduction
Barthel, Royston, and Babiker (2005) presented a menu-driven Stata program under
the generic name of ART (assessment of resources for trials) to calculate sample size and
power for complex clinical trial designs with a time-to-event or binary outcome. Briefly,
the features of ART include multiarm trials, dose–response trends, arbitrary failure-time
distributions, nonproportional hazards, nonuniform rates of patient entry, loss to
follow-up, and possible changes from allocated treatment. A full report on the methodology
and its performance, in particular regarding loss to follow-up, nonproportional
hazards, and treatment crossover, is given by Barthel et al. (2006).
In this article, we concentrate on a new tool that addresses a practical issue in trials
with a time-to-event outcome. Because of staggered entry of patients and the gradual
maturing of the data, the accumulation of events from the date the trial opens is a
process that occurs over a relatively long period of time and with a variable course.
Trials are planned and their resources are assigned under certain critical assumptions.
© 2010 StataCorp LP st0013_2
If those assumptions are unrealistic, timely completion of the trial may be threatened.
Because the cumulative number of events is the key indicator of trial maturity and is
the parameter targeted in the sample-size calculation, it is of considerable interest and
relevance to monitor and project this number at particular points during the trial.
The new tool is called ARTPEP (ART projection of events and power). ARTPEP
comprises an ado-file (artpep) and an associated dialog box. It works in conjunction
with the ART system, of which the latest update is included with this article.
2 Example: A trial in advanced esophageal cancer
2.1 Sample-size calculation using ART
As an example, we describe sample-size calculation and ARTPEP analysis of a typical
cancer trial. The OE05 trial in advanced esophageal carcinoma is coordinated by the
MRC Clinical Trials Unit. The protocol is available online at http://www.ctu.mrc.ac.uk/
plugins/StudyDisplay/protocols/OE05%20Protocol%20Version%205%2031st%20July
%202008.pdf. The design, which comprises two randomized groups of patients with
equal allocation, aims to test the hypothesis that a new chemotherapy regimen, in
conjunction with surgery, improves overall survival at 3 years.
According to the protocol, the probability of 3-year survival in this patient group
is 30%, and the trial has 82% power at the 5% two-sided significance level to detect
an improvement in overall survival to 38%. The overall sample size is stated to be 842
patients, and the required number of events is 673. The plan is to recruit patients over
6 years and to follow up with them for a further 2 years before performing the definitive
analysis of the outcome (overall survival).
The description in the protocol provides nearly all the ingredients for an ART sample-size
and power calculation. The only missing item is the target hazard ratio, which is
ln(0.38)/ln(0.30) = 0.80 under proportional hazards of the treatment effect (a standard
assumption, under which S1(t) = S0(t)^HR and hence HR = ln S1(t)/ln S0(t) at any fixed
time t). We first use the artsurv command (Barthel, Royston, and Babiker 2005)
to verify the sample-size calculation and to set up some of the parameter values needed
by ARTPEP. We supply the other design features, and then we run the artsurv command
to compute the power and events:
. artsurv, method(l) nperiod(8) ngroups(2) edf0(0.3, 3) hratio(1, 0.80) n(842)
> alpha(0.05) recrt(6)
ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.7, 19 October 2009)
A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,
MRC Clinical Trials Unit, London NW1 2DA, UK.
Type of trial Superiority - time-to-event outcome
Statistical test assumed Unweighted logrank test (local)
Number of groups 2
Allocation ratio Equal group sizes
Total number of periods 8
Length of each period One year
Survival probs per period (group 1) 0.669 0.448 0.300 0.201 0.134 0.090
0.060 0.040
Survival probs per period (group 2) 0.725 0.526 0.382 0.277 0.201 0.146
0.106 0.077
Number of recruitment periods 6
Number of follow-up periods 2
Method of accrual Uniform
Recruitment period-weights 1 1 1 1 1 1 0 0
Hazard ratios as entered (groups 1,2) 1, 0.80
Alpha 0.050 (two-sided)
Power (calculated) 0.824
Total sample size (designed) 842
Expected total number of events 673
Apart from small, unimportant differences, the protocol power (0.82) and the number
of events (673) are consistent with ART's results.
2.2 Analysis with ARTPEP
To run ARTPEP successfully, three preliminary steps are required:
1. You must activate the ART and ARTPEP items on the User menu by typing the
command artmenu on.
2. You must compute the relevant sample size for the trial using either the ART
dialog box or the artsurv command. This automatically sets up a global macro
called $S_ARTPEP whose contents are used by the artpep command. (A slightly
more convenient alternative with the same result is to use the ART Settings...
button on the ARTPEP dialog box to set up the necessary quantities for ART
without having to run ART or artsurv separately.)
3. To set up additional parameters that ARTPEP needs, you must use the ARTPEP
dialog box, either by typing db artpep or by selecting User > ART > Artpep
from the menu.
As a worked example, we now imagine that the OE05 trial has been running for
1 year and has accrued 100 patients so far. Assuming the survival distribution to be
correct, when may we expect to complete the trial (that is, obtain the required number
of events)? To answer this question, we complete the three steps described above. The
resulting empty dialog box is shown in figure 1.
Figure 1. Incomplete ARTPEP dialog box
We now explain the various items that the dialog box needs. The name of the
corresponding option for the artpep command is given in square brackets:
ART Settings...: As already mentioned, this button may be used to set up the
parameters of an ART run if that has not been done already. It accesses the ART
dialog box.
Patients recruited in each period so far [pts]: A period here is 1 year, and we
have recruited 100 patients in the first period. We therefore enter 100 for this
item.
Additional patients to be recruited [epts]: To get to the 842 patients (we will
use 850), we hope to recruit about 150 patients per year for the next 5 years,
making a total of 6 years of planned recruitment. We enter 150. The program knows
the period in which recruitment is to cease and, by default, repeats the number
150 over the next 5 periods. If we had expected a differing recruitment rate (say,
accelerating toward the end of the trial), we could have entered a different number
of patients to be recruited in each period.
Number of periods over which to project [eperiods]: Let us say we wish to project
events and power over the next 10 years. We enter 10.
Period in which recruitment ceases [stoprecruit]: Here enter the number of periods
after which recruitment is to cease. The number must be no smaller than the
number of periods implied by Patients recruited in each period so far [pts]. If
the option is left blank, it is assumed that recruitment continues indefinitely. As
already noted, we wish to stop recruitment at 850 patients, which we will achieve
by the end of period 6. We therefore enter 6 for this item.
Period to start reporting projections [startperiod]: Usually, we want to enter 1
here, signifying the start of the trial. By default, if the item is left blank, the
program assumes that the current period is intended. We enter 1.
Save using filename [using]: The numerical results of the artpep run can be saved
to a .dta file for a permanent record or for plotting. We leave the item blank.
Start date of trial (ddmmmyyyy) [datestart]: If we enter the start date, the output
from artpep is conveniently labeled with the calendar date of the end of each
period. We recommend using this option. We enter 01jan2009.
The completed ARTPEP dialog box is shown in figure 2.
Figure 2. Completed ARTPEP dialog box for the OE05 trial
After submitting the above setup to Stata (version 10 or later), we get the following
result:
. artpep, pts(100) $S_ARTPEP epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009)
Date year #pats #C-events #events Power
31dec2009 1 100 9 17 0.06498
31dec2010 2 250 36 66 0.14480
31dec2011 3 400 79 146 0.26850
31dec2012 4 550 132 247 0.41622
31dec2013 5 700 193 362 0.56360
31dec2014 6 850 258 488 0.69209
31dec2015 7 850 314 597 0.77737
31dec2016 8 850 351 673 0.82423
31dec2017 9 850 375 726 0.85155
31dec2018 10 850 392 763 0.86825
31dec2019 11 850 403 789 0.87882
The program reports the total number of events (#events) and the number of events in
the control arm (#C-events), which are often of interest. The required total number of
events (that is, both arms combined) of 673 is projected to be reached on 31 December
2016, the end of period 8. We expect 351 events in the control arm by that time. The
projection is not surprising because the accrual figures that have been entered more or
less agree with the trial plan. Nevertheless, the output shows us the expected progress
of the number of events and the power over time. The trial may be monitored (and the
ARTPEP analysis updated) to follow its progress.
The dialog box has, as usual, created and run the necessary artpep command line.
The second item in the command is $S_ARTPEP. As already mentioned, it contains
additional information needed by artpep. On displaying its contents, we find
. display "$S_ARTPEP"
alpha(.05) aratios() hratio(1, 0.80) ngroups(2) ni(0) onesided(0) trend(0)
> tunit(1) edf0(0.3, 3) median(0) method(l)
The key pieces of information here are hratio(1, 0.80) and edf0(0.3, 3), which
specify the hazard ratios in groups 1 and 2, and the survival function in group 1,
respectively. All the other items are default values and could be omitted in the present
example. The present example could have been run directly from the command line as
follows:
. artpep, pts(100) edf0(0.3, 3) epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
2.3 Sensitivity analysis of the event rate
We have assumed a 30% survival probability 3 years after recruitment. Suppose, in fact,
that the patients do better than that: their 3-year survival is 40% instead. What effect
would that have on the power and events timeline?
We need only change the edf0() option to edf0(0.4, 3):
. artpep, pts(100) epts(150) edf0(0.4, 3) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
Date year #pats #C-events #events Power
31dec2009 1 100 7 13 0.05869
31dec2010 2 250 29 53 0.12410
31dec2011 3 400 65 119 0.22732
31dec2012 4 550 111 205 0.35714
31dec2013 5 700 165 306 0.49586
31dec2014 6 850 224 419 0.62612
31dec2015 7 850 277 522 0.72135
31dec2016 8 850 316 600 0.77974
31dec2017 9 850 345 660 0.81685
31dec2018 10 850 366 705 0.84128
31dec2019 11 850 382 739 0.85784
The time to observe the required number of events has advanced by more than 1 year,
to period 9 (31dec2017).
3 Syntax
Once you have gained a little experience with using the ARTPEP dialog box, you will
find it more natural and efficient to use the command line. The syntax of artpep is as
follows:

    artpep [using filename], pts(numlist) edf0(slist0) [epts(numlist)
        eperiods(#) stoprecruit(#) startperiod(#) datestart(ddmmmyyyy)
        replace artsurv_options]
4 Options
pts(numlist) is required. numlist specifies the number of patients recruited in each
period since the start of the trial, that is, since randomization. See help on artsurv
for the definition of a period. The number of items in numlist defines the number
of periods of recruitment so far. For example, pts(23 12 25) specifies three initial
periods of recruitment, with recruitment of 23 patients in period 1, 12 in period 2,
and 25 in period 3. The current period would be period 3 and would be demarcated
by parallel lines in the output.
edf0(slist0) is required and gives the survival function in the control group (group 1).
This need not be one of the survival distributions to be compared in the trial, unless
hratio() = 1 for at least one of the groups. The format of slist0 is #1 [#2 ... #r,
#1 #2 ... #r]. Thus edf0(p1 p2 ... pr, t1 t2 ... tr) gives the value pi for the survival
function for the event time at the end of time period ti, i = 1, ..., r. Instantaneous
event rates (that is, hazards) are assumed constant within time periods; that is, the
distribution of time-to-event is assumed to be piecewise exponential. When used in
a given calculation up to period T, tr may validly be less than, equal to, or greater
than T. If tr ≤ T, the rules described in the edf0() option of artsurv are applied
to compute the survival function at all periods ≤ T. If tr > T, the same calculation
is used but estimated survival probabilities for periods > T are not used in the
calculation at T, although they may of course be used in calculations (for example,
projections of sample size and events) for periods later than T. Be aware that use
of the median() option (an alternative to edf0()) and the fp() option of artsurv
may modify the effects and interpretation of edf0(). A brief sketch of a multiperiod
edf0() specification is given after the option descriptions below.
epts(numlist) specifies in numlist the number of additional patients to be recruited in
each period following the recruitment phase defined by the pts() option. For example,
pts(23 12 25) epts(30 30) would specify three initial periods of recruitment
followed by two further periods. A projection of events and power is required over
the two further periods. The initial recruitment is of 23 patients in period 1, 12 in
period 2, and 25 in period 3; in each of periods 4 and 5, we expect to recruit an
additional 30 patients. If the number of items in (or implied by expanding) numlist
is less than that specified by pts(), the final value in numlist is replicated as necessary
to all subsequent periods. If epts() is not given, the default is that the mean
of the numbers of patients specified in pts() is used for all projections.
eperiods(#) specifies the number of future periods over which projection of power and
number of events is to be calculated. The default is eperiods(1).
stoprecruit(#) specifies the number of periods after which recruitment is to cease. #
must be no smaller than the number of periods of recruitment implied by pts(). The
default is stoprecruit(0), meaning to continue recruiting indefinitely (no follow-up
phase).
startperiod(#) specifies # as the period in which to start reporting the projections
of events and power. To report from the beginning of the trial, specify
startperiod(1). Note that startperiod() does not affect the period at which
the calculations are started, only how the results are reported. The default # is the
last period defined by pts().
datestart(ddmmmyyyy) signifies the opening date of the trial (that is, when recruitment
started), for example, datestart(14oct2009). The date of the end of each
period is used to label the output and is stored in filename if using is specified.
replace allows filename to be replaced if it already exists.
artsurv options are any of the options of artsurv except recrt(), nperiod(), power(),
and n().
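As a brief hedged sketch of a multiperiod edf0() specification (not run for this article), the control-group survival in the worked example of section 2 might alternatively be supplied period by period, using the period-1 to period-3 survival probabilities 0.669, 0.448, and 0.300 reported by artsurv:

. artpep, pts(100) epts(150) edf0(0.669 0.448 0.300, 1 2 3) eperiods(10)
>     startperiod(1) stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)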
5 Final comments
We have illustrated ARTPEP with a basic example. However, ARTPEP understands
the more complex options of artsurv. Therefore, complex features, including loss to
follow-up, treatment crossover, and nonproportional hazards, can be allowed for in the
projection of power and events.
Sometimes it is desirable to make projections on a finer time scale than 1 year,
for example, in 3- or 6-month periods. This is easily done by adjusting the period
parameters used in ART and ARTPEP.
6 References
Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of
sample size and power for multi-arm survival trials allowing for non-uniform accrual,
non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25:
2521–2542.
Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex
sample size calculation in randomized controlled trials with a survival or a binary
outcome: Update. Stata Journal 5: 123–129.
About the authors
Patrick Royston is a medical statistician with 30 years of experience, with a strong interest in
biostatistical methods and in statistical computing and algorithms. He now works in cancer
clinical trials and related research issues. Currently, he is focusing on problems of model
building and validation with survival data, including prognostic factor studies; on parametric
modeling of survival data; on multiple imputation of missing values; and on novel clinical trial
designs.
Friederike Barthel is a senior statistician in Oncology Research & Development at Glaxo-
SmithKline. Previously, she worked at the MRC Clinical Trials Unit and the Institute of
Psychiatry. Her current research interests include sample-size issues, particularly concerning
multistage, multiarm trials, microarray study analyses, and competing risks. Friederike has
taught undergraduate courses in statistics at the University of Westminster and at Kingston
University.
The Stata Journal (2010)
10, Number 3, pp. 395–407
metaan: Random-effects meta-analysis
Evangelos Kontopantelis
National Primary Care
Research & Development Centre
University of Manchester
Manchester, UK
e.kontopantelis@manchester.ac.uk
David Reeves
Health Sciences Primary Care
Research Group
University of Manchester
Manchester, UK
david.reeves@manchester.ac.uk
Abstract. This article describes the new meta-analysis command metaan, which
can be used to perform fixed- or random-effects meta-analysis. Besides the standard
DerSimonian and Laird approach, metaan offers a wide choice of available
models: maximum likelihood, profile likelihood, restricted maximum likelihood,
and a permutation model. The command reports a variety of heterogeneity measures,
including Cochran's Q, I², H²_M, and the estimated between-studies variance
τ². A forest plot and a graph of the maximum likelihood function can also be
generated.
Keywords: st0201, metaan, meta-analysis, random effect, effect size, maximum
likelihood, profile likelihood, restricted maximum likelihood, REML, permutation
model, forest plot
1 Introduction
Meta-analysis is a statistical methodology that integrates the results of several inde-
pendent clinical trials in general that are considered by the analyst to be combinable
(Huque 1988). Usually, this is a two-stage process: in the first stage, the appropriate
summary statistic for each study is estimated; then in the second stage, these statis-
tics are combined into a weighted average. Individual patient data (IPD) methods
exist for combining and meta-analyzing data across studies at the individual patient
level. An IPD analysis provides advantages such as standardization (of marker values,
outcome definitions, etc.), follow-up information updating, detailed data-checking, sub-
group analyses, and the ability to include participant-level covariates (Stewart 1995;
Lambert et al. 2002). However, individual observations are rarely available; addition-
ally, if the main interest is in mean effects, then the two-stage and the IPD approaches
can provide equivalent results (Olkin and Sampson 1998).
This article concerns itself with the second stage of the two-stage approach to meta-
analysis. At this stage, researchers can select between two main approaches, the fixed-effects
(FE) model or the random-effects model, in their efforts to combine the study-level
summary estimates and calculate an overall average effect. The FE model is simpler
and assumes the true effect to be the same (homogeneous) across studies. However, homogeneity
has been found to be the exception rather than the rule, and some degree of
true effect variability between studies is to be expected (Thompson and Pocock 1991).
Two sorts of between-studies heterogeneity exist: clinical heterogeneity stems from dif-
© 2010 StataCorp LP st0201
ferences in populations, interventions, outcomes, or follow-up times, and methodological
heterogeneity stems from differences in trial design and quality (Higgins and Green 2009;
Thompson 1994). The most common approach to modeling the between-studies variance
is the model proposed by DerSimonian and Laird (1986), which is widely used in generic
and specialist meta-analysis statistical packages alike. In Stata, the DerSimonian–Laird
(DL) model is used in the most popular meta-analysis commands: the recently updated
metan and the older but still useful meta (Harris et al. 2008). However, the
between-studies variance component can be estimated using more-advanced (and computationally
expensive) iterative techniques: maximum likelihood (ML), profile likelihood
(PL), and restricted maximum likelihood (REML) (Hardy and Thompson 1996;
Thompson and Sharp 1999). Alternatively, the estimate can be obtained using nonparametric
approaches, such as the permutations (PE) model proposed by Follmann
and Proschan (1999).
We have implemented these models in metaan, which performs the second stage
of a two-stage meta-analysis and offers alternatives to the DL random-effects model.
The command requires the studies' effect estimates and standard errors as input. We
have also created metaeff, a command that provides support in the first stage of the
two-stage process and complements metaan. The metaeff command calculates for each
study the effect size (standardized mean difference) and its standard error from the
input parameters supplied by the user, using one of the models described in the Cochrane
Handbook for Systematic Reviews of Interventions (Higgins and Green 2006). For more
details, type ssc describe metaeff in Stata or see Kontopantelis and Reeves (2009).
The metaan command does not offer the plethora of options metan does for inputting
various types of binary or continuous data. Other useful features in metan
(unavailable in metaan) include stratified meta-analysis, user-input study weights, vaccine
efficacy calculations, the Mantel–Haenszel FE method, L'Abbé plots, and funnel
plots. The REML model, assumed to be the best model for fitting a random-effects
meta-analysis model even though this assumption has not been thoroughly investigated
(Thompson and Sharp 1999), has recently been coded in the updated meta-regression
command metareg (Harbord and Higgins 2008) and the new multivariate
random-effects meta-analysis command mvmeta (White 2009). However, the output
and options provided by metaan can be more useful in the univariate meta-analysis
context.
2 The metaan command
2.1 Syntax
    metaan varname1 varname2 [if] [in], {fe | dl | ml | reml | pl | pe} [varc
        label(varname) forest forestw(#) plplot(string)]

where
varname1 is the study effect size.
varname2 is the study effect variation, with standard error used as the default.
2.2 Options
fe ts an FE model that assumes there is no heterogeneity between the studies. The
model assumes that within-study variances may dier, but that there is homogeneity
of eect size across studies. Often the homogeneity assumption is unlikely, and
variation in the true eect across studies is to be expected. Therefore, caution is
required when using this model. Reported heterogeneity measures are estimated
using the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
dl ts a DL random-eects model, which is the most commonly used model. The model
assumes heterogeneity between the studies; that is, it assumes that the true eect
can be dierent for each study. The model assumes that the individual-study true
eects are distributed with a variance
2
around an overall true eect, but the model
makes no assumptions about the form of the distribution of either the within-study
or the between-studies eects. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
ml ts an ML random-eects model. This model makes the additional assumption
(necessary to derive the log-likelihood function, and also true for reml and pl, below)
that both the within-study and the between-studies eects have normal distributions.
It solves the log-likelihood function iteratively to produce an estimate of the between-
studies variance. However, the model does not always converge; in some cases, the
between-studies variance estimate is negative and set to zero, in which case the
model is reduced to an fe specication. Estimates are reported as missing in the
event of nonconvergence. Reported heterogeneity measures are estimated using the
ml option. You must specify one of fe, dl, ml, reml, pl, or pe.
reml ts an REML random-eects model. This model is similar to ml and uses the same
assumptions. The log-likelihood function is maximized iteratively to provide esti-
mates, as in ml. However, under reml, only the part of the likelihood function that
is location invariant is maximized (that is, maximizing the portion of the likelihood
that does not involve if estimating
2
, and vice versa). The model does not always
converge; in some cases, the between-studies variance estimate is negative and set
to zero, in which case the model is reduced to an fe specication. Estimates are re-
ported as missing in the event of nonconvergence. Reported heterogeneity measures
are estimated using the reml option. You must specify one of fe, dl, ml, reml, pl,
or pe.
pl ts a PL random-eects model. This model uses the same likelihood function as ml
but takes into account the uncertainty associated with the between-studies variance
estimate when calculating an overall eect, which is done by using nested iterations
to converge to a maximum. The condence intervals (CIs) provided by the model
are asymmetric, and hence so is the diamond in the forest plot. However, the model
398 metaan: Random-eects meta-analysis
does not always converge. Values that were not computed are reported as missing.
Reported heterogeneity measures are estimated using the ml option because and

2
, the eect and between-studies variance estimates, are the same. Only their
CIs are reestimated. The model also provides a CI for the between-studies variance
estimate. You must specify one of fe, dl, ml, reml, pl, or pe.
pe ts a PE random-eects model. This model can be described in three steps. First, in
line with a null hypothesis that all true study eects are zero and observed eects
are due to random variation, a dataset of all possible combinations of observed
study outcomes is created by permuting the sign of each observed eect. Then, the
dl model is used to compute an overall eect for each combination. Finally, the
resulting distribution of overall eect sizes is used to derive a CI for the observed
overall eect. The CI provided by the model is asymmetric, and hence so is the
diamond in the forest plot. Reported heterogeneity measures are estimated using
the dl option. You must specify one of fe, dl, ml, reml, pl, or pe.
varc species that the study-eect variation variable, varname2, holds variance values.
If this option is omitted, metaan assumes that the variable contains standard-error
values (the default).
label(varname) selects labels for the studies. One or two variables can be selected
and converted to strings. If two variables are selected, they will be separated by a
comma. Usually, the author names and the year of study are selected as labels. The
nal string is truncated to 20 characters.
forest requests a forest plot. The weights from the specied analysis are used for
plotting symbol sizes (pe uses dl weights). Only one graph output is allowed in each
execution.
forestw(#) requests a forest plot with adjusted weight ratios for better display. The
value can be in the [1, 50] range. For example, if the largest to smallest weight ratio
is 60 and the graph looks awkward, the user can use this command to improve the
appearance by requesting that the weight be rescaled to a largest/smallest weight
ratio of 30. Only the weight squares in the plot are affected, not the model. The CIs
in the plot are unaffected. Only one graph output is allowed in each execution.
plplot(string) requests a plot of the likelihood function for the average effect or
between-studies variance estimate of the ml, pl, or reml model. The plplot(mu) option
fixes the average effect parameter to its model estimate in the likelihood function
and creates a two-way plot of τ² versus the likelihood function. The plplot(tsq)
option fixes the between-studies variance to its model estimate in the likelihood
function and creates a two-way plot of μ versus the likelihood function. Only one
graph output is allowed in each execution.
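To fix ideas, the options combine as in the following sketch of typical calls (not verbatim
output); it assumes a dataset holding the study effect sizes and their standard errors in the
variables effsize and se, as in the example of section 4.
. metaan effsize se, dl label(study) forest
. metaan effsize se, reml forestw(30)
. metaan effsize se, pl plplot(mu)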
2.3 Saved results
metaan saves the following in r() (some varying by selected model):
Scalars
    r(eff)       effect size
    r(effvar)    effect variance
    r(efflo)     effect size, lower 95% CI
    r(effup)     effect size, upper 95% CI
    r(Q)         Cochran's Q value
    r(df)        degrees of freedom
    r(Qpval)     p-value for Cochran's Q
    r(Isq)       heterogeneity measure I²
    r(Hsq)       heterogeneity measure H²_M
In addition to the standard results, metaan, fe and metaan, dl save the following in
r():
Scalars
    r(tausq_dl)    τ̂², from the DL model
In addition to the standard results, metaan, ml saves the following in r():
Scalars
    r(tausq_dl)    τ̂², from the DL model
    r(tausq_ml)    τ̂², from the ML model
    r(conv_ml)     ML convergence information
In addition to the standard results, metaan, reml saves the following in r():
Scalars
    r(tausq_dl)      τ̂², from the DL model
    r(tausq_reml)    τ̂², from the REML model
    r(conv_reml)     REML convergence information
In addition to the standard results, metaan, pl saves the following in r():
Scalars
    r(tausq_dl)      τ̂², from the DL model
    r(tausq_pl)      τ̂², from the PL model
    r(tausqlo_pl)    τ̂² (PL), lower 95% CI
    r(tausqup_pl)    τ̂² (PL), upper 95% CI
    r(conv_ml)       ML convergence information
    r(cloeff_pl)     convergence information, PL effect size (lower CI)
    r(cupeff_pl)     convergence information, PL effect size (upper CI)
    r(ctausqlo_pl)   convergence information, PL τ̂² (lower CI)
    r(ctausqup_pl)   convergence information, PL τ̂² (upper CI)
In addition to the standard results, metaan, pe saves the following in r():
Scalars
    r(tausq_dl)    τ̂², from the DL model
    r(exec_pe)     information on PE execution
In each case, heterogeneity measures H²_M and I² are computed using the returned
between-studies variance estimate τ̂². Convergence and PE execution information is returned
as 1 if successful and as 0 otherwise. r(effvar) cannot be computed for PE. r(effvar)
is the same for ML and PL, but for PL the CIs are amended to take into account the
τ² uncertainty.
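As a quick illustration (a sketch, not verbatim output), the returned scalars can be inspected
directly after a run; the names are those listed in the saved-results tables above.
. metaan effsize se, ml
. return list
. display "I^2 = " r(Isq) ",  tau^2 (ML) = " r(tausq_ml)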
3 Methods
The metaan command offers six meta-analysis models for calculating a mean effect estimate
and its CIs: FE model, random-effects DL method, ML random-effects model, REML
random-effects model, PL random-effects model, and PE method using a DL random-effects
model. Models of the random-effects family take into account the identified
between-studies variation, estimate it, and usually produce wider CIs for the overall
effect than would an FE analysis. Brief descriptions of the models have been provided
in section 2.2. In this section, we will provide a few more details and practical advice in
selecting among the models. Their complexity prohibits complete descriptions in this
article, and users wishing to look into model details are encouraged to refer to the original
articles that described them (DerSimonian and Laird 1986; Hardy and Thompson
1996; Follmann and Proschan 1999; Brockwell and Gordon 2001).
The three ML models are iterative and usually computationally expensive. ML and PL
derive the μ (overall effect) and τ² estimates by maximizing the log-likelihood function
in (1) under different conditions. REML estimates τ² and μ by maximizing the restricted
log-likelihood function in (2).
log L(μ, τ²) = −(1/2) [ Σ_{i=1}^{k} log{2π(σ̂²_i + τ²)} + Σ_{i=1}^{k} (ŷ_i − μ)²/(σ̂²_i + τ²) ],   τ² ≥ 0    (1)

log L*(μ, τ²) = −(1/2) [ Σ_{i=1}^{k} log{2π(σ̂²_i + τ²)} + Σ_{i=1}^{k} (ŷ_i − μ)²/(σ̂²_i + τ²) ] − (1/2) log Σ_{i=1}^{k} 1/(σ̂²_i + τ²),   τ² ≥ 0    (2)

where k is the number of studies to be meta-analyzed, ŷ_i and σ̂²_i are the effect and
variance estimates for study i, and μ is the overall effect estimate.
ML follows the simplest approach, maximizing (1) in a single iteration loop. A criticism
of ML is that it takes no account of the loss in degrees of freedom that results from
estimating the overall effect. REML derives the likelihood function in a way that adjusts
for this and removes downward bias in the between-studies variance estimator. A useful
description for REML, in the meta-analysis context, has been provided by Normand
(1999). PL uses the same likelihood function as ML, but uses nested iterations to take
into account the uncertainty associated with the between-studies variance estimate when
calculating an overall effect. By incorporating this extra factor of uncertainty, PL yields
CIs that are usually wider than for DL and also are asymmetric. PL has been shown to
outperform DL in various scenarios (Brockwell and Gordon 2001).
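For reference, the noniterative DL estimate that the iterative models are compared against can
be computed by hand. The following do-file fragment is a sketch of the standard DerSimonian
and Laird (1986) moment formula applied to the example dataset of section 4 (variables effsize
and se); it reproduces the textbook formula and is not metaan's internal code.
* DerSimonian-Laird moment estimator of tau^2, with inverse-variance weights w_i = 1/se_i^2
generate double w  = 1/se^2
quietly summarize w
scalar S1 = r(sum)                        // sum of weights
scalar k  = r(N)                          // number of studies
generate double w2 = w^2
quietly summarize w2
scalar S2 = r(sum)
generate double wy = w*effsize
quietly summarize wy
scalar mu_fe = r(sum)/S1                  // fixed-effect (fe) pooled estimate
generate double qi = w*(effsize - mu_fe)^2
quietly summarize qi
scalar Q = r(sum)                         // Cochran's Q
scalar tausq_dl = max(0, (Q - (k - 1))/(S1 - S2/S1))
display "tau^2 (DL) = " tausq_dl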
The PE model (Follmann and Proschan 1999) can be described as follows: First, in
line with a null hypothesis that all true study effects are zero and observed effects are due
to random variation, a dataset of all possible combinations of observed study outcomes
is created by permuting the sign of each observed effect. Next the dl model is used to
compute an overall effect for each combination. Finally, the resulting distribution of
overall effect sizes is used to derive a CI for the observed overall effect.
Method performance is known to be affected by three factors: the number of studies
in the meta-analysis, the degree of heterogeneity in true effects, and, provided there is
heterogeneity present, the distribution of the true effects (Brockwell and Gordon 2001).
Heterogeneity, which is attributed to clinical or methodological diversity (Higgins and
Green 2006), is a major problem researchers have to face when combining study results
in a meta-analysis. The variability that arises from different interventions, populations,
outcomes, or follow-up times is described by clinical heterogeneity, while differences in
trial design and quality are accounted for by methodological heterogeneity (Thompson
1994). Traditionally, heterogeneity is tested with Cochran's Q, which provides a p-value
for the test of homogeneity, when compared with a χ²_{k−1} distribution, where k is the
number of studies (Brockwell and Gordon 2001). However, the test is known to be poor
at detecting heterogeneity because its power is low when the number of studies is small
(Hardy and Thompson 1998). An alternative measure is I², which is thought to be more
informative in assessing inconsistency between studies. I² values of 25%, 50%, and 75%
correspond to low, moderate, and high heterogeneity, respectively (Higgins et al. 2003).
Another measure is H²_M, the measure least affected by the value of k. It takes values in
the [0, +∞) range, with 0 indicating perfect homogeneity (Mittlböck and Heinzl 2006).
Obviously, the between-studies variance estimate τ² can also be informative about the
presence or absence of heterogeneity.
The test for heterogeneity is often used as the basis for applying an FE or a random-effects
model. However, the often low power of the Q test makes it unwise to base a
decision on the result of the test alone. Research studies, even on the same topic, can
vary on a large number of factors; hence, homogeneity is often an unlikely assumption
and some degree of variability between studies is to be expected (Thompson and Pocock
1991). Some authors recommend the adoption of a random-effects model unless there
are compelling reasons for doing otherwise, irrespective of the outcome of the test for
heterogeneity (Brockwell and Gordon 2001).
However, even though random-effects methods model heterogeneity, the performance
of the ML models (ML, REML, and PL) in situations where the true effects violate the
assumptions of a normal distribution may not be optimal (Brockwell and Gordon 2001;
Hardy and Thompson 1998; Böhning et al. 2002; Sidik and Jonkman 2007). The number
of studies in the analysis is also an issue, because most meta-analysis models (including
DL, ML, REML, and PL, but not PE) are only asymptotically correct; that is, they
provide the theoretical 95% coverage only as the number of studies increases (approaches
infinity). Method performance is therefore affected when the number of studies is small,
but the extent depends on the model (some are more susceptible), along with the degree
of heterogeneity and the distribution of the true effects (Brockwell and Gordon 2001).
4 Example
As an example, we apply the metaan command to health-risk outcome data from seven
studies. The information was collected for an unpublished meta-analysis, and the data
are available from the authors. Using the describe and list commands, we provide
details of the dataset and proceed to perform a univariate meta-analysis with metaan.
. use metaan_example
. describe
Contains data from metaan_example.dta
obs: 7
vars: 4 19 Apr 2010 12:19
size: 560 (99.9% of memory free)
storage display value
variable name type format label variable label
study str16 %16s First author and year
outcome str48 %35s Outcome description
effsize float %9.0g effect sizes
se float %9.0g SE of the effect sizes
Sorted by: study outcome
. list study outcome effsize se, noobs clean
study outcome effsize se
Bakx A, 1985 Serum cholesterol (mmol/L) -.3041526 .0958199
Campbell A, 1998 Diet .2124063 .0812414
Cupples, 1994 BMI .0444239 .090661
Eckerlund SBP -.3991309 .12079
Moher, 2001 Cholesterol (mmol/l) -.9374746 .0691572
Woolard A, 1995 Alcohol intake (g/week) -.3098185 .206331
Woolard B, 1995 Alcohol intake (g/week) -.4898825 .2001602
. metaan effsize se, pl label(study) forest
Profile Likelihood method selected
Study Effect [95% Conf. Interval] % Weight
Bakx A, 1985 -0.304 -0.492 -0.116 15.09
Campbell A, 1998 0.212 0.053 0.372 15.40
Cupples, 1994 0.044 -0.133 0.222 15.20
Eckerlund -0.399 -0.636 -0.162 14.49
Moher, 2001 -0.937 -1.073 -0.802 15.62
Woolard A, 1995 -0.310 -0.714 0.095 12.01
Woolard B, 1995 -0.490 -0.882 -0.098 12.19
Overall effect (pl) -0.308 -0.622 0.004 100.00
ML method succesfully converged
PL method succesfully converged for both upper and lower CI limits
Heterogeneity Measures
value df p-value
Cochrane Q 139.81 6 0.000
I^2 (%) 91.96
H^2 11.44
value [95% Conf. Interval]
tau^2 est 0.121 0.000 0.449
Estimate obtained with Maximum likelihood - Profile likelihood provides the CI
PL method succesfully converged for both upper and lower CI limits of the tau^2
> estimate
The PL model used in the example converged successfully, as did ML, whose convergence
is a prerequisite. The overall effect is not found to be significant at the 95% level,
and there is considerable heterogeneity across studies, according to the measures. The
model also displays a 95% CI for the between-studies variance estimate τ² (provided
that convergence is achieved, as is the case in this example). The forest plot created by
the command is displayed in figure 1.
[Forest plot: study-specific effect sizes and CIs with the overall PL effect; x axis "Effect sizes and CIs", y axis "Studies"; note: "Original weights (squares) displayed. Largest to smallest ratio: 1.30"]
Figure 1. Forest plot displaying PL meta-analysis
When we reexecute the analysis with the plplot(mu) and plplot(tsq) options, we
obtain the log-likelihood function plots shown in figures 2 and 3.
[Likelihood plot: log likelihood versus tau values, for mu fixed to the ML/PL estimate]
Figure 2. Log-likelihood function plot for μ fixed to the model estimate
[Likelihood plot: log likelihood versus mu values, for tau fixed to the ML/PL estimate]
Figure 3. Log-likelihood function plot for τ² fixed to the model estimate
5 Discussion
The metaan command can be a useful meta-analysis tool that includes newer and, in
certain circumstances, better-performing models than the standard DL random-effects
model. Unpublished results exploring model performance in various scenarios are available
from the authors. Future work will involve implementing more models in the
metaan command and embellishing the forest plot.
6 Acknowledgments
We would like to thank the authors of meta and metan for all their work and the
anonymous reviewer whose useful comments improved the article considerably.
7 References
Böhning, D., U. Malzahn, E. Dietz, P. Schlattmann, C. Viwatwongkasem, and A. Biggeri.
2002. Some general points in estimating heterogeneity variance with the
DerSimonian–Laird estimator. Biostatistics 3: 445–457.
Brockwell, S. E., and I. R. Gordon. 2001. A comparison of statistical methods for
meta-analysis. Statistics in Medicine 20: 825–840.
DerSimonian, R., and N. Laird. 1986. Meta-analysis in clinical trials. Controlled Clinical
Trials 7: 177–188.
Follmann, D. A., and M. A. Proschan. 1999. Valid inference in random effects meta-analysis.
Biometrics 55: 732–737.
Harbord, R. M., and J. P. T. Higgins. 2008. Meta-regression in Stata. Stata Journal 8:
493–519.
Hardy, R. J., and S. G. Thompson. 1996. A likelihood approach to meta-analysis with
random effects. Statistics in Medicine 15: 619–629.
Hardy, R. J., and S. G. Thompson. 1998. Detecting and describing heterogeneity in
meta-analysis. Statistics in Medicine 17: 841–856.
Harris, R. J., M. J. Bradburn, J. J. Deeks, R. M. Harbord, D. G. Altman, and J. A. C.
Sterne. 2008. metan: Fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.
Higgins, J. P. T., and S. Green. 2006. Cochrane Handbook for Systematic Reviews of
Interventions Version 4.2.6.
http://www2.cochrane.org/resources/handbook/Handbook4.2.6Sep2006.pdf.
Higgins, J. P. T., and S. Green. 2009. Cochrane Handbook for Systematic Reviews of
Interventions Version 5.0.2. http://www.cochrane-handbook.org/.
Higgins, J. P. T., S. G. Thompson, J. J. Deeks, and D. G. Altman. 2003. Measuring
inconsistency in meta-analyses. British Medical Journal 327: 557–560.
Huque, M. F. 1988. Experiences with meta-analysis in NDA submissions. Proceedings
of the Biopharmaceutical Section of the American Statistical Association 2: 28–33.
Kontopantelis, E., and D. Reeves. 2009. MetaEasy: A meta-analysis add-in for Microsoft
Excel. Journal of Statistical Software 30: 1–25.
Lambert, P. C., A. J. Sutton, K. R. Abrams, and D. R. Jones. 2002. A comparison
of summary patient-level covariates in meta-regression with individual patient data
meta-analysis. Journal of Clinical Epidemiology 55: 86–94.
Mittlböck, M., and H. Heinzl. 2006. A simulation study comparing properties of heterogeneity
measures in meta-analyses. Statistics in Medicine 25: 4321–4333.
Normand, S.-L. T. 1999. Meta-analysis: Formulating, evaluating, combining, and reporting.
Statistics in Medicine 18: 321–359.
Olkin, I., and A. Sampson. 1998. Comparison of meta-analysis versus analysis of variance
of individual patient data. Biometrics 54: 317–322.
Sidik, K., and J. N. Jonkman. 2007. A comparison of heterogeneity variance estimators
in combining results of studies. Statistics in Medicine 26: 1964–1981.
Stewart, L. A. 1995. Practical methodology of meta-analyses (overviews) using updated
individual patient data. Statistics in Medicine 14: 2057–2079.
Thompson, S. G. 1994. Systematic review: Why sources of heterogeneity in meta-analysis
should be investigated. British Medical Journal 309: 1351–1355.
Thompson, S. G., and S. J. Pocock. 1991. Can meta-analyses be trusted? Lancet 338:
1127–1130.
Thompson, S. G., and S. J. Sharp. 1999. Explaining heterogeneity in meta-analysis: A
comparison of methods. Statistics in Medicine 18: 2693–2708.
White, I. R. 2009. Multivariate random-effects meta-analysis. Stata Journal 9: 40–56.
About the authors
Evangelos (Evan) Kontopantelis is a research fellow in statistics at the National Primary Care
Research and Development Centre, University of Manchester, England. His research interests
include statistical methods in health sciences with a focus on meta-analysis, longitudinal data
modeling, and large clinical database management.
David Reeves is a senior research fellow in statistics at the Health Sciences Primary Care
Research Group, University of Manchester, England. David has worked as a statistician in
health services research for nearly three decades, mainly in the fields of learning disability
and primary care. His methodological research interests include the robustness of statistical
methods, the analysis of observational studies, and applications of social network analysis
methods to health systems.
The Stata Journal (2010)
10, Number 3, pp. 408–422
Regression analysis of censored data using
pseudo-observations
Erik T. Parner
University of Aarhus
Aarhus, Denmark
parner@biostat.au.dk
Per K. Andersen
University of Copenhagen
Copenhagen, Denmark
P.K.Andersen@biostat.ku.dk
Abstract. We draw upon a series of articles in which a method based on pseu-
dovalues is proposed for direct regression modeling of the survival function, the
restricted mean, and the cumulative incidence function in competing risks with
right-censored data. The models, once the pseudovalues have been computed, can
be t using standard generalized estimating equation software. Here we present
Stata procedures for computing these pseudo-observations. An example from a
bone marrow transplantation study is used to illustrate the method.
Keywords: st0202, stpsurv, stpci, stpmean, pseudovalues, time-to-event, survival
analysis
1 Introduction
Statistical methods in survival analysis need to deal with data that are incomplete
because of right-censoring; a host of such methods are available, including the Kaplan–Meier
estimator, the log-rank test, and the Cox regression model. If one had complete
data, standard methods for quantitative data could be applied directly for the observed
survival time X, or methods for binary outcomes could be applied by dichotomizing
X as I(X > τ) for a suitably chosen τ. With complete data, one could furthermore
set up regression models for any function f(X) and check such models using standard
graphical methods such as scatterplots or residuals for quantitative or binary outcomes.
One way of achieving these goals with censored survival data and with more-general
event history data (for example, competing-risks data) is to use a technique based on
pseudo-observations, as recently described in a series of articles. Thus the technique
has been studied in modeling of the survival function (Klein et al. 2007), the restricted
mean (Andersen, Hansen, and Klein 2004), and the cumulative incidence function in
competing risks (Andersen, Klein, and Rosthøj 2003; Klein and Andersen 2005; Klein
2006; Andersen and Klein 2007).
The basic idea is simple. Suppose a well-behaved estimator θ̂ for the expectation
θ = E{f(X)} is available (for example, the Kaplan–Meier estimator for S(t) = E{I(X > t)}),
based on a sample of size n. The ith pseudo-observation (i = 1, . . . , n) for f(X) is
then defined as

θ̂_i = n θ̂ − (n − 1) θ̂^{−i}

where θ̂^{−i} is the estimator applied to the sample of size n − 1, which is obtained by
eliminating the ith observation from the dataset. The pseudovalues are generated once,
and the idea is to replace the incompletely observed
f(X_i) by θ̂_i. That is, θ̂_i may be used as an outcome variable in a regression model,
or it may be used to compute residuals. θ̂_i also may be used in a scatterplot when
assessing model assumptions (Perme and Andersen 2008; Andersen and Perme 2010).
The intuition is that, in the absence of censoring, θ = E{f(X)} could, obviously, be
estimated as (1/n) Σ_i f(X_i), in which case the ith pseudo-observation is simply the
observed value f(X_i). The pseudovalues are related to the jackknife residuals used in
regression diagnostics.
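As an informal illustration of this definition, the following do-file fragment is a brute-force
sketch (not part of the commands introduced below, which automate the computation). It
assumes the data are already stset, that interest is in S(530), that at least one exit time lies
at or below 530, and it uses hypothetical names S_all, S_i, and pseudo530.
* leave-one-out Kaplan-Meier pseudo-observations: n*KM_full - (n-1)*KM_without_i at t = 530
quietly sts generate S_all = s
quietly summarize S_all if _t <= 530, meanonly
scalar theta_full = r(min)                     // full-sample KM estimate of S(530)
generate double pseudo530 = .
forvalues i = 1/`=_N' {
    quietly sts generate S_i = s if _n != `i'  // KM on the sample without obs i
    quietly summarize S_i if _t <= 530, meanonly
    quietly replace pseudo530 = _N*theta_full - (_N - 1)*r(min) in `i'
    drop S_i
}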
We present three new Stata commands (stpsurv, stpci, and stpmean) that provide
a new possibility in Stata for analyzing regression models and that generate pseudovalues
(respectively) for the survival function (or the cumulative distribution function,
the cumulative incidence) under right-censoring, for the cumulative incidence
in competing risks, and for the restricted mean under right-censoring. Cox regression
models can be fit using the pseudovalue function for survival probabilities in several
time points. Thereby, the pseudovalue method provides an alternative to Cox regres-
sion, for example, in situations where rates are not proportional. As discussed by
Perme and Andersen (2008), residuals for model checking may also be obtained from
the pseudovalues. An example based on bone marrow transplantation data is presented
to illustrate the methodology.
In section 2, we briefly present the general pseudovalue approach to censored data
regression. In section 3, we present the new Stata commands; and in section 4, we show
examples of the use of the commands. Section 5 concludes with some remarks.
2 Some methodological details
2.1 The general approach
In this section, we briefly introduce censored data regression based on pseudo-observations;
see, for example, Andersen, Klein, and Rosthøj (2003) or Andersen and Perme
(2010) for more details. Let X_1, . . . , X_n be independent and identically distributed
survival times, and suppose we are interested in a parameter of the form

θ = E{f(X)}

for some function f(·). This function could be multivariate, for example,

f(X) = {f_1(X), . . . , f_M(X)} = {I(X > τ_1), . . . , I(X > τ_M)}

for a series of time points τ_1, . . . , τ_M, in which case,

θ = (θ_1, . . . , θ_M) = {S(τ_1), . . . , S(τ_M)}

where S(·) is the survival function for X. More examples are provided below. Furthermore,
let Z_1, . . . , Z_n be independent and identically distributed covariates. Also
suppose we are interested in a regression model of θ = E{f(X_i)} on Z_i, for example,
a generalized linear model of the form

g[E{f(X_i) | Z_i}] = β^T Z_i
where g(·) is the link function. If right-censoring prevents us from observing all the
X_i's, then it is not simple to analyze this regression model. However, suppose θ̂ is
an approximately unbiased estimator of the marginal mean θ = E{f(X)} that may
be computed from the sample of right-censored observations. If f(X) = I(X > τ),
then θ = S(τ) may be estimated using the Kaplan–Meier estimator. The ith pseudo-observation
is now defined, as suggested in section 1, as

θ̂_i = n θ̂ − (n − 1) θ̂^{−i}

Here θ̂^{−i} is the leave-one-out estimator for θ based on all observations but the ith:
X_j, j ≠ i. The idea is to replace the possibly incompletely observed f(X_i) by θ̂_i and to
obtain estimates of the β's based on the estimating equation:

Σ_i {∂ g^{−1}(β^T Z_i)/∂β}^T V_i^{−1}(β) {θ̂_i − g^{−1}(β^T Z_i)} = Σ_i U_i(β) = U(β) = 0    (1)
In (1), V_i is a working covariance matrix. Graw, Gerds, and Schumacher (2009) showed
that for the examples studied in this article, E{f(X_i) | Z_i} = E(θ̂_i | Z_i), and thereby
(1) is unbiased, provided that censoring is independent of covariates; see also Andersen
and Perme (2010). A sandwich estimator is used to estimate the variance of β̂. Let

I(β) = Σ_i {∂ g^{−1}(β^T Z_i)/∂β}^T V_i^{−1}(β) {∂ g^{−1}(β^T Z_i)/∂β}

and

V̂ar{U(β̂)} = Σ_i U_i(β̂) U_i(β̂)^T

then

V̂ar(β̂) = I(β̂)^{−1} V̂ar{U(β̂)} I(β̂)^{−1}

The estimator of β can be shown to be asymptotically normal (Graw, Gerds, and
Schumacher 2009; Liang and Zeger 1986), and the sandwich estimator converges in
probability to the true variance. Once the pseudo-observations have been computed,
the estimators of β can be obtained by using standard software for generalized estimating
equations.
The pseudo-observations may also be used to define residuals after fitting some
standard model (for example, a Cox regression model) for survival data; see Perme and
Andersen (2008) or Andersen and Perme (2010).
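In practice (and as illustrated in section 4), solving (1) with an independence working
covariance and the sandwich variance amounts to an ordinary glm fit with robust standard
errors on the pseudovalues. A minimal sketch for a single time point, with hypothetical
covariates z1 and z2 and the pseudovalues already stored in pseudo, is
. glm pseudo z1 z2, link(cloglog) vce(robust)
With several time points stacked in long format, vce(cluster id) replaces vce(robust), as in
the examples below.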
2.2 The survival function
Suppose we are interested in the survival function S(τ_j) = Pr(X > τ_j) at a grid of
time points τ_1 < · · · < τ_M, for a survival time X. Hence, θ = (θ_1, . . . , θ_M), where
θ_j = S(τ_j). When M = 1, we consider the survival function at a single point in time.
Under right-censoring, the survival function is estimated by the Kaplan–Meier estimator
(Kaplan and Meier 1958),

Ŝ(t) = Π_{t_j ≤ t} (Y_j − d_j)/Y_j

where t_1 < · · · < t_D are the distinct event times, Y_j is the number at risk, and d_j is the
number of events at time t_j. The cumulative distribution function is then estimated by
F̂(t) = 1 − Ŝ(t). In this case, the link function of interest could be the cloglog function

cloglog{F(τ)} = log[−log{1 − F(τ)}]

which is equivalent to a Cox regression model for the survival function evaluated in τ.
2.3 The mean survival time
The mean time-to-event is the area under the survival curve:

μ = ∫_0^∞ S(u) du    (2)

For right-censored data, the estimated survival function (the Kaplan–Meier estimator)
does not always converge down to zero. Then the mean cannot be estimated reliably
by plugging the Kaplan–Meier estimator into (2). An alternative to the mean is the
restricted mean, defined as the area under the survival curve up to a time τ < ∞
(Klein and Moeschberger 2003), which is equal to μ_τ = E{min(X, τ)}. The restricted
mean survival time is estimated by the area under the Kaplan–Meier curve up
to time τ. That is,

μ̂_τ = ∫_0^τ Ŝ(u) du

An alternative mean is the conditional mean given that the event time is smaller than
τ, μ^c_τ = E(X | X ≤ τ), which is similarly estimated by

μ̂^c_τ = ∫_0^τ {Ŝ(u) − Ŝ(τ)}/{1 − Ŝ(τ)} du

For the restricted and conditional mean, a link function of interest could be the log or
the identity.
2.4 The cumulative incidence
Under competing risks, the cumulative incidence function is estimated in a different
way. Suppose the event of interest has hazard function h_1(t) and the competing risk
has hazard function h_2(t). The cumulative incidence function for the event of interest
is then given as

F_1(t) = ∫_0^t h_1(u) exp[ −∫_0^u {h_1(v) + h_2(v)} dv ] du

If t_1 < · · · < t_D are the distinct times of the primary event and the competing risk
combined, Y_j is the number at risk, d_{1j} is the number of primary events at time
t_j, and d_{2j} is the number of competing-risk events at time t_j, then the cumulative
incidence function of the primary event is estimated by

F̂_1(t) = Σ_{t_j ≤ t} (d_{1j}/Y_j) Π_{t_i < t_j} {Y_i − (d_{1i} + d_{2i})}/Y_i

Again, the link function of interest could be cloglog, corresponding to the regression
model for the competing-risks cumulative incidence studied by Fine and Gray (1999).
3 The stpsurv, stpmean, and stpci commands
3.1 Syntax
Pseudovalues for the survival function, the mean survival time, and the cumulative
incidence function for competing risks are generated using the following syntaxes:
stpsurv [if] [in], at(numlist) [generate(string) failure]

stpmean [if] [in], at(numlist) [generate(string) conditional]

stpci varname [if] [in], at(numlist) [generate(string)]
stpsurv, stpmean, and stpci are for use with st data. You must, therefore, stset
your data before issuing these commands. Frequency weights are allowed in the stset
command. In the stpci command for the cumulative incidence function in competing
risks, an indicator variable for the competing risks should always be specified. The
pseudovalues are by default stored in the pseudo variable when one time point is specified
and are stored in variables pseudo1, pseudo2, . . . when several time points are
specified. The names of the pseudovariables are changed by the generate() option.
3.2 Options
at(numlist) specifies the time points, in ascending order, at which pseudovalues should
be computed. at() is required.
generate(string) specifies a variable name for the pseudo-observations. The default is
generate(pseudo).
failure generates pseudovalues for the cumulative incidence proportion, which is one
minus the survival function.
conditional specifies that pseudovalues for the conditional mean should be computed
instead of those for the restricted mean.
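For orientation, typical calls look as follows (a sketch, not a session log): the data must
already be stset, compet is a hypothetical indicator of the competing event for stpci, and the
names inside generate() are arbitrary.
. stpsurv, at(365) failure generate(pcif)
. stpmean, at(1500) conditional generate(pcmean)
. stpci compet, at(365 730) generate(pci)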
4 Example data
To illustrate the pseudovalue approach, we use data on sibling-donor bone marrow
transplants matched on human leukocyte antigen (Copelan et al. 1991). The data
are available in Klein and Moeschberger (2003). The data include information on
137 transplant patients on time to death, relapse, or lost to follow-up (tdfs); the
indicators of relapse and death (relapse, trm); the indicator of treatment failure
(dfs = relapse | trm); and three factors that may be related to outcome: disease
[acute lymphocytic leukemia (ALL), low-risk acute myeloid leukemia (AML), and high-
risk AML], the French–American–British (FAB) disease grade for AML (fab = 1 if AML
and grade 4 or 5; 0 otherwise), and recipient age at transplant (age).
4.1 The survival function at a single time point
We will first examine regression models for disease free survival at 530 days based on
the Kaplan–Meier estimator. Disease free survival probabilities for the single prognostic
factor FAB at 530 days (figure 1) can be compared using information obtained using the
Stata sts list command, which evaluates the Kaplan–Meier estimator.
[Kaplan–Meier plot: probability of disease free survival versus time (days), by FAB group (Fab=1, Fab=0)]
Figure 1. Disease free survival
Based on the sts list output below, the risk difference (RD) for FAB is computed
as RD = 0.333 − 0.541 = −0.207 [95% confidence interval: −0.379, −0.039], and the
relative risk (RR) for FAB is RR = 0.333/0.541 = 0.616, where FAB = 0 is chosen as the
reference group. The confidence interval of the RD is based on computing the standard
error of the RD as (0.0522² + 0.0703²)^{1/2}. The confidence interval for the RR is not
easily estimated using the information from the sts list command.
. use bmt
. stset tdfs, failure(dfs==1)
failure event: dfs == 1
obs. time interval: (0, tdfs]
exit on or before: failure
137 total obs.
0 exclusions
137 obs. remaining, representing
83 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. sts list, at(0 530) by(fab)
failure _d: dfs == 1
analysis time _t: tdfs
Beg. Survivor Std.
Time Total Fail Function Error [95% Conf. Int.]
fab=0
0 0 0 1.0000 . . .
530 49 42 0.5408 0.0522 0.4334 0.6364
fab=1
0 0 0 1.0000 . . .
530 16 30 0.3333 0.0703 0.2018 0.4704
Note: survivor function is calculated over full data and evaluated at
indicated times; it is not calculated from aggregates shown at left.
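As a side note, the RD and its standard error quoted above can be reproduced from this
listing with display (a sketch using the rounded survivor estimates and standard errors shown;
output omitted):
. display 0.3333 - 0.5408
. display sqrt(0.0522^2 + 0.0703^2)
. display (0.3333 - 0.5408) - 1.96*sqrt(0.0522^2 + 0.0703^2)
. display (0.3333 - 0.5408) + 1.96*sqrt(0.0522^2 + 0.0703^2)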
Now we turn to the pseudovalues approach. We start by computing the pseudovalues
at 530 days using the stpsurv command. The pseudovalues are stored in the pseudo
variable.
. stpsurv, at(530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
The pseudovalues are analyzed in generalized linear models with an identity link
function and a log link function, respectively.
. glm pseudo i.fab, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -96.989802
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
1.fab -.2080377 .0881073 -2.36 0.018 -.3807248 -.0353506
_cons .5406774 .0522411 10.35 0.000 .4382867 .6430681
. glm pseudo i.fab, link(log) vce(robust) eform noheader
Iteration 0: log pseudolikelihood = -123.14846
Iteration 1: log pseudolikelihood = -101.53512
Iteration 2: log pseudolikelihood = -96.991808
Iteration 3: log pseudolikelihood = -96.989802
Iteration 4: log pseudolikelihood = -96.989802
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
1.fab .6152278 .1440588 -2.07 0.038 .3887968 .9735298
The generalized linear models with an identity link function and a log link function
fit the relations

p_i = E(X_i) = β_0 + β_1 FAB_i
log(p_i) = log{E(X_i)} = β*_0 + β*_1 FAB_i

respectively, where p_i = S_i(530) is the disease free survival probability at 530 days for
individual i. Hence, based on the pseudovalues approach, we estimate the RD for FAB
by RD = −0.208 [95% confidence interval: −0.381, −0.035] and the RR for FAB by
RR = 0.615 [95% confidence interval: 0.389, 0.974]. The results are very similar to the
direct computation from the Kaplan–Meier using the sts list command. We now
obtain the confidence interval for the RR.
Suppose we wish to compute the RR for FAB, adjusting for disease as a categorical
variable and age as a continuous variable. Using the same pseudovalues, we fit the
generalized linear model.
. glm pseudo i.fab i.disease age, link(log) vce(robust) eform noheader
Iteration 0: log pseudolikelihood = -114.83229
Iteration 1: log pseudolikelihood = -93.440112
Iteration 2: log pseudolikelihood = -88.620704
Iteration 3: log pseudolikelihood = -88.601028
Iteration 4: log pseudolikelihood = -88.601013
Iteration 5: log pseudolikelihood = -88.601013
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
1.fab .6322634 .1665066 -1.74 0.082 .3773412 1.059405
disease
2 1.951343 .412121 3.17 0.002 1.289914 2.951931
3 1.005533 .3586364 0.02 0.988 .4998088 2.022965
age .9856265 .0080274 -1.78 0.075 .970018 1.001486
Patients with AML and grade 4 or 5 (FAB = 1) have a 27% reduced disease free
survival probability at 530 days, when adjusting for disease and age.
4.2 The survival function at several time points
In this example, we compute pseudovalues at five data points roughly equally spaced on
the event scale: 50, 105, 170, 280, and 530 days. To fit the model log[−log{S(t | Z)}] =
log{Λ_0(t)} + βZ, we can use the cloglog link on the pseudovalues on failure probabilities;
that is, we fit a Cox regression model for the five time points simultaneously.
. stpsurv, at(50 105 170 280 530) failure
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long
Number of obs. 137 -> 685
Number of variables 32 -> 29
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo
. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id) noheader
Iteration 0: log pseudolikelihood = -468.74476
Iteration 1: log pseudolikelihood = -457.41878 (not concave)
Iteration 2: log pseudolikelihood = -406.98781
Iteration 3: log pseudolikelihood = -365.23278
Iteration 4: log pseudolikelihood = -350.7435
Iteration 5: log pseudolikelihood = -349.97156
Iteration 6: log pseudolikelihood = -349.96409
Iteration 7: log pseudolikelihood = -349.96409
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
times
2 1.114256 .3269323 3.41 0.001 .4734805 1.755032
3 1.626173 .3567925 4.56 0.000 .9268721 2.325473
4 2.004267 .3707305 5.41 0.000 1.277649 2.730885
5 2.495327 .3824645 6.52 0.000 1.745711 3.244944
1.fab .7619547 .354821 2.15 0.032 .0665183 1.457391
disease
2 -1.195542 .4601852 -2.60 0.009 -2.097489 -.2935959
3 .0036343 .3791488 0.01 0.992 -.7394838 .7467524
age .0130686 .0146629 0.89 0.373 -.0156702 .0418074
_cons -2.981582 .6066311 -4.91 0.000 -4.170557 -1.792607
The estimated survival function in this model for a patient at time t with a set of
covariates Z is S(t) = exp{−Λ_0(t)e^{βZ}}, where

Λ̂_0(50)  = exp(−2.9816) = 0.051
Λ̂_0(105) = exp(−2.9816 + 1.1143) = 0.155
Λ̂_0(170) = exp(−2.9816 + 1.6262) = 0.258
Λ̂_0(280) = exp(−2.9816 + 2.0043) = 0.376
Λ̂_0(530) = exp(−2.9816 + 2.4953) = 0.615

The model shows that patients with AML who are at low risk have better disease
free survival than ALL patients [RR = exp(−1.1955) = 0.30] and that AML patients with
grade 4 or 5 FAB have a lower disease free survival [RR = exp(0.7620) = 2.14].
Without recomputing the pseudovalues, we can examine the effect of FAB over time.
. generate fab50=(fab==1 & times==1)
. generate fab105=(fab==1 & times==2)
. generate fab170=(fab==1 & times==3)
. generate fab280=(fab==1 & times==4)
. generate fab530=(fab==1 & times==5)
. glm pseudo i.times fab50-fab530 i.disease age, link(cloglog) vce(cluster id)
> noheader eform
Iteration 0: log pseudolikelihood = -471.86839
Iteration 1: log pseudolikelihood = -464.24832 (not concave)
Iteration 2: log pseudolikelihood = -406.31257
Iteration 3: log pseudolikelihood = -361.28364
Iteration 4: log pseudolikelihood = -349.90468
Iteration 5: log pseudolikelihood = -349.44613
Iteration 6: log pseudolikelihood = -349.43492
Iteration 7: log pseudolikelihood = -349.43485
Iteration 8: log pseudolikelihood = -349.43485
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
2 3.99608 2.023867 2.74 0.006 1.480921 10.78292
3 8.225489 4.601898 3.77 0.000 2.747526 24.62531
4 11.89654 6.835021 4.31 0.000 3.858093 36.68333
5 19.20116 11.25862 5.04 0.000 6.084498 60.59409
fab50 4.047315 3.227324 1.75 0.080 .8480474 19.31586
fab105 2.866106 1.433666 2.11 0.035 1.07525 7.639677
fab170 2.008426 .795497 1.76 0.078 .9240856 4.365155
fab280 2.022028 .7258472 1.96 0.050 1.000533 4.086419
fab530 2.048864 .7838364 1.87 0.061 .9679838 4.33669
disease
2 .3024683 .1368087 -2.64 0.008 .1246451 .7339808
3 .9993425 .3815547 -0.00 0.999 .4728471 2.112069
age 1.012745 .0148835 0.86 0.389 .9839899 1.04234
. test fab50=fab105=fab170=fab280=fab530
( 1) [pseudo]fab50 - [pseudo]fab105 = 0
( 2) [pseudo]fab50 - [pseudo]fab170 = 0
( 3) [pseudo]fab50 - [pseudo]fab280 = 0
( 4) [pseudo]fab50 - [pseudo]fab530 = 0
chi2( 4) = 1.73
Prob > chi2 = 0.7855
The model shows that there is no statistically significant difference in the FAB effect
over time (p = 0.79); that is, proportional hazards are not contraindicated for FAB.
4.3 The restricted mean
For the restricted mean time to treatment failure, we use the stpmean command. To
illustrate, we look at a regression model for the mean time to treatment failure restricted
to 1,500 days. Here we use the identity link function.
. stpmean, at(1500)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
. glm pseudo i.fab i.disease age, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -1065.6767
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
1.fab -352.0442 123.311 -2.85 0.004 -593.7293 -110.359
disease
2 461.1214 134.0932 3.44 0.001 198.3036 723.9391
3 78.00616 158.8357 0.49 0.623 -233.3061 389.3184
age -8.169236 5.060915 -1.61 0.106 -18.08845 1.749976
_cons 895.118 159.1586 5.62 0.000 583.173 1207.063
Here we see that low-risk AML patients have the longest restricted mean life, namely,
461.1 days longer than ALL patients within 1,500 days.
4.4 Competing risks
For the cumulative incidence function, we use the stpci command to compute the
pseudovalues. To illustrate, we use the complementary loglog model to the relapse
cumulative incidence evaluated at 50, 105, 170, 280, and 530 days. The event of interest
is death in remission. Here relapse is a competing event.
. stset tdfs, failure(trm==1)
failure event: trm == 1
obs. time interval: (0, tdfs]
exit on or before: failure
137 total obs.
0 exclusions
137 obs. remaining, representing
42 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. generate compet=(trm==0 & relapse==1)
. stpci compet, at(50 105 170 280 530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long
Number of obs. 137 -> 685
Number of variables 33 -> 30
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo
. fvset base none times
. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id)
> noheader noconst eform
Iteration 0: log pseudolikelihood = -462.96735 (not concave)
Iteration 1: log pseudolikelihood = -348.27329
Iteration 2: log pseudolikelihood = -221.69131
Iteration 3: log pseudolikelihood = -198.31467
Iteration 4: log pseudolikelihood = -197.38196
Iteration 5: log pseudolikelihood = -197.37526
Iteration 6: log pseudolikelihood = -197.37524
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
1 .0286012 .0292766 -3.47 0.001 .0038467 .21266
2 .0791623 .0547411 -3.67 0.000 .0204131 .306993
3 .1261608 .0823572 -3.17 0.002 .0350965 .4535083
4 .1781601 .1117597 -2.75 0.006 .0521017 .6092124
5 .2383869 .1488814 -2.30 0.022 .0700932 .8107537
1.fab 3.104153 1.52811 2.30 0.021 1.182808 8.146518
disease
2 .1708985 .1154623 -2.61 0.009 .0454622 .6424309
3 .7829133 .466016 -0.41 0.681 .2438093 2.514068
age 1.014382 .0258272 0.56 0.575 .9650037 1.066286
Here we are modeling C(t | Z) = 1 − exp{−Λ_0(t)e^{βZ}}. Positive values of β for a
covariate suggest a larger cumulative incidence for patients with Z = 1. The model
suggests that the low-risk AML patients have the smallest risk of death in remission and
the AML FAB 4/5 patients have the highest risk of death in remission.
5 Conclusion
The pseudovalue method is a versatile tool for regression analysis of censored time-to-
event data. We have implemented the method for regression analysis of the survival
under right-censoring, for the cumulative incidence function under possible competing
risks, and for the restricted and conditional mean waiting time. Similar SAS macros and
R functions were presented by Klein et al. (2008).
6 References
Andersen, P. K., M. G. Hansen, and J. P. Klein. 2004. Regression analysis of restricted
mean survival time based on pseudo-observations. Lifetime Data Analysis 10: 335–350.
Andersen, P. K., and J. P. Klein. 2007. Regression analysis for multistate models
based on a pseudo-value approach, with applications to bone marrow transplantation
studies. Scandinavian Journal of Statistics 34: 3–16.
Andersen, P. K., J. P. Klein, and S. Rosthøj. 2003. Generalised linear models for
correlated pseudo-observations, with applications to multi-state models. Biometrika
90: 15–27.
Andersen, P. K., and M. P. Perme. 2010. Pseudo-observations in survival analysis.
Statistical Methods in Medical Research 19: 71–99.
Copelan, E. A., J. C. Biggs, J. M. Thompson, P. Crilley, J. Szer, J. P. Klein, N. Kapoor,
B. R. Avalos, I. Cunningham, K. Atkinson, K. Downs, G. S. Harmon, M. B. Daly,
I. Brodsky, S. I. Bulova, and P. J. Tutschka. 1991. Treatment for acute myelocytic
leukemia with allogeneic bone marrow transplantation following preparation with
BuCy2. Blood 78: 838–843.
Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution
of a competing risk. Journal of the American Statistical Association 94: 496–509.
Graw, F., T. A. Gerds, and M. Schumacher. 2009. On pseudo-values for regression
analysis in competing risks models. Lifetime Data Analysis 15: 241–255.
Kaplan, E. L., and P. Meier. 1958. Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association 53: 457–481.
Klein, J. P. 2006. Modeling competing risks in cancer studies. Statistics in Medicine
25: 1015–1034.
Klein, J. P., and P. K. Andersen. 2005. Regression modeling of competing risks data
based on pseudovalues of the cumulative incidence function. Biometrics 61: 223–229.
Klein, J. P., M. Gerster, P. K. Andersen, S. Tarima, and M. P. Perme. 2008. SAS and R
functions to compute pseudo-values for censored data regression. Computer Methods
and Programs in Biomedicine 89: 289–300.
Klein, J. P., B. Logan, M. Harhoff, and P. K. Andersen. 2007. Analyzing survival curves
at a fixed point in time. Statistics in Medicine 26: 4505–4519.
Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd ed. New York: Springer.
Liang, K.-Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear
models. Biometrika 73: 13–22.
Perme, M. P., and P. K. Andersen. 2008. Checking hazard regression models using
pseudo-observations. Statistics in Medicine 27: 5309–5328.
About the authors
Erik T. Parner has a PhD in statistics from the University of Aarhus. He is an associate profes-
sor of biostatistics at the University of Aarhus. His research fields are time-to-event analysis,
statistical methods in epidemiology and genetics, and the etiology and changing prevalence of
autism.
Per K. Andersen has a PhD in statistics and a DrMedSci degree in biostatistics, both from the
University of Copenhagen. He is a professor of biostatistics at the University of Copenhagen.
His main research fields are time-to-event analysis and statistical methods in epidemiology.
The Stata Journal (2010)
10, Number 3, pp. 423–457
Estimation of quantile treatment effects with
Stata
Markus Frölich
Universität Mannheim and
Institute for the Study of Labor
Bonn, Germany
froelich@uni-mannheim.de
Blaise Melly
Department of Economics
Brown University
Providence, RI
blaise_melly@brown.edu
Abstract. In this article, we discuss the implementation of various estimators
proposed to estimate quantile treatment effects. We distinguish four cases involving
conditional and unconditional quantile treatment effects with either exogenous
or endogenous treatment variables. The introduced ivqte command covers four
different estimators: the classical quantile regression estimator of Koenker and
Bassett (1978, Econometrica 46: 33–50) extended to heteroskedasticity-consistent
standard errors; the instrumental-variable quantile regression estimator of
Abadie, Angrist, and Imbens (2002, Econometrica 70: 91–117); the estimator for
unconditional quantile treatment effects proposed by Firpo (2007, Econometrica
75: 259–276); and the instrumental-variable estimator for unconditional quantile
treatment effects proposed by Frölich and Melly (2008, IZA discussion paper 3288).
The implemented instrumental-variable procedures estimate the causal effects for
the subpopulation of compliers and are only well suited for binary instruments.
ivqte also provides analytical standard errors and various options for nonparametric
estimation. As a by-product, the locreg command implements local linear
and local logit estimators for mixed data (continuous, ordered discrete, unordered
discrete, and binary regressors).
Keywords: st0203, ivqte, locreg, quantile treatment effects, nonparametric regression,
instrumental variables
1 Introduction
Ninety-five percent of applied econometrics is concerned with mean effects, yet distributional
effects are no less important. The distribution of the dependent variable may
change in many ways that are not revealed or are only incompletely revealed by an examination
of averages. For example, the wage distribution can become more compressed or
the upper-tail inequality may increase while the lower-tail inequality decreases. Therefore,
applied economists and policy makers are increasingly interested in distributional
effects. The estimation of quantile treatment effects (QTEs) is a powerful and intuitive
tool that allows us to discover the effects on the entire distribution. As an alternative
motivation, median regression is often preferred to mean regression to reduce suscep-
tibility to outliers. Hence, the estimators presented below may thus be particularly
appealing with noisy data such as wages or earnings. In this article, we provide a brief
survey over recent developments in this literature and a description of the new ivqte
command, which implements these estimators.
Depending on the type of endogeneity of the treatment and the definition of the
estimand, we can define four different cases. We distinguish between conditional and
unconditional effects and whether selection is on observables or on unobservables. Conditional
QTEs are defined conditionally on the value of the regressors, whereas unconditional
effects summarize the causal effect of a treatment for the entire population.
Selection on observables is often referred to as a matching assumption or as exogenous
treatment choice (that is, exogenous conditional on X). In contrast, we refer to selection
on unobservables as endogenous treatment choice.
First, if we are interested in conditional QTEs and we assume that the treatment
is exogenous (conditional on X), we can use the quantile regression estimators pro-
posed by Koenker and Bassett (1978). Second, if we are interested in conditional
QTEs but the treatment is endogenous, the instrumental-variable (IV) estimator of
Abadie, Angrist, and Imbens (2002) may be applied. Third, for estimating uncondi-
tional QTEs with exogenous treatment, various approaches have been suggested, for
example, Firpo (2007), Frölich (2007a), and Melly (2006). Currently, the weighting
estimator of Firpo (2007) is implemented. Finally, unconditional QTE in the presence
of an endogenous treatment can be estimated with the technique of Frölich and Melly
(2008). The estimators for the unconditional treatment effects do not rely on any (parametric)
functional form assumptions. On the other hand, for the conditional treatment
effects, √n convergence rate can only be obtained with a parametric restriction. Because
estimators affected by the curse of dimensionality are of less interest to the applied
economist, we will discuss only parametric (linear) estimators for estimating conditional
QTEs.
The implementation of most of these estimators requires the preliminary nonpara-
metric estimation of some kind of (instrument) propensity scores. We use nonparametric
linear and logistic regressions to estimate these propensity scores. As a by-product, we
also offer the locreg command for researchers interested only in these nonparametric
regression estimators. We allow for different types of regressors, including continuous,
ordered discrete, unordered discrete, and binary variables. A cross-validation routine is
implemented for choosing the smoothing parameters.
This article only discusses the implementation of the proposed estimators and the
syntax of the commands. It draws heavily on the more technical discussion in the
original articles, and the reader is referred to those articles for more background on,
and formal derivations of, some of the properties of the estimators described here.
The contributions of this article and the related commands are manifold. We provide
new standardized commands for the estimators proposed in Abadie, Angrist, and Imbens
(2002);¹ Firpo (2007); and Frölich and Melly (2008); and estimators of their analytical
standard errors. For the conditional exogenous case, we provide heteroskedasticity
consistent standard errors. The estimator of Koenker and Bassett (1978) has already
been implemented in Stata with the qreg command, but its estimated standard errors
1. Joshua Angrist provides codes in Matlab to replicate the empirical results of Abadie, Angrist, and
Imbens (2002). Our codes for this estimator partially build on his codes.
are not consistent in the presence of heteroskedasticity. The ivqte command thus
extends upon qreg in providing analytical standard errors for heteroskedastic errors.
At a higher level, locreg implements nonparametric estimation with both cate-
gorical and continuous regressors as suggested by Racine and Li (2004). Finally, we
incorporate cross-validation procedures to choose the smoothing parameters.
The next section outlines the definition of the estimands, the possible identifica-
tion approaches, and the estimators. Section 3 describes the ivqte command and
its various options, and contains simple applications to illustrate how ivqte can be
used. Appendix A describes somewhat more technical aspects for the estimation of the
asymptotic variance matrices. Appendix B describes the nonparametric estimators used
internally by ivqte and the additional locreg command.
2 Framework, assumptions, and estimators
We consider the effect of a binary treatment variable D on a continuous outcome variable
Y. Let Y_i^1 and Y_i^0 be the potential outcomes of individual i. Hence, Y_i^1 would be realized
if individual i were to receive treatment 1, and Y_i^0 would be realized otherwise. Y_i is
the observed outcome, which is Y_i ≡ Y_i^1 D_i + Y_i^0 (1 − D_i).
In this article, we identify and estimate the entire distribution functions of Y^1 and
Y^0.² Because QTEs are an intuitive way to summarize the distributional impact of a
treatment, we focus our attention especially on them.
We often observe not only the outcome and the treatment variables but also some
characteristics X.³ We can therefore either define the QTEs conditionally on the covariates
or unconditionally. In addition, we have to deal with endogenous treatment
variates or unconditionally. In addition, we have to deal with endogenous treatment
choice. We distinguish between the case where selection is only on observables and the
case where selection is also on unobservables.
2.1 Conditional exogenous QTEs
We start with the standard model for linear quantile regression, which is a model for
conditional effects and where one assumes selection on observables. We assume that Y
is a linear function in X and D.
Assumption 1. Linear model for potential outcomes

Y_i^d = X_i β_τ + d δ_τ + ε_i   and   Q^τ_{ε_i} = 0

for i = 1, . . . , n and d ∈ {0, 1}. Q^τ_{ε_i} refers to the τth quantile of the unobserved random
variable ε_i. β_τ and δ_τ are the unknown parameters of the model. Here δ_τ represents
the conditional QTEs at quantile τ.
2. In the case with endogenous treatment, we identify the potential outcomes only for compliers, as defined later.
3. If we do not observe covariates, then conditional and unconditional QTEs are identical and the
estimators simplify accordingly.
Clearly, this linearity assumption is not sufficient for identification of QTEs because
the observed D_i may be correlated with the error term ε_i. We assume that both D and
X are exogenous.

Assumption 2. Selection on observables with exogenous X

ε ⊥ (D, X)

Assumptions 1 and 2 together imply that Q^τ_{Y|X,D} = Xβ_τ + Dδ_τ, such that we can
recover the unknown parameters of the potential outcomes from the joint distribution
of the observed variables Y, X, and D. The unknown coefficients can thus be estimated
by the classical quantile regression estimator suggested by Koenker and Bassett (1978).
This estimator is defined by

(β̂_τ, δ̂_τ) = arg min_{β,δ} Σ ρ_τ(Y_i − X_i β − D_i δ)    (1)

where ρ_τ(u) = u {τ − 1(u < 0)}. This is a convex linear programming problem
and is solved rather efficiently by the built-in qreg command in Stata. The ivqte
command produces exactly the same point estimates as does qreg. In contrast to qreg,
however, ivqte produces analytical standard errors that are consistent also in the case
of heteroskedasticity.
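For a concrete call (a sketch with hypothetical variable names: y the outcome, d the binary
treatment, x1 and x2 covariates), the coefficient on d from
. qreg y d x1 x2, quantile(.5)
estimates the conditional median QTE δ_.5 under assumptions 1 and 2; ivqte would return
the same point estimate with heteroskedasticity-consistent standard errors.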
To illustrate the similarity to all the following estimators, we could also write the
previous expression as

(β̂_τ, δ̂_τ) = arg min_{β,δ} Σ W_i^{KB} ρ_τ(Y_i − X_i β − D_i δ)

where the weights W_i^{KB} are all equal to one.
2.2 Conditional endogenous QTEs
In many applications, the treatment D is self selected and potentially endogenous. We
may not be able to observe all covariates to make assumption 2 valid. In this case,
the traditional quantile regression estimator will be biased, and we need to use an IV
identification strategy to recover the true effects. We assume that we observe a binary
instrument Z and can therefore define two potential treatments denoted by D^z.⁴ We
use the following IV assumption as in Abadie, Angrist, and Imbens (2002).⁵
4. If the instrument is nonbinary, it must be transformed into a binary variable. See Frölich and Melly
(2008).
5. An alternative approach is given in Chernozhukov and Hansen (2005), who rely on a monotonic-
ity/rank invariance assumption in the outcome equation.
Assumption 3. IV

For almost all values of X,

(Y^0, Y^1, D^0, D^1) ⊥ Z | X
0 < Pr(Z = 1 | X) < 1
E(D^1 | X) ≠ E(D^0 | X)
Pr(D^1 ≥ D^0 | X) = 1
This assumption is well known and requires monotonicity (that is, the nonexistence of defiers) in addition to a conditional independence assumption on the IV. Individuals with $D^1 > D^0$ are referred to as compliers, and treatment effects can be identified only for this group because the always- and never-participants cannot be induced to change treatment status by hypothetical movements of the instrument.
Abadie, Angrist, and Imbens (2002) (AAI) impose assumption 3. Furthermore, they require assumption 1 to hold for the compliers (that is, those observations with $D^1 > D^0$). They show that the conditional QTE, $\delta_\tau$, for the compliers can be estimated consistently by the weighted quantile regression
$$\left(\hat\beta_\tau^{IV}, \hat\delta_\tau^{IV}\right) = \arg\min_{\beta,\delta}\ \sum_i W_i^{AAI}\,\rho_\tau(Y_i - X_i'\beta - D_i\delta) \qquad (2)$$
$$W_i^{AAI} = 1 - \frac{D_i\,(1 - Z_i)}{1 - \Pr(Z = 1 \mid X_i)} - \frac{(1 - D_i)\,Z_i}{\Pr(Z = 1 \mid X_i)}$$
The intuition for these weights can be given in two steps. First, by assumption 3,⁶
$$(Y^0, Y^1, D^0, D^1) \perp Z \mid X \;\Longrightarrow\; (Y^0, Y^1) \perp Z \mid X, D^1 > D^0 \;\Longrightarrow\; (Y^0, Y^1) \perp D \mid X, D^1 > D^0$$
This means that any observed relationship between D and Y has a causal interpretation for compliers. To use this result, we have to find the compliers in the population. This is done in the following average sense by the weights $W_i^{AAI}$:⁷
$$E\left\{W_i^{AAI}\,\rho_\tau(Y_i - X_i'\beta - D_i\delta)\right\} = \Pr(D^1 > D^0)\; E\left\{\rho_\tau(Y_i - X_i'\beta - D_i\delta) \mid D^1 > D^0\right\}$$
Intuitively, this result holds because $W_i^{AAI} = 1$ for the compliers and because
$$E\left(W_i^{AAI} \mid D_i^1 = D_i^0 = 0\right) = E\left(W_i^{AAI} \mid D_i^1 = D_i^0 = 1\right) = 0$$
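To see concretely why the weights vanish in expectation for noncompliers, consider an always-participant; the following short derivation (our own illustration, writing $p(X) = \Pr(Z = 1 \mid X)$) spells out this case, and the never-participant case is symmetric.
$$W_i^{AAI} = 1 - \frac{1 \cdot (1 - Z_i)}{1 - p(X_i)} - \frac{0 \cdot Z_i}{p(X_i)} = 1 - \frac{1 - Z_i}{1 - p(X_i)}$$
$$E\left(W_i^{AAI} \mid X_i, D_i^1 = D_i^0 = 1\right) = 1 - \frac{1 - E(Z_i \mid X_i)}{1 - p(X_i)} = 1 - \frac{1 - p(X_i)}{1 - p(X_i)} = 0$$
because, for an always-participant, $D_i = 1$ whatever the value of $Z_i$, and assumption 3 implies that $Z_i$ is independent of the compliance type conditional on $X_i$.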
A preliminary estimator for $\Pr(Z = 1 \mid X_i)$ is needed to implement this estimator. ivqte uses the local logit estimator described in appendix B.⁸
6. This is the result of lemma 2.1 in Abadie, Angrist, and Imbens (2002).
7. This is a special case of theorem 3.1.a in Abadie (2003).
8. In their original article, Abadie, Angrist, and Imbens (2002) use a series estimator instead of a local estimator as in ivqte. Nevertheless, one can also use series estimation or, in fact, any other method to estimate the propensity score by first generating a variable containing the estimated propensity score and informing ivqte via the phat() option that the propensity-score estimate is supplied by the user.
A problem with estimator (2) is that the optimization problem is not convex because some of the weights are negative while others are positive. Therefore, this estimator has not been implemented. Instead, ivqte implements the AAI estimator with positive weights. Abadie, Angrist, and Imbens (2002) have shown that, as an alternative to $W_i^{AAI}$, one can use the weights
$$W_i^{AAI+} = E\left(W^{AAI} \mid Y_i, D_i, X_i\right) \qquad (3)$$
instead, which are always positive. Because these weights are unknown, ivqte uses local linear regression to estimate $W_i^{AAI+}$; see appendix B. Some of these estimated weights might be negative in finite samples; such weights are then set to zero.⁹
2.3 Unconditional QTEs
The two estimators presented above focused on conditional treatment effects, that is, effects conditional on a set of variables X. We will now consider unconditional QTEs, which have some advantages over the conditional effects. The unconditional QTE (for quantile $\tau$) is given by
$$\Delta^\tau = Q_{Y^1}^\tau - Q_{Y^0}^\tau$$
First, the definition of the unconditional QTE does not change when we change the set of covariates X. Although we aim to estimate the unconditional effect, we still use the covariates X for two reasons. On the one hand, we often need covariates to make the identification assumptions more plausible. On the other hand, covariates can increase efficiency. Therefore, covariates X are included in the first-step regression and then integrated out. However, the definition of the effects is not a function of the covariates. This is an advantage over the conditional QTE, which changes with the set of conditioning variables even if the covariates are not needed to satisfy the selection-on-observables or the IV assumptions.
A very simple example illustrates this advantage. Assume that the treatment D has been completely randomized and is therefore independent both of the potential outcomes and of the covariates. A simple comparison of the distribution of Y in the treated and nontreated populations has a causal interpretation in such a situation. For efficiency reasons, however, we may wish to include covariates in the estimation. If we are interested in mean effects, it is well known that including in a linear regression covariates that are independent of the treatment leaves the estimated treatment effect asymptotically unchanged. This property is lost for QTEs! Including covariates that are independent of the treatment can change the limit of the estimated conditional QTEs. On the other hand, it does not change the unconditional treatment effects if the assumptions of the model are satisfied for both sets of covariates, which is trivially the case in our randomized example.
A second advantage of unconditional effects is that they can be estimated consistently at the $\sqrt{n}$ rate without any parametric restrictions, which is not possible for conditional effects. For the conditional QTEs, we therefore implemented only estimators with a parametric restriction. The following estimators of the unconditional QTE are entirely nonparametric, and we will no longer invoke assumption 1. This is an important advantage because parametric restrictions are often difficult to justify from a theoretical point of view. In addition, assumption 1 restricts the QTE to be the same independently of the value of X. Obviously, interaction terms may be included, but the effects in the entire population are often more interesting than many effects for different covariate combinations.
The interpretation of the unconditional effects is slightly different from the interpretation of the conditional effects, even if the conditional QTE is independent of the value of X. This is because of the definition of the quantile. For instance, if we are interested in a low quantile, the conditional QTE will summarize the effect for individuals with a relatively low Y given their covariates, even if their absolute level of Y is high. The unconditional QTE, on the other hand, will summarize the effect for individuals with a relatively low absolute level of Y.
Finally, the conditional and unconditional QTEs are trivially the same in the absence of covariates. They are also the same if the effect is the same independent of the value of the covariates and of the value of the quantile $\tau$. This is often called the location shift model because the treatment affects only the location of the distribution of the potential outcomes.
2.4 Unconditional endogenous QTEs
We consider first the case of an endogenous treatment with a binary IV Z. This includes the situation with exogenous treatment as a special case when we use Z = D. Frölich and Melly (2008) showed that $\Delta^\tau$ for the compliers is identified under a somewhat weaker version of assumption 3, and they proposed the following estimator:
somewhat weaker version of assumption 3, and they proposed the following estimator:
(
IV
,

IV
) = arg min
,

W
FM
i

(Y
i
D
i
) (4)
W
FM
i
=
Z
i
Pr (Z = 1 |X
i
)
Pr (Z = 1 |X
i
) {1 Pr (Z = 1 |X
i
)}
(2D
i
1)
This is a bivariate quantile regressor estimator with weights. One can easily see that

IV
+

IV
is identied only from the D = 1 observations and that
IV
is identied
only from the D = 0 observations. Therefore, this estimator is equivalent to using
two univariate weighted quantile regressions separately for the D = 1 and the D = 0
observations.
10
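For intuition only, these weights could also be constructed by hand from an estimated instrument propensity score (ivqte computes, trims, and applies them internally). A minimal sketch in Stata, where pz, d, and z are hypothetical variables holding $\widehat{\Pr}(Z = 1 \mid X_i)$, $D_i$, and $Z_i$:
. * hypothetical sketch: the weights of (4) from an estimated instrument propensity score pz
. generate double wfm = (z - pz) / (pz * (1 - pz)) * (2*d - 1)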
There are two differences between (4) and (2): The covariates are not included in the weighted quantile regression in (4), and the weights are different.¹¹
10. The previous expression is numerically identical to $\hat q_{IV} = \arg\min_{q_0} \sum_{i: D_i = 0} W_i^{FM}\,\rho_\tau(Y_i - q_0)$ and $\hat q_{IV} + \hat\Delta_{IV}^\tau = \arg\min_{q_1} \sum_{i: D_i = 1} W_i^{FM}\,\rho_\tau(Y_i - q_1)$, from which we thus obtain $\hat\Delta_{IV}^\tau$ via two univariate quantile regressions.
11. The weights $W_i^{FM}$ were suggested in theorems 3.1.b and 3.1.c of Abadie (2003) for a general purpose. Frölich and Melly (2008) used these weights to estimate unconditional QTEs.
One might think about running a weighted quantile regression of Y on a constant and D by using the weights $W_i^{AAI}$. For that purpose, however, the weights of Abadie, Angrist, and Imbens (2002) are not correct, as shown in Frölich and Melly (2008). This estimator would estimate the difference between the $\tau$th quantile of $Y^1$ for the treated compliers and the $\tau$th quantile of $Y^0$ for the nontreated compliers, which is not meaningful in general. However, the weights $W_i^{AAI}$ could be used to estimate unconditional effects in the special case when the IV is independent of X such that $\Pr(Z = 1 \mid X)$ is not a function of X.
On the other hand, if one is interested in estimating conditional QTEs using a parametric specification, the weights $W_i^{FM}$ could be used as well. Hence, although not developed for this case, the weights $W_i^{FM}$ can be used to identify conditional QTEs. It is not clear whether $W_i^{FM}$ or $W_i^{AAI}$ will be more efficient. For estimating conditional effects, both are inefficient anyway because they do not incorporate the conditional density function of the error term at the quantile.
Intuitively, the difference between the weights $W_i^{AAI}$ and $W_i^{FM}$ can be explained as follows: They both find the compliers in the average sense discussed above. However, only $W_i^{FM}$ simultaneously balances the distribution of the covariates between treated and nontreated compliers. Therefore, $W_i^{AAI}$ can be used only in combination with a conditional model because there is no need to balance covariates in such a case. It can also be used without a conditional model when the treated and nontreated compliers have the same covariate distribution. $W_i^{FM}$, on the other hand, can be used with or without a conditional model.
A preliminary estimator for $\Pr(Z = 1 \mid X_i)$ is needed to implement this estimator. ivqte uses the local logit estimator described in appendix B. The optimization problem (4) is neither convex nor smooth. However, only two parameters have to be estimated. In fact, one can easily show that the estimator can be written as two univariate quantile regressions, which can easily be solved despite the nonsmoothness; see the previous footnotes. This is the way ivqte proceeds when the positive option is not activated.¹²
An alternative to solving this nonconvex problem consists in using the weights
$$W_i^{FM+} = E\left(W^{FM} \mid Y_i, D_i\right) \qquad (5)$$
which are always positive. ivqte estimates these weights by local linear regression if the positive option has been activated. Again, estimated negative weights will be set to zero.¹³
12. More precisely, ivqte solves the convex problem for the distribution function, then monotonizes the estimated distribution function using the method of Chernozhukov, Fernández-Val, and Galichon (2010), and finally inverts it to obtain the quantiles. The parameters chosen in this way solve the first-order conditions of the optimization problem, and therefore, the asymptotic results apply to them.
13. If one is interested in average treatment effects, Frölich (2007b) has proposed an estimator for average treatment effects based on the same set of assumptions. This estimator has been implemented in Stata in the command nplate, which can be downloaded from the websites of the authors of this article.
2.5 Unconditional exogenous QTEs
Finally, we consider the case where the treatment is exogenous, conditional on X. We assume that X contains all confounding variables, which we denote as the selection-on-observables assumption. We also have to assume that the support of the covariates is the same independent of the treatment, because in a nonparametric model, we cannot extrapolate the conditional distribution outside the support of the covariates.
Assumption 4. Selection on observables and common support
$$(Y^0, Y^1) \perp D \mid X$$
$$0 < \Pr(D = 1 \mid X) < 1$$
Assumption 4 identifies the unconditional QTE, as shown in Firpo (2007), Frölich (2007a), and Melly (2006). The estimator of Firpo (2007) is a special case of (4), when D is used as its own instrument. The weighting estimator for $\Delta^\tau$ therefore is
$$\left(\hat q, \hat\Delta^\tau\right) = \arg\min_{q,\Delta}\ \sum_i W_i^{F}\,\rho_\tau(Y_i - q - D_i\Delta) \qquad (6)$$
$$W_i^{F} = \frac{D_i}{\Pr(D = 1 \mid X_i)} + \frac{1 - D_i}{1 - \Pr(D = 1 \mid X_i)}$$
This is a traditional propensity-score weighting estimator, also known as inverse probability weighting. A preliminary estimator for $\Pr(D = 1 \mid X_i)$ is needed to implement this estimator. ivqte uses the local logit estimator described in appendix B.
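For intuition, these inverse-probability weights could likewise be formed by hand from an estimated propensity score; a minimal sketch in Stata, with pd and d as hypothetical variables holding $\widehat{\Pr}(D = 1 \mid X_i)$ and $D_i$ (ivqte does this internally, including the trimming discussed in section 3.3):
. * hypothetical sketch: the weights of (6) from an estimated propensity score pd
. generate double wf = d/pd + (1 - d)/(1 - pd)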
3 The ivqte command
3.1 Syntax
The syntax of ivqte is as follows:
ivqte depvar [indepvars] (treatment [= instrument]) [if] [in] [,
    quantiles(numlist) continuous(varlist) dummy(varlist) unordered(varlist)
    aai linear mata_opt kernel(kernel) bandwidth(#) lambda(#) trim(#)
    positive pbandwidth(#) plambda(#) pkernel(kernel) variance
    vbandwidth(#) vlambda(#) vkernel(kernel) level(#)
    generate_p(newvarname [, replace]) generate_w(newvarname [, replace])
    phat(varname) what(varname)]
3.2 Description
ivqte computes the QTEs of a binary variable using a weighting strategy. This command can estimate both conditional and unconditional QTEs under either exogeneity or endogeneity. The estimator proposed by Frölich and Melly (2008) is used if unconditional QTEs under endogeneity are estimated. The estimator proposed by Abadie, Angrist, and Imbens (2002) is used if conditional QTEs under endogeneity are estimated. The estimator proposed by Firpo (2007) is used if unconditional QTEs under exogeneity are estimated. The estimator proposed by Koenker and Bassett (1978) is used if conditional QTEs under exogeneity are estimated.
The estimator used by ivqte is determined as follows:
If an instrument is provided and aai is not activated, the estimator proposed by Frölich and Melly (2008) is used.
If an instrument is provided and aai is activated, the estimator proposed by Abadie, Angrist, and Imbens (2002) is used.
If there is no instrument and indepvars is empty, the estimator proposed by Firpo (2007) is used.
If there is no instrument and indepvars contains variables, the estimator proposed by Koenker and Bassett (1978) is used.
indepvars contains the list of X variables for the Koenker and Bassett (1978) estimator, that is, for the estimation of exogenous conditional QTEs.¹⁴ For all other estimators, indepvars must remain empty, and the control variables X are to be given in continuous(), unordered(), and dummy(). The instrument or the treatment variable is assumed to satisfy the exclusion restriction conditionally on these variables.
The IV has to be provided as a binary variable, taking only the values 0 and 1. If the original IV takes different values, it first has to be transformed to a binary variable. If the original IV is one-dimensional, one may use the endpoints of its support and discard the other observations. If one has several discrete IVs, one would use only those two combinations that maximize and minimize the treatment probability Pr(D = 1|Z = z) and code these two values as 0 and 1. For more details on how to transform several nonbinary IVs to this binary case, see Frölich and Melly (2008).
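As an illustration, with a one-dimensional discrete instrument this endpoint recoding could be done as follows (a minimal sketch with a hypothetical instrument variable z; observations with intermediate values are dropped):
. * hypothetical sketch: recode the endpoints of the support of z to a binary instrument
. summarize z
. generate byte z01 = .
. replace z01 = 0 if z == r(min)
. replace z01 = 1 if z == r(max)
. drop if missing(z01)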
The estimation of all nonparametric functions is described in detail in appendix B. A mixed kernel as suggested by Racine and Li (2004) is used to smooth over the continuous and categorical data. The more conventional approach of estimating the regression plane inside each cell defined by the discrete variables can be followed by setting lambda() to 0. The propensity score is estimated by default by a local logit estimator. A local linear estimator is used if the linear option is selected. Two algorithms are available to maximize the local logistic likelihood function. The default is a simple Gauss-Newton algorithm written for this purpose. If you select the mata_opt option, the official Stata 10 optimizer is used. We expect the official optimizer to be more stable in difficult situations. However, it can be used only if you have Stata 10 or more recent versions.
The ivqte command also requires the packages moremata (Jann 2005b) and kdens (Jann 2005a).
14. For the Koenker and Bassett (1978) estimator, the options continuous(), unordered(), and dummy() are not permitted.
3.3 Options
Model
quantiles(numlist) specifies the quantiles at which the effects are estimated and should contain numbers between 0 and 1. The computational time needed to estimate an additional quantile is very short compared with the time needed to estimate the preliminary nonparametric regressions. When conditional QTEs are estimated, only one quantile may be specified. If one is interested in several QTEs, then one can save the estimated weights for later use by using the generate_w() option. By default, quantiles() is set to 0.5 when conditional QTEs are estimated, and quantiles() contains the nine deciles from 0.1 to 0.9 when unconditional QTEs are estimated.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous. For all estimators except Koenker and Bassett (1978), the X variables
should be given here and not in indepvars. For the Koenker and Bassett (1978)
estimator, on the other hand, these options are not permitted and the X variables
must be given in indepvars.
aai selects the Abadie, Angrist, and Imbens (2002) estimator.
With the exception of the Koenker and Bassett (1978) estimator, several further options are needed to control the estimation of the nonparametric components. First, we need to estimate some kind of propensity score. For the Firpo (2007) estimator, we need to estimate Pr(D = 1|X). For the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators, we need to estimate Pr(Z = 1|X), which we also call a propensity score in the following discussion. These propensity scores are then used to calculate the weights $W_i^{F}$, $W_i^{AAI}$, and $W_i^{FM}$, respectively, as defined in section 2.¹⁵ The QTEs are estimated using these weights after applying some trimming to eliminate observations with very large weights. The amount of trimming is controlled by trim(), as explained below. This is the way the Firpo (2007) estimator is implemented.
For the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators, more comments are required. First, the Abadie, Angrist, and Imbens (2002) estimator is implemented only with the positive weights $W_i^{AAI+}$.
15. The weights $W_i^{KB}$ used in the Koenker and Bassett (1978) estimator are always equal to one.
Hence, when the Abadie, Angrist, and Imbens (2002) estimator is activated via the aai option, the propensity score is first estimated to calculate the weights $W_i^{AAI}$, which are then automatically projected via nonparametric regression to obtain $W_i^{AAI+}$. This last nonparametric regression to obtain the positive weights is controlled by the options pkernel(), pbandwidth(), and plambda(), which are explained below. The letter p in front of these options stresses that they are used to obtain the positive weights.
Finally, the Frölich and Melly (2008) estimator is implemented in two ways. We can either use the weights $W_i^{FM}$ after having trimmed very large weights, or alternatively, we can project these weights and then use $W_i^{FM+}$ to estimate the QTEs. If one wants to pursue this second implementation, one has to activate the positive option and specify pkernel() and pbandwidth() to control the projection of $W_i^{FM}$ that yields the positive weights $W_i^{FM+}$.
Estimation of the propensity score
linear selects the method used to estimate the instrument propensity score. If this
option is not activated, the local logit estimator is used. If linear is activated, the
local linear estimator is used.
mata_opt selects the official optimizer introduced in Stata 10 to estimate the local logit, Mata's optimize(). The default is a simple Gauss-Newton algorithm written for this purpose. This option is only relevant when the linear option has not been selected.
kernel(kernel) specifies the kernel function used to estimate the propensity score. kernel may be any of the following second-order kernels: epan2 (Epanechnikov kernel function, the default), biweight (biweight kernel function), triweight (triweight kernel function), cosine (cosine trace), gaussian (Gaussian kernel function), parzen (Parzen kernel function), rectangle (rectangle kernel function), or triangle (triangle kernel function). In addition to these second-order kernels, there are also several higher-order kernels: epanechnikov_o4 (Epanechnikov order 4), epanechnikov_o6 (order 6), gaussian_o4 (Gaussian order 4), gaussian_o6 (order 6), gaussian_o8 (order 8). By default, epan2, which specifies the Epanechnikov kernel, is used.¹⁶
16. Here are the formulas for these kernel functions for Epanechnikov of order 4 and 6, respectively:
$$K(z) = \left(\frac{15}{8} - \frac{35}{8}z^2\right)\frac{3}{4}\left(1 - z^2\right)1(|z| < 1)$$
$$K(z) = \left(\frac{175}{64} - \frac{525}{32}z^2 + \frac{5775}{320}z^4\right)\frac{3}{4}\left(1 - z^2\right)1(|z| < 1)$$
And here are the formulas for Gaussian of order 4, 6, and 8, respectively:
$$K(z) = \frac{1}{2}\left(3 - z^2\right)\phi(z)$$
$$K(z) = \frac{1}{8}\left(15 - 10z^2 + z^4\right)\phi(z)$$
$$K(z) = \frac{1}{48}\left(105 - 105z^2 + 21z^4 - z^6\right)\phi(z)$$
bandwidth(#) sets the bandwidth h used to smooth over the continuous variables in the estimation of the propensity score. The continuous regressors are first orthogonalized such that their covariance matrix is the identity matrix. The bandwidth must be strictly positive. If the bandwidth h is missing, an infinite bandwidth $h = \infty$ is used. The default value is infinity. If the bandwidth h is infinity and the parameter $\lambda$ is one, a global model (linear or logit) is estimated without any local smoothing. The cross-validation procedure implemented in locreg can be used to guide the choice of the bandwidth. Because the optimal bandwidth converges at a faster rate than the cross-validated bandwidth, the robustness of the results with respect to a smaller bandwidth should be examined.
lambda(#) sets the $\lambda$ used to smooth over the dummy and unordered discrete variables in the estimation of the propensity score. It must be between 0 and 1. A value of 0 implies that only observations within the cell defined by all discrete regressors are used. The default is lambda(1), which corresponds to global smoothing. If the bandwidth h is infinity and $\lambda = 1$, a global model (linear or logit) is estimated without any local smoothing. The cross-validation procedure implemented in locreg can be used to guide the choice of lambda. Again, the robustness of the results with respect to a smaller smoothing parameter should be examined.
Estimation of the weights
trim(#) controls the amount of trimming. All observations with an estimated propensity score less than trim() or greater than 1 − trim() are trimmed and not used further by the estimation procedure. This prevents giving very high weights to single observations. The default is trim(0.001). This option is not useful for the Koenker and Bassett (1978) estimator, where no propensity score is estimated.
positive is used only with the Frölich and Melly (2008) estimator. If it is activated, the positive weights $W_i^{FM+}$ defined in (5) are estimated by the projection of the weights $W^{FM}$ on the dependent and the treatment variable. The weights $W^{FM+}$ are estimated by nonparametric regression on Y, separately for the D = 1 and the D = 0 samples. After the estimation, negative estimated weights in $\widehat{W}_i^{FM+}$ are set to zero.
pbandwidth(#), plambda(#), and pkernel(kernel) are used to calculate the positive weights. These options are useful only for the Abadie, Angrist, and Imbens (2002) estimator, which can be activated via the aai option, and for the Frölich and Melly (2008) estimator, but only when the positive option has been activated to estimate $W^{FM+}$. pkernel(), pbandwidth(), and plambda() are defined similarly to kernel(), bandwidth(), and lambda(). When pkernel(), pbandwidth(), and plambda() are not specified, the values given in kernel(), bandwidth(), and lambda() are taken as defaults.
The positive weights are always estimated by local linear regression. After estimation, negative estimated weights are set to zero. The smoothing parameters pbandwidth() and plambda() are in principle as important as the other smoothing parameters bandwidth() and lambda(), and it is worth inspecting the robustness of the results with respect to these parameters. Cross-validation can also be used to guide these choices.
Inference
variance activates the estimation of the variance. By default, no standard errors are estimated because the estimation of the variance can be computationally demanding. Except for the classical linear quantile regression estimator, it requires the estimation of many nonparametric functions. This option should not be activated if you bootstrap the results, unless you bootstrap t-values to exploit possible asymptotic refinements.
vbandwidth(#), vlambda(#), and vkernel(kernel) are used to calculate the variance if the variance option has been selected. They are defined similarly to bandwidth(), lambda(), and kernel(). They are used only to estimate the variance. A quick-and-dirty estimate of the variance can be obtained by setting vbandwidth() to infinity and vlambda() to 1, which is much faster than any other choice. When vkernel(), vbandwidth(), or vlambda() is not specified, the values given in kernel(), bandwidth(), and lambda() are taken as defaults.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.
Saved propensity scores and weights
generate_p(newvarname [, replace]) generates newvarname containing the estimated propensity score. Remember that the propensity score is Pr(Z = 1|X) for the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators and is Pr(D = 1|X) for the Firpo (2007) estimator. This may be useful if one wants to compare the results with and without the projection of the weights or to compare the conditional and unconditional QTEs under endogeneity. One can first estimate the QTEs using one method and save the propensity score in the variable newvarname. In the second step, one can use the already estimated propensity score as an input in the phat() option. The replace option allows ivqte to overwrite an existing variable or to create a new one where none exists.
generate_w(newvarname [, replace]) generates newvarname containing the estimated weights. This may be useful if you want to estimate several conditional QTEs. The weights must be estimated only once and can then be given as an input in the what() option. The replace option allows ivqte to overwrite an existing variable or to create a new one where none exists.
phat(varname) gives the name of an existing variable containing the estimated instrument propensity score. The propensity score may have been estimated using ivqte or with any other command such as a series estimator.
what(varname) gives the name of an existing variable containing the estimated weights.
The weights may have been estimated using ivqte or with any other command such
as a series estimator.
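As an illustration of this workflow (anticipating the card.dta example of section 3.5; whether the covariate options must be repeated when what() is supplied is our reading of the syntax above), the weights could be estimated once and reused for a second quantile:
. * sketch: estimate the AAI weights once, then reuse them for another quantile
. ivqte lwage (college=nearc4), aai quantiles(0.25) dummy(black)
>      continuous(exper motheduc) unordered(region) generate_w(w_aai)
. ivqte lwage (college=nearc4), aai quantiles(0.75) dummy(black)
>      continuous(exper motheduc) unordered(region) what(w_aai)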
3.4 Saved results
ivqte saves the following in e():
Scalars
e(N) number of observations
e(bandwidth) bandwidth
e(lambda) lambda
e(pbandwidth) pbandwidth
e(plambda) plambda
e(vbandwidth) vbandwidth
e(vlambda) vlambda
e(pseudo_r2) pseudo-R² of the quantile regression
e(compliers) proportion of compliers
e(trimmed) number of observations trimmed
Macros
e(command) ivqte
e(depvar) name of dependent variable
e(treatment) name of treatment variable
e(instrument) name of IV
e(continuous) name of continuous covariates
e(dummy) name of binary covariates
e(regressors) name of regressors (conditional QTEs)
e(unordered) name of unordered covariates
e(estimator) name of estimator
e(ps method) linear or logistic model
e(optimization) algorithm used
e(kernel) kernel function
e(pkernel) kernel function for positive weights
e(vkernel) kernel function for variance estimation
Matrices
e(b) row vector containing the QTEs
e(quantiles) row vector containing the quantiles at which the QTEs have been estimated
e(V) matrix containing the variances of the estimated QTEs in the diagonal
and 0 otherwise
Functions
e(sample) marks the estimation sample
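For instance, after an ivqte call one could inspect some of these results with standard Stata commands (a hedged illustration; e(compliers) is presumably filled only when an instrument is supplied):
. * sketch: inspect the saved results after estimation
. matrix list e(b)
. matrix list e(quantiles)
. display "proportion of compliers = " e(compliers)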
3.5 Simple examples (without local smoothing)
Having given the syntax for ivqte in a previous subsection, we now illustrate how the command can be used with some very simple examples. In particular, we defer the use of smoothing parameters (h, $\lambda$) to the next subsection to keep things simple here. This means that all regressions will be estimated parametrically because the default values are $h = \infty$ and $\lambda = 1$.
We use the distance-to-college dataset card.dta.¹⁷ The aim is to estimate the effect of having a college degree (college) on log wages (lwage), controlling for parental education, experience, race, and region. A potential instrument is living near a four-year college (nearc4). The control variables are experience, exper (continuous variable); mother's education, motheduc (ordered discrete); region (unordered discrete); and black (dummy).
We first consider the quantile regression estimator with exogenous regressors for the first decile. As mentioned, this estimator is already implemented in Stata with the qreg command:
. use card
. qreg lwage college exper black motheduc reg662 reg663 reg664 reg665 reg666
> reg667 reg668 reg669, quantile(0.1)
(output omitted )
The syntax of the ivqte command is similar, with the exception that the treatment variable has to be included in parentheses after all other regressors:¹⁸
. ivqte lwage exper black motheduc reg662 reg663 reg664 reg665 reg666 reg667
> reg668 reg669 (college), quantiles(0.1) variance
(output omitted )
The point estimates are exactly identical because ivqte calls qreg, but the standard errors differ. We recommend using the standard errors of ivqte because they are robust against heteroskedasticity and other forms of dependence between the residuals and the regressors.
We may be concerned that having a college degree might be endogenous and consider
using the proximity of a four-year college as an instrument. The proximity of a
four-year college is a binary variable, taking the value 1 if a college was close by. If
we are interested in the conditional QTE, we can apply the estimator suggested by
Abadie, Angrist, and Imbens (2002), as follows:
. ivqte lwage (college=nearc4), quantiles(0.1) variance dummy(black)
> continuous(exper motheduc) unordered(region) aai
(output omitted )
There are three differences compared with the previous syntax: First, the instrument has to be given in parentheses after the treatment variable and the equal sign, that is, (college=nearc4). Second, the control variables X are to be given in the corresponding options, dummy(), continuous(), and unordered(), because they are used not only to define the conditional QTE but also in the nonparametric estimation of the weights. Third, the aai option must be activated. region enters here as a single unordered
17. This dataset is available for download from Stata together with the programs described in the present article. The description of the variables can be found in Card (1995).
18. For this case of exogenous conditional QTEs, it is in principle arbitrary which variable is defined as the treatment variable because the coefficients are estimated for all regressors. In addition, nonbinary treatments are permitted here.
discrete variable, which is expanded by ivqte to eight regional dummy variables in the
parametric model.
The two examples discussed so far refer to the conditional treatment effect of a college degree. We might be more interested in the unconditional QTE, which we examine in the following example. Consider first the case where the college degree is exogenous conditional on X. The weighting estimator of Firpo (2007) is implemented by ivqte. We are interested in the nine decile treatment effects with this estimator:
. ivqte lwage (college), variance dummy(black) continuous(exper motheduc)
> unordered(region)
(output omitted )
Only the treatment is given in parentheses, and the aai option is no longer activated. Finally, to estimate unconditional QTEs with an endogenous treatment, the estimator of Frölich and Melly (2008) is implemented in ivqte. The only difference from the previous syntax is that the instrument (nearc4) now has to be given after the treatment variable:
. ivqte lwage (college = nearc4), variance dummy(black)
> continuous(exper motheduc) unordered(region)
(output omitted )
By default, the weights defined in (4) are used. If the positive option is activated, the positive weights (5) are estimated and used:
. ivqte lwage (college = nearc4), variance dummy(black)
> continuous(exper motheduc) unordered(region) positive
(output omitted )
If no control variables are included, then ivqte lwage (college = nearc4), aai
and ivqte lwage (college = nearc4), positive produce the same results.
3.6 Advanced examples (with local smoothing)
In the examples given above, we have not used the smoothing options. Therefore,
by default, parametric regressions have been used to estimate all functions. In an
application, we should use smoothing parameters converging to 0, unless we have strong
reasons to believe that we do know the true functional forms. Appendix B contains many
details about the nonparametric estimation of functions. We illustrate here the use of
these techniques for ivqte.
We use card.dta and keep only 500 randomly sampled observations to reduce computation time. Because of missing values on covariates, eventually only 394 observations are retained in the estimation. The aim is to estimate the effect of having a college degree (college) on log wages (lwage). A potential instrument is living near a four-year college (nearc4). For ease of presentation, we use only experience (exper) as a continuous control variable here. The other control variables are region (unordered()) and black (dummy()).
Depending on the estimator, up to three functions have to be estimated nonparametrically. Three sets of options correspond to these three functions. The options kernel(), bandwidth(), and lambda() determine the kernel and the parameters h and $\lambda$ used for the estimation of the propensity score. The propensity score corresponds to Pr(Z = 1|X) for the Abadie, Angrist, and Imbens (2002) and Frölich and Melly (2008) estimators and to Pr(D = 1|X) for the Firpo (2007) estimator.
The options pkernel(), pbandwidth(), and plambda() determine the kernel and smoothing parameters used for the estimation of the positive weights defined in (3) for the estimator of Abadie, Angrist, and Imbens (2002). If the Frölich and Melly (2008) estimator is to be used and the positive option has been activated, pkernel() and pbandwidth() are used to estimate the positive weights (5).
Finally, the options vkernel(), vbandwidth(), and vlambda() are used for the estimation of the variances of the estimators of Abadie, Angrist, and Imbens (2002), Firpo (2007), and Frölich and Melly (2008).
A general finding in the literature is that the choice of the kernel functions, kernel(), pkernel(), and vkernel(), is rarely crucial. The options vbandwidth() and vlambda() are used only for the estimation of the variance. Therefore, during the exploratory analysis, it may make sense to reduce the computational time by setting vbandwidth() to infinity and vlambda() to one, that is, a parametric model. For the final set of estimates, it often makes sense to set vbandwidth() equal to bandwidth() and vlambda() equal to lambda(). This is done by default. In the following illustration, we show how the cross-validation procedure implemented in locreg can be used to guide the choice of the important smoothing parameters.
We start with the estimator proposed by Firpo (2007). We do not need to use the options pkernel(), pbandwidth(), and plambda() because the weights are always positive by definition. We choose h = 2 and $\lambda$ = 0.8 for the estimation of the propensity score. In addition, we use $h = \infty$ and $\lambda = 1$ for the estimation of the variance. We use the default Epanechnikov kernel in all cases.
. use card, clear
. set seed 123
. sample 500, count
(2510 observations deleted)
. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(2) lambda(0.8) variance vbandwidth(.) vlambda(1)
(output omitted )
Of course, these choices of the smoothing parameters are arbitrary. One can use
the cross-validation option of locreg to choose the smoothing parameters. When we
use the Firpo (2007) estimator, we know that the options bandwidth() and lambda()
are used to estimate Pr (D = 1 |X). Therefore, we can select the smoothing parameters
from a grid of, say, four possible values, as follows. We use the logit option because
D is a binary variable.
. locreg college, dummy(black) continuous(exper) unordered(region)
> bandwidth(1 2) lambda(0.8 1) logit
(output omitted )
The scalars r(optb) and r(optl) indicate that the choices h = 1 and $\lambda = 1$ minimize the cross-validation criterion. We use the 2 × 2 search grid only for ease of exposition. Usually, one would search within a much larger grid. We can now obtain the point estimate using this choice of the smoothing parameters:
. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(1) lambda(1)
(output omitted )
In addition to the values suggested by cross-validation, the user is encouraged to also try other, especially smaller, smoothing parameters and examine the robustness of the final results. For instance, we examine the results with h = 0.5 and $\lambda$ = 0.5.
. ivqte lwage (college), quantiles(0.5) dummy(black) continuous(exper)
> unordered(region) bandwidth(0.5) lambda(0.5)
(output omitted )
In this case, the results are relatively stable.
When we use the estimator of Abadie, Angrist, and Imbens (2002), we have to additionally specify pbandwidth() and plambda() for estimating the positive weights. We proceed by first choosing values for bandwidth() and lambda() and thereafter choosing values for pbandwidth() and plambda(). We know that the options bandwidth() and lambda() are used to estimate Pr(Z = 1|X). Therefore, we can select the smoothing parameters from a grid of four possible values, as follows. Again we use the logit option because Z is a binary variable.
. locreg nearc4, dummy(black) continuous(exper) unordered(region)
> bandwidth(0.5 0.8) lambda(0.8 1) generate(ps) logit
(output omitted )
The optimal smoothing parameters are h = 0.8 and $\lambda$ = 0.8. The generate(ps) option implies that the fitted values of Pr(Z = 1|X) are saved in the variable ps. These fitted values are generated using the optimal bandwidth; that is, they are generated after the cross-validation has selected the optimal bandwidth.
In the next step, we need to select bandwidths for pbandwidth() and plambda(). We know from (2) and (3) that pbandwidth() and plambda() are used to estimate
$$E\left(W_i^{AAI} \mid Y_i, D_i, X_i\right) = E\left\{1 - \frac{D_i\,(1 - Z_i)}{1 - \Pr(Z = 1 \mid X_i)} - \frac{(1 - D_i)\,Z_i}{\Pr(Z = 1 \mid X_i)} \;\middle|\; Y_i, D_i, X_i\right\}$$
We first generate $W_i^{AAI}$:
. generate waai=1-college*(1-nearc4)/(1-ps)-(1-college)*nearc4/ps
Then we can use locreg to find the optimal parameters. The positive weights are obtained by a nonparametric regression of $W^{AAI}$ on X, Y, and D. This is implemented in ivqte via two separate regressions: one nonparametric regression of $W^{AAI}$ on X and Y for the D = 1 subsample and one separate nonparametric regression of $W^{AAI}$ on X and Y for the D = 0 subsample. We proceed in the same way here by adding Y, which in our example above is lwage, as a continuous regressor and running separate regressions for the college==1 and the college==0 subsamples:
. locreg waai if college==1, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )
. locreg waai if college==0, dummy(black) continuous(exper lwage)
> unordered(region) bandwidth(0.5 0.8) lambda(0.8 1)
(output omitted )
In the first case (that is, for the college==1 subsample), the optimal smoothing parameters are h = 0.8 and $\lambda$ = 1. For the college==0 subsample, the optimal smoothing parameters are h = 0.8 and $\lambda$ = 0.8. The current implementation of ivqte permits only one value of h and $\lambda$ in the options pbandwidth() and plambda() so as not to overburden the user with choosing nuisance parameters. If the suggested values for h and $\lambda$ are different for the college==1 and the college==0 subsamples, we recommend choosing the smaller of these values but also examining the robustness of the results to the other values. We suggest using the smaller bandwidth because the inference provided by ivqte is based on the asymptotic formula given in (7) (see appendix A.2), which contains only a variance but no bias term. To increase the accuracy of the inference based on (7), one would prefer bandwidth choices that lead to smaller biases.
In our example, we choose pbandwidth(0.8), which was suggested by cross-validation in both the college==1 and the college==0 subsamples, and plambda(0.8), which is the smaller value of $\lambda$. With these bandwidth choices, we obtain the final estimates:
. ivqte lwage (college=nearc4), aai quantiles(0.5) dummy(black)
> continuous(age fatheduc motheduc) unordered(region) bandwidth(0.8) lambda(0.8)
> pbandwidth(0.8) plambda(0.8) variance
(output omitted )
By using the variance option without specifying vbandwidth() or vlambda(), the values given for bandwidth() and lambda() are used as defaults for vbandwidth() and vlambda(). Alternatively, we could have specified different values for vbandwidth() and vlambda(). In an exploratory analysis, one could use, for example, $h = \infty$ and $\lambda = 1$, which are certainly nonoptimal choices but reduce computation time considerably.
. ivqte lwage (college=nearc4), aai quantiles(0.5) dummy(black)
> continuous(age fatheduc motheduc) unordered(region) bandwidth(0.8) lambda(0.8)
> pbandwidth(0.8) plambda(0.8) variance vbandwidth(.) vlambda(1)
(output omitted )
3.7 Replication of results of AAI
In the last illustration of the ivqte command, we replicate tables II-A and III-A of Abadie, Angrist, and Imbens (2002). jtpa.dta contains their dataset for males. We can replicate the point estimates of table II-A with the official Stata qreg command:
. use jtpa, clear
. global reg "highschool black hispanic married part time classroom
> OJT JSA age5 age4 age3 age2 age1 second follow"
. qreg earnings treatment $reg, quantile(0.5)
(output omitted )
The same point estimates can also be obtained using ivqte:
. ivqte earnings $reg (treatment), quantiles(0.5) variance
(output omitted )
The standard errors are still different from the standard errors in the published article because Abadie, Angrist, and Imbens (2002) have used a somewhat unconventional bandwidth. We can replicate their standard errors of table II-A by activating the aai option.¹⁹ The following command calculates the results for the median.
. ivqte earnings (treatment=treatment), quantiles(0.5) dummy($reg) variance
> aai
(output omitted )
Now we attempt to replicate the results of table III-A. Using a bandwidth of 2,
. ivqte earnings (treatment=assignment), quantiles(0.5) dummy($reg) variance
> aai pbandwidth(2)
(output omitted )
19. In this command, we use the fact that the estimator of Abadie, Angrist, and Imbens (2002) simplifies to the standard quantile regression estimator when the treatment is used as its own instrument. Similar relationships exist for the estimator proposed by Frölich and Melly (2008) and are discussed in their article.
gives results that are slightly different from their table III-A. We cannot exactly replicate their results because they have used series estimators to estimate the nonparametric components of their estimator and because they have exploited the fact that the instrument was completely randomized.²⁰ In the following commands, we show how ivqte with the options phat(varname) and what(varname) can be used to replicate their results. The parameters and standard errors are then almost identical to the original results.²¹
Abadie, Angrist, and Imbens (2002) first note that the instrument assignment has been fully randomized. Therefore, they estimate Pr(Z = 1|X) by the sample mean of Z. Then the weights $W_i^{AAI}$, which can be negative as well as positive, can be generated:
. summarize assignment
(output omitted )
. generate pi=r(mean)
. generate kappa=1-treatment*(1-assignment)/(1-pi)-(1-treatment)*assignment/pi
In a second step, the positive weights $W_i^{AAI+}$ are estimated by a linear regression of $W_i^{AAI}$ on a polynomial of order 5 in Y and D:
. forvalues i=1/5 {
  2. generate e`i'=earnings^`i'
  3. generate de`i'=e`i'*treatment
  4. }
. regress kappa earnings treatment e2 e3 e4 e5 de1 de2 de3 de4 de5
(output omitted )
. predict kappa_pos
(option xb assumed; fitted values)
. ivqte earnings (treatment=assignment) if kappa_pos>0, dummy($reg)
> quantiles(0.5) variance aai what(kappa_pos) phat(pi)
(output omitted )
which gives almost the same estimates and standard errors as their table III-A.
4 Acknowledgments
We would like to thank Ben Jann, Andreas Landmann, Robert Poppe, and an anonymous referee for helpful comments.
20. There are no strong reasons to prefer series estimators to local nonparametric estimators. We use a series estimator here only to show that we can replicate their results. Actually, Frölich and Melly (2008) require, in some sense, weaker regularity assumptions for the local estimator than what was required for the existing series estimators.
21. A small difference still remains, which is due to differences in the implementation of the estimator of H(X). With slight adaptations, which would restrict the generality of the estimator, we could replicate their results exactly.
5 References
Abadie, A. 2003. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113: 231–263.
Abadie, A., J. Angrist, and G. Imbens. 2002. Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica 70: 91–117.
Card, D. E. 1995. Using geographic variation in college proximity to estimate the return to schooling. In Aspects of Labour Economics: Essays in Honour of John Vanderkamp, ed. L. Christofides, E. K. Grant, and R. Swindinsky. Toronto, Canada: University of Toronto Press.
Chernozhukov, V., I. Fernández-Val, and A. Galichon. 2010. Quantile and probability curves without crossing. Econometrica 78: 1093–1125.
Chernozhukov, V., and C. Hansen. 2005. An IV model of quantile treatment effects. Econometrica 73: 245–261.
Firpo, S. 2007. Efficient semiparametric estimation of quantile treatment effects. Econometrica 75: 259–276.
Frölich, M. 2007a. Propensity score matching without conditional independence assumption – with an application to the gender wage gap in the United Kingdom. Econometrics Journal 10: 359–407.
Frölich, M. 2007b. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics 139: 35–75.
Frölich, M., and B. Melly. 2008. Unconditional quantile treatment effects under endogeneity. Discussion Paper No. 3288, Institute for the Study of Labor (IZA). http://ideas.repec.org/p/iza/izadps/dp3288.html.
Hall, P., and S. J. Sheather. 1988. On the distribution of a studentized quantile. Journal of the Royal Statistical Society, Series B 50: 381–391.
Jann, B. 2005a. kdens: Stata module for univariate kernel density estimation. Statistical Software Components S456410, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456410.html.
Jann, B. 2005b. moremata: Stata module (Mata) to provide various Mata functions. Statistical Software Components S455001, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s455001.html.
Koenker, R. 2005. Quantile Regression. New York: Cambridge University Press.
Koenker, R., and G. Bassett Jr. 1978. Regression quantiles. Econometrica 46: 33–50.
Melly, B. 2006. Estimation of counterfactual distributions using quantile regression. Discussion paper, Universität St. Gallen. http://www.alexandria.unisg.ch/Publikationen/22644.
Powell, J. L. 1986. Censored regression quantiles. Journal of Econometrics 32: 143–155.
Racine, J., and Q. Li. 2004. Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics 119: 99–130.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
About the authors
Markus Frölich is a full professor of econometrics at the University of Mannheim and is Program Director for Employment and Development at the Institute for the Study of Labor. His research interests include policy evaluation, microeconometrics, labor economics, and development economics.
Blaise Melly is an assistant professor of economics at Brown University. He specializes in microeconometrics and applied labor economics and has special interests in the effects of policies on the distribution of outcome variables.
A Variance estimation
In this section, we describe the analytical variance estimators implemented in ivqte. The bootstrap represents an alternative and can be implemented in Stata by using the bootstrap prefix. The validity of the bootstrap has been proven for standard quantile regression but not yet for the other estimators, although it seems likely that it is valid.
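For example, because ivqte posts its estimates in e(b), the usual prefix syntax applies; a minimal sketch using the card.dta example of section 3.5 (the number of replications is arbitrary, and the variance option is left off, as recommended in section 3.3):
. * sketch: bootstrapped standard errors for the unconditional endogenous QTEs
. bootstrap _b, reps(200) seed(123): ivqte lwage (college = nearc4),
>      quantiles(0.25 0.5 0.75) dummy(black) continuous(exper motheduc) unordered(region)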
A.1 Conditional exogenous QTEs
Let $\tilde X = (D, X')'$ and $\theta_\tau = (\delta_\tau, \beta_\tau')'$. The asymptotic distribution of the quantile regression estimator defined in (1) is given by²²
$$\sqrt{n}\left(\hat\theta_\tau - \theta_\tau\right) \xrightarrow{d} N\left(0,\; J_\tau^{-1}\,\Sigma_\tau\,J_\tau^{-1}\right)$$
where $J_\tau = E\left\{f_{Y|\tilde X}(\tilde X'\theta_\tau)\,\tilde X\tilde X'\right\}$ and $\Sigma_\tau = \tau(1 - \tau)\,E(\tilde X\tilde X')$. The term $\Sigma_\tau$ is straightforward to estimate by $\tau(1 - \tau)\,n^{-1}\sum_i \tilde X_i\tilde X_i'$. We estimate $J_\tau$ by the kernel method of Powell (1986),
$$\hat J_\tau = \frac{1}{n h_n}\sum_i k\left(\frac{Y_i - \tilde X_i'\hat\theta_\tau}{h_n}\right)\tilde X_i\tilde X_i'$$
22. See, for example, Koenker (2005).
where k is a univariate kernel function and $h_n$ is a bandwidth sequence. In the actual implementation, we use a normal kernel and the bandwidth suggested by Hall and Sheather (1988),
$$h_n = n^{-1/3}\,\Phi^{-1}(1 - \text{level}/2)^{2/3}\left[\frac{1.5\,\{\phi(\Phi^{-1}(\tau))\}^2}{2\,\{\Phi^{-1}(\tau)\}^2 + 1}\right]^{1/3}$$
where level is the significance level used for the intended confidence intervals, and $\phi$ and $\Phi$ are the normal density and distribution functions, respectively. This estimator of the asymptotic variance is consistent under heteroskedasticity, which is in contrast to the official Stata command for quantile regression, qreg. This is important because quantile regression becomes interesting only when the errors are not independent and identically distributed.
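As a numerical illustration (our own check, taking level = 0.05, $\tau$ = 0.5, and n = 1,000), the bandwidth can be evaluated directly in Stata:
. * sketch: Hall-Sheather bandwidth for tau = 0.5, n = 1000, level = 0.05
. display 1000^(-1/3)*invnormal(1-0.05/2)^(2/3)*(1.5*normalden(invnormal(0.5))^2/(2*invnormal(0.5)^2+1))^(1/3)
which gives a bandwidth of roughly 0.097.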
A.2 Conditional endogenous QTEs
The asymptotic distribution of the IV quantile regression estimator defined in (2) is given by
$$\sqrt{n}\left(\hat\theta_\tau^{IV} - \theta_\tau\right) \xrightarrow{d} N\left(0,\; I_\tau^{-1}\,\Omega_\tau\,I_\tau^{-1}\right) \qquad (7)$$
where $I_\tau = E\left\{f_{Y|\tilde X, D^1 > D^0}(\tilde X'\theta_\tau)\,\tilde X\tilde X' \mid D^1 > D^0\right\}\Pr(D^1 > D^0)$ and $\Omega_\tau = E(\psi\,\psi')$ with $\psi = W^{AAI}\,m_\tau(\tilde X, Y) + H(X)\,\{Z - \Pr(Z = 1 \mid X)\}$ and
$$m_\tau(\tilde X, Y) = \{\tau - 1(Y - \tilde X'\theta_\tau < 0)\}\,\tilde X$$
$$H(X) = E\left[m_\tau(\tilde X, Y)\left\{-\frac{D\,(1 - Z)}{\{1 - \Pr(Z = 1 \mid X)\}^2} + \frac{(1 - D)\,Z}{\Pr(Z = 1 \mid X)^2}\right\} \;\middle|\; X\right]$$
We estimate these elements as
$$\hat I_\tau = \frac{1}{n h_n}\sum_i \widehat{W}_i^{AAI+}\,k\left(\frac{Y_i - \tilde X_i'\hat\theta_\tau^{IV}}{h_n}\right)\tilde X_i\tilde X_i'$$
where $\widehat{W}_i^{AAI+}$ are estimates of the projected weights. For the kernel function in the previous regression, we use an Epanechnikov kernel and $h_n = n^{-0.2}\sqrt{\widehat{\text{Var}}(Y_i - \tilde X_i'\hat\theta_\tau^{IV})}$, as proposed by Abadie, Angrist, and Imbens (2002).²³ Furthermore, $\hat H(X_i)$ is estimated by the local linear regression of
$$\{\tau - 1(Y_i - \tilde X_i'\hat\theta_\tau^{IV} < 0)\}\,\tilde X_i\left[-\frac{D_i\,(1 - Z_i)}{\{1 - \widehat{\Pr}(Z = 1 \mid X_i)\}^2} + \frac{(1 - D_i)\,Z_i}{\widehat{\Pr}(Z = 1 \mid X_i)^2}\right] \quad \text{on } X_i$$
This nonparametric regression is controlled by the options vkernel(), vbandwidth(), and vlambda() in ivqte. With these ingredients, we calculate
$$\hat\psi_i = \widehat{W}_i^{AAI}\,\{\tau - 1(Y_i - \tilde X_i'\hat\theta_\tau^{IV} < 0)\}\,\tilde X_i + \hat H(X_i)\,\{Z_i - \widehat{\Pr}(Z = 1 \mid X_i)\}$$
$$\hat\Omega_\tau = \frac{1}{n}\sum_i \hat\psi_i\,\hat\psi_i'$$
where $\widehat{W}_i^{AAI}$ are estimates of the weights.
23. In principle, the same kernel and bandwidth as those for quantile regression can be used. These choices were made to replicate the results of Abadie, Angrist, and Imbens (2002).
A.3 Unconditional exogenous QTEs
The asymptotic distribution of the estimator defined in (6) is given by
$$\sqrt{n}\left(\hat\Delta^\tau - \Delta^\tau\right) \xrightarrow{d} N(0, V)$$
with
$$V = \frac{1}{f_{Y^1}^2(Q_{Y^1}^\tau)}\,E\left[\frac{F_{Y|D=1,X}(Q_{Y^1}^\tau)\,\{1 - F_{Y|D=1,X}(Q_{Y^1}^\tau)\}}{\Pr(D = 1 \mid X)}\right] + \frac{1}{f_{Y^0}^2(Q_{Y^0}^\tau)}\,E\left[\frac{F_{Y|D=0,X}(Q_{Y^0}^\tau)\,\{1 - F_{Y|D=0,X}(Q_{Y^0}^\tau)\}}{1 - \Pr(D = 1 \mid X)}\right] + E\left[\{\eta_1(X) - \eta_0(X)\}^2\right]$$
where $\eta_d(x) = F_{Y|D=d,X}(Q_{Y^d}^\tau)/f_{Y^d}(Q_{Y^d}^\tau)$. $Q_{Y^0}^\tau$ and $Q_{Y^1}^\tau$ have already been estimated by $\hat q$ and $\hat q + \hat\Delta^\tau$, respectively. The densities $f_{Y^d}(Q_{Y^d}^\tau)$ are estimated by the weighted kernel estimators
$$\hat f_{Y^d}\left(\hat Q_{Y^d}^\tau\right) = \frac{1}{n h_n}\sum_{i: D_i = d} \widehat{W}_i^{F}\,k\left(\frac{Y_i - \hat Q_{Y^d}^\tau}{h_n}\right)$$
with Epanechnikov kernel function and Silverman (1986) bandwidth choice, and where $F_{Y|D=d,X}(Q_{Y^d}^\tau)$ is estimated by the local logit estimator described in appendix B.
A.4 Unconditional endogenous QTEs
Finally, the asymptotic variance of the estimator defined in (4) is the most tedious and is given by
$$\begin{aligned}
V ={}& \frac{1}{P_c^2\, f_{Y^1|c}^2(Q_{Y^1|c}^\tau)}\,E\left[\frac{\pi(X,1)}{p(X)}\,F_{Y|D=1,Z=1,X}(Q_{Y^1|c}^\tau)\,\{1 - F_{Y|D=1,Z=1,X}(Q_{Y^1|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^1|c}^2(Q_{Y^1|c}^\tau)}\,E\left[\frac{\pi(X,0)}{1 - p(X)}\,F_{Y|D=1,Z=0,X}(Q_{Y^1|c}^\tau)\,\{1 - F_{Y|D=1,Z=0,X}(Q_{Y^1|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^0|c}^2(Q_{Y^0|c}^\tau)}\,E\left[\frac{1 - \pi(X,1)}{p(X)}\,F_{Y|D=0,Z=1,X}(Q_{Y^0|c}^\tau)\,\{1 - F_{Y|D=0,Z=1,X}(Q_{Y^0|c}^\tau)\}\right] \\
&+ \frac{1}{P_c^2\, f_{Y^0|c}^2(Q_{Y^0|c}^\tau)}\,E\left[\frac{1 - \pi(X,0)}{1 - p(X)}\,F_{Y|D=0,Z=0,X}(Q_{Y^0|c}^\tau)\,\{1 - F_{Y|D=0,Z=0,X}(Q_{Y^0|c}^\tau)\}\right] \\
&+ E\left[\frac{\pi(X,1)^2\,\eta_{11}(X) + \{1 - \pi(X,1)\}^2\,\eta_{01}(X)}{p(X)} + \frac{\pi(X,0)^2\,\eta_{10}(X) + \{1 - \pi(X,0)\}^2\,\eta_{00}(X)}{1 - p(X)}\right. \\
&\qquad \left. {}- p(X)\{1 - p(X)\}\left\{\frac{\pi(X,1)\,\eta_{11}(X) + \{1 - \pi(X,1)\}\,\eta_{01}(X)}{p(X)} + \frac{\pi(X,0)\,\eta_{10}(X) + \{1 - \pi(X,0)\}\,\eta_{00}(X)}{1 - p(X)}\right\}^2\right]
\end{aligned}$$
where $\eta_{dz}(x) = F_{Y|D=d,Z=z,X}(Q_{Y^d|c}^\tau)/\{P_c\,f_{Y^d|c}(Q_{Y^d|c}^\tau)\}$; $p(x) = \Pr(Z = 1 \mid X = x)$; $\pi(x, z) = \Pr(D = 1 \mid X = x, Z = z)$; and $P_c$ is the fraction of compliers. $Q_{Y^0|c}^\tau$ and $Q_{Y^1|c}^\tau$ have already been estimated by $\hat q_{IV}$ and $\hat q_{IV} + \hat\Delta_{IV}^\tau$, respectively. The terms $F_{Y|D=d,Z=z,X}(Q_{Y^d|c}^\tau)$, $p(X)$, and $\pi(x, z)$ are estimated by the local logit estimator described in appendix B. Finally, $P_c$ is estimated by $n^{-1}\sum_i \{\hat\pi(X_i, 1) - \hat\pi(X_i, 0)\}$.
To estimate the densities $f_{Y^d|c}(Q_{Y^d|c}^\tau)$, we note that²⁴
$$f_{Y^d|c}(Q_{Y^d|c}^\tau) = \lim_{h \to 0}\,\frac{1}{h}\int_0^1 k\left(\frac{Q_{Y^d|c}^{t} - Q_{Y^d|c}^\tau}{h}\right)dt$$
where k is the Epanechnikov kernel function with Silverman (1986) bandwidth. We therefore estimate $f_{Y^d|c}$ as
$$\hat f_{Y^d|c}\left(\hat Q_{Y^d|c}^\tau\right) = \frac{1}{h}\int_0^1 k\left(\frac{\hat Q_{Y^d|c}^{t} - \hat Q_{Y^d|c}^\tau}{h}\right)dt$$
where we replace the integral by a sum over n uniformly spaced values of t between 0 and 1.
B Nonparametric regression with mixed data
B.1 Local parametric regression
A key ingredient for the previously introduced estimators (except for the exogenous conditional quantile regression estimator) is the nonparametric estimation of some weights. Local linear and local logit estimators have been implemented for this purpose. This is fully automated in the ivqte command. Nevertheless, some understanding of the nonparametric estimators facilitates the use of the ivqte command.
In many instances, we need to estimate conditional expected values like $E(Y \mid X = X_i)$. We use a local parametric approach throughout; that is, we estimate a locally weighted version of the parametric model. A complication is that many econometric applications contain continuous as well as discrete regressors X. Both types of regressors need to be accommodated in the local parametric model and in the kernel function defining the local neighborhood.
24. To see this, note that
$$\frac{1}{h}\int_0^1 k\left(\frac{Q_{Y^d|c}^{t} - Q_{Y^d|c}^\tau}{h}\right)dt = \int k(u)\,f_{Y^d|c}\left(uh + Q_{Y^d|c}^\tau\right)du$$
where we used the change of variables $uh = Q_{Y^d|c}^{t} - Q_{Y^d|c}^\tau$, which implies that $t = F_{Y^d|c}(uh + Q_{Y^d|c}^\tau)$ and $dt = f_{Y^d|c}(uh + Q_{Y^d|c}^\tau)\,h\,du$. By the mean value theorem, $f_{Y^d|c}(uh + Q_{Y^d|c}^\tau) = f_{Y^d|c}(Q_{Y^d|c}^\tau) + uh\,f'_{Y^d|c}(\bar Q)$, where $\bar Q$ lies between $Q_{Y^d|c}^\tau$ and $uh + Q_{Y^d|c}^\tau$. Hence,
$$\frac{1}{h}\int_0^1 k\left(\frac{Q_{Y^d|c}^{t} - Q_{Y^d|c}^\tau}{h}\right)dt = f_{Y^d|c}(Q_{Y^d|c}^\tau)\int k(u)\,du + O(h) = f_{Y^d|c}(Q_{Y^d|c}^\tau) + O(h)$$
and of smoothing only with respect to the continuous covariates. When the number
of cells in a dataset is large, each cell may not have enough observations to nonpara-
metrically estimate the relationship among the remaining continuous variables. For
this reason, many applied researchers have treated discrete variables in a parametric
way. We follow an intermediate way and use the hybrid product kernel developed by
Racine and Li (2004). This estimator covers all cases from the fully parametric model
up to the traditional nonparametric estimator.
Overall, we can distinguish four different types of regressors: continuous (for exam-
ple, age), ordered discrete (for example, family size), unordered discrete (for example,
regions), and binary variables (for example, gender). We will treat ordered discrete and
continuous variables in the same way and will refer to them as continuous variables in
the following discussion.^25
The unordered discrete and the binary variables are handled differently in the kernel
function and in the local parametric model. The binary variables enter into both as
single regressors. The unordered discrete variables, however, enter as a single regressor
in the kernel function and as a vector of dummy variables in the local model. Consider,
for example, a variable called region that takes four different values: north, south, east,
and west. This variable enters as a single variable in the kernel function but is included
in the local model in the form of three dummies: south, east, and west.
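For instance, such a set of dummies could be created as follows (a sketch only; the stub name regdum is arbitrary, and ivqte and locreg construct these indicators internally):

* Sketch: expand the unordered variable region into indicator variables and
* drop one of them as the arbitrary base category.
tabulate region, generate(regdum)    // creates regdum1, regdum2, ...
drop regdum1                         // keep the remaining indicators for the local model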
The kernel function is defined in the following paragraph. Suppose that the variables
in X are arranged such that the first q_1 regressors are continuous (including the ordered
discrete variables) and the remaining Q - q_1 regressors are discrete without natural
ordering (including binary variables). The kernel weights K(X_i - x) are computed as

K_{h,\lambda}(X_i - x) = \prod_{q=1}^{q_1} \kappa\left(\frac{X_{q,i} - x_q}{h}\right) \prod_{q=q_1+1}^{Q} \lambda^{1(X_{q,i} \neq x_q)}

where X_{q,i} and x_q denote the qth element of X_i and x, respectively; 1(·) is the indicator
function; \kappa(·) is a symmetric univariate weighting function; and h and \lambda are positive
bandwidth parameters with 0 ≤ \lambda ≤ 1. This kernel function measures the distance
between X_i and x through two components: The first term is the standard product
kernel for continuous regressors with h defining the size of the local neighborhood. The
second term measures the mismatch between the unordered discrete (including binary)
regressors. \lambda defines the penalty for the unordered discrete regressors. For example,
the multiplicative weight contribution of the Qth regressor is 1 if the Qth element of
X_i and x are identical, and it is \lambda if they are different. If h = ∞ and \lambda = 1, then
the nonparametric estimator corresponds to the global parametric estimator and no
interaction term between the covariates is allowed. On the other hand, if \lambda is zero
and h is small, then smoothing proceeds only within each of the cells defined by the
25. Racine and Li (2004) suggest using a geometrically declining kernel function for the ordered discrete
regressors. There are no reasons, however, against using quadratically declining kernel weights. In
other words, we can use the same (for example, Epanechnikov) kernel for ordered discrete as for
continuous regressors. We therefore treat ordered discrete regressors in the same way as continuous
regressors in the following discussion.
discrete regressors and only observations with similar continuous covariates will be used.
Finally, if λ and h are in the intermediate range, observations with similar discrete and
continuous covariates will be weighted more but further observations will also be used.
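To make the weighting concrete, the following sketch computes the weights K_{h,λ}(X_i - x) by hand for a single evaluation point x (here the first observation); the variable names xc1, xc2 (continuous) and xd1, xd2 (binary or unordered, coded numerically) are hypothetical, and ivqte and locreg of course perform this computation internally:

* Sketch: mixed kernel weights with an Epanechnikov kernel for the continuous
* regressors and a penalty lambda for each mismatch on the discrete regressors.
scalar h      = 0.5
scalar lambda = 0.3
foreach v of varlist xc1 xc2 xd1 xd2 {        // evaluation point x = first observation
    scalar x_`v' = `v'[1]
}
generate double w = 1
foreach v of varlist xc1 xc2 {                // product kernel over continuous regressors
    generate double u_`v' = (`v' - scalar(x_`v')) / scalar(h)
    replace w = w * cond(abs(u_`v') < 1, 0.75*(1 - u_`v'^2), 0)
}
foreach v of varlist xd1 xd2 {                // weight lambda for every mismatch
    replace w = w * cond(`v' == scalar(x_`v'), 1, scalar(lambda))
}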
In principle, instead of using only two bandwidth values, h and λ, for all regressors, a dif-
ferent bandwidth could be employed for each regressor, but doing so would substantially
increase the computational burden for bandwidth selection. This approach might lead
to additional noise due to estimating these bandwidth parameters. Therefore, we prefer
to use only two smoothing parameters. ivqte automatically orthogonalizes the data
matrix of all continuous regressors to create an identity covariance matrix. This greatly
diminishes the appeal of having multiple bandwidths.
This kernel function, combined with a local model, is used to estimate E (Y |X).
If Y is a continuous variable, then ivqte uses by default a local linear estimator to
estimate E(Y|X = x) as \hat a in

(\hat a, \hat b) = \arg\min_{a,b} \sum_{j=1}^{n} \left\{Y_j - a - b(X_j - x)\right\}^2 K_{h,\lambda}(X_j - x)
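Given kernel weights such as w in the sketch above, the local linear fit at x is simply a weighted least-squares regression on regressors centered at x, and the fitted value at x is the intercept (again a sketch with a hypothetical outcome y; ivqte and locreg do this internally):

* Sketch: local linear estimate of E(Y|X=x) using the kernel weights w.
foreach v of varlist xc1 xc2 xd1 xd2 {
    generate double c_`v' = `v' - scalar(x_`v')
}
regress y c_xc1 c_xc2 c_xd1 c_xd2 [aweight = w]
display "local linear estimate of E(Y|X=x): " _b[_cons]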
If Y is bound from above and below, a local logistic model is usually preferred. We
suppose in the following discussion that Y is bound within [0, 1].^26 This includes the
special case where Y is binary. The local logit estimator guarantees that the fitted
values are always between 0 and 1. The local logit estimator can be used by selecting
the logit option. In this case, E(Y|X = x) is estimated by \Lambda(\hat a), where

(\hat a, \hat b) = \arg\max_{a,b} \sum_{j=1}^{n} \Big( Y_j \ln \Lambda\{a + b(X_j - x)\} + (1 - Y_j) \ln\big[1 - \Lambda\{a + b(X_j - x)\}\big] \Big) K_{h,\lambda}(X_j - x)

and \Lambda(x) = 1/(1 + e^{-x}).
As mentioned before, each of the unordered discrete variables enters in the form of
a dummy variable for each of its support points except for an arbitrary base category;
for example, if the region variable takes four different values, then three dummies are
included.
The ivqte command requires that the values of the smoothing parameters h and λ
are supplied by the user. Before estimating local linear or local logit with these
smoothing parameters, ivqte (as well as locreg) first attempts to estimate the global
model (that is, with h = ∞ and λ = 1). If estimation fails due to collinearity or perfect
prediction, the regressors which caused these problems are eliminated.^27
Thereafter, the
model is estimated locally with the user-supplied smoothing parameters. If estimation
26. If the lower and upper bounds of Y are different from 0 and 1, Y should be rescaled to the interval
[0, 1].
27. This is done using rmcollright, where ivqte first searches for collinearity among the continuous
regressors and thereafter among all other regressors.
fails locally because of collinearity or perfect prediction, the bandwidths are increased
locally. This is repeated until convergence is achieved.
The locreg command also contains a leave-one-out cross-validation procedure to
choose the smoothing parameters.^28 The user provides a grid of values for h and λ, and
the cross-validation criterion is computed for all possible combinations of these values.
The values of the cross-validation criterion are returned in r(cross_valid) and the
combination that minimizes this criterion is chosen. If only one value is given for h and
λ, no grid search is performed.
B.2 The locreg command
Because the codes implementing the nonparametric regressions are likely to be of inde-
pendent interest in other contexts, we offer a separate command for the local parametric
regressions. This locreg command implements local linear and local logit regression
and chooses the smoothing parameters by leave-one-out cross-validation. The formal
syntax of locreg is as follows:
locreg depvar [if] [in] [weight] [, generate(newvarname[, replace])
    continuous(varlist) dummy(varlist) unordered(varlist) kernel(kernel)
    bandwidth(# [# # ...]) lambda(# [# # ...]) logit mata_opt
    sample(varname[, replace])]

aweights and pweights are allowed; see [U] 11.1.6 weight for more information
on weights.
B.3 Description
locreg computes the nonparametric estimation of the mean of depvar conditionally on
the regressors given in continuous(), dummy(), and unordered(). A mixed kernel is
used to smooth over the continuous and discrete regressors. The fitted values are saved
in the variable newvarname. If a list of values is given in bandwidth() or lambda(), the
smoothing parameters h and λ are estimated via leave-one-out cross-validation. The
values of h and λ minimizing the cross-validation criterion are selected. These values are
then used to predict depvar, and the fitted values are saved in the variable newvarname.
locreg can be used in three different ways. First, if only one value is given in
bandwidth() and one in lambda(), locreg estimates the nonparametric regression
using these values and saves the fitted values in generate(newvarname). Alternatively,
28. The cross-validated parameters are optimal to estimate the weights but are not optimal to estimate
the unconditional QTE. In the absence of a better method, we offer cross-validation, but the user
should keep in mind that the optimal bandwidths for the unconditional QTE converge to zero at a
faster rate than the bandwidths delivered by cross-validation. The user is therefore encouraged to
also examine the estimated QTE when using some undersmoothing relative to the cross-validation
bandwidths.
locreg can also be used to estimate the smoothing parameters via leave-one-out cross-
validation. If we do not specify the generate() option but supply a list of values in
the bandwidth() or lambda() option, only the cross-validation is performed. Finally, if
several values are specified in bandwidth() or lambda() when the generate() option is
also specified, locreg estimates the optimal smoothing parameters via cross-validation.
Thereafter, it estimates the conditional means with these smoothing parameters and
returns the fitted values in the variable generate(newvarname).
For the nonparametric regression, locreg offers two local models: linear and logistic.
The logistic model is usually preferred if depvar is bound within [0, 1]. This includes
the case where depvar is binary but also incorporates cases where depvar is nonbinary
but bound from above and below. If the lower and upper bounds of depvar are different
from 0 and 1, the variable depvar should be rescaled to the interval [0, 1] before using
this command. If depvar is not bound from above and below, the linear model should
be used.^29
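For example, a dependent variable with arbitrary finite bounds could be rescaled using its empirical minimum and maximum before calling locreg (a sketch; the variable name y is hypothetical):

* Sketch: rescale a bounded dependent variable to the [0, 1] interval.
quietly summarize y
generate double y01 = (y - r(min)) / (r(max) - r(min))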
B.4 Options
generate(newvarname[, replace]) specifies the name of the variable that will contain
the fitted values. If this option is not used, only the leave-one-out cross-
validation estimation of the smoothing parameters h and λ will be performed. The
replace option allows locreg to overwrite an existing variable or to create a new
one where none exists.
continuous(varlist), dummy(varlist), and unordered(varlist) specify the names of the
covariates depending on their type. Ordered discrete variables should be treated as
continuous.
kernel(kernel) specifies the kernel function. kernel may be epan2 (Epanechnikov ker-
nel function; the default), biweight (biweight kernel function), triweight (tri-
weight kernel function), cosine (cosine trace), gaussian (Gaussian kernel func-
tion), parzen (Parzen kernel function), rectangle (rectangle kernel function), or
triangle (triangle kernel function). In addition to these second-order kernels, there
are also several higher-order kernels: epanechnikov_o4 (Epanechnikov order 4),
epanechnikov_o6 (order 6), gaussian_o4 (Gaussian order 4), gaussian_o6 (order
6), gaussian_o8 (order 8).^30
bandwidth(# [# # ...]) is used to smooth over the continuous variables. The default
is h = ∞. The continuous regressors are first orthogonalized such that their
covariance matrix is the identity matrix. The bandwidth must be strictly positive.
If the bandwidth h is missing, an infinite bandwidth h = ∞ is used. The default
value is infinity.
29. In the current implementation, there is not yet a local model specifically designed for depvar that
is bound only from above or only from below. A local tobit or local exponential model may be
added in future versions.
30. The formulas for the higher-order kernel functions are given in footnote 16.
If a list of values is supplied for bandwidth(), cross-validation is used with respect
to each value in this list to estimate the bandwidth among the proposed values. If a
list of values is supplied for bandwidth() and for lambda(), cross-validation consid-
ers all pairwise combinations from these two lists. In case of local multicollinearity,
the bandwidth is progressively increased until the multicollinearity problem disap-
pears.^31
lambda(# [# # ...]) is used to smooth over the dummy and unordered discrete
variables. It must be between 0 and 1. A value of 0 implies that only observations
within the cell defined by all discrete regressors are used to estimate the conditional
mean. The default is lambda(1), which corresponds to global smoothing. If a list of
values is supplied for lambda(), cross-validation is used with respect to each value
in this list to estimate the lambda among the proposed values. If a list of values is
supplied for bandwidth() and for lambda(), cross-validation considers all pairwise
combinations from these two lists.
logit activates the local logit estimator. If it is not activated, the local linear estimator
is used as the default.^32
mata_opt selects the official optimizer introduced in Stata 10, Mata's optimize(), to
obtain the local logit. The default is a simple Gauss-Newton algorithm written for
this purpose. This option is only relevant when the logit option has been specified.
sample(varname[, replace]) specifies the name of the variable that marks the estimation
sample. This is similar to the function e(sample) for e-class commands.
31. In case of multicollinearity, h is increased repeatedly until the problem disappears. If multicollinear-
ity is still present at h = 100, then λ is increased repeatedly. Note that locreg first examines
whether multicollinearity is a problem in the global model (h = ∞, λ = 1) before attempting to
estimate locally.
32. This is different from ivqte, where local logit is the default for binary dependent variables.
B.5 Saved results
locreg saves the following in r():
Scalars
r(N) number of observations
r(optb) optimal bandwidth
r(optl) optimal lambda
r(best_mse) smallest value of the cross-validation criterion
Macros
r(command) locreg
r(depvar) name of the dependent variable
r(continuous) name of the continuous covariates
r(dummy) name of the binary covariates
r(unordered) name of the unordered covariates
r(kernel) kernel function
r(model) linear or logistic model used
r(optimization) algorithm used
Matrices
r(cross_valid) bandwidths, lambda, and resulting values of the cross-validation criterion
B.6 Examples
We briefly illustrate the use of locreg with a few examples. We use card.dta and
keep only 200 observations to keep the computational time reasonable for this illus-
tration. (Because of missing values on covariates, eventually only 184 observations are
retained in the estimation.) The aim is to estimate the probability of living near a
four-year college (nearc4) as a function of experience, exper (continuous() variable);
mother's education, motheduc (ordered discrete); region (unordered()); and black
(dummy()). locreg can be used in three different ways. First, if only one value is
given in bandwidth(#) and one in lambda(#), locreg estimates the nonparametric
regression using these values h and λ and saves the fitted values in newvarname:
. use card, clear
. set seed 123
. sample 200, count
(2810 observations deleted)
. locreg nearc4, generate(fitted1) bandwidth(0.5) lambda(0.5)
> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted )
The fitted1 variable contains the estimated probabilities. Because some of them
turn out to be negative and others to be larger than one, we may prefer to fit a local
logit regression and add the logit option:
. locreg nearc4, generate(fitted2) bandwidth(0.5) lambda(0.5)
> continuous(exper motheduc) dummy(black) unordered(region) logit
(output omitted )
locreg can also be used to estimate the smoothing parameters via leave-one-out
cross-validation. If we do not specify the generate() option but instead supply a list of
values in the bandwidth() or the lambda() option (or both), only the cross-validation
is performed:
. locreg nearc4, bandwidth(0.2 0.5) lambda(0.5 0.8) continuous(exper motheduc)
> dummy(black) unordered(region)
(output omitted )
In this example, the cross-validation criterion is calculated for each of the four cases:
(h, λ) = (0.2, 0.5), (0.2, 0.8), (0.5, 0.5), and (0.5, 0.8). The scalars r(optb) and r(optl)
indicate the values that minimized the cross-validation criterion. In our example, we
obtain ĥ = 0.2 and λ̂ = 0.5. The cross-validation results are saved in the matrix
r(cross_valid) for every h and λ combination of the search grid.
If we would like to include in our cross-validation search the value infinity, that is,
the global parametric model, we would supply a missing value for h and a value of 1 for
λ. For example, specifying bandwidth(0.2 .) lambda(0.5 1) implies that the cross-
validation criterion is calculated for each of the four cases: (h, λ) = (0.2, 0.5), (0.2, 1),
(∞, 0.5), and (∞, 1). Similarly, specifying bandwidth(0.2 0.5 .) lambda(0.5 0.8
1) implies a search grid with nine values: (h, λ) = (0.2, 0.5), (0.2, 0.8), (0.2, 1), (0.5, 0.5),
(0.5, 0.8), (0.5, 1), (∞, 0.5), (∞, 0.8), and (∞, 1).
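For instance, the nine-point grid just described could be requested with a call such as the following (same covariates as before; this command is not part of the original log, and its output is omitted):

. locreg nearc4, bandwidth(0.2 0.5 .) lambda(0.5 0.8 1) continuous(exper motheduc)
> dummy(black) unordered(region)
(output omitted )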
Finally, if several values are specified for the smoothing parameters and the
generate() option is also activated, then locreg first estimates ĥ and λ̂ via cross-
validation and thereafter returns the fitted values obtained with ĥ and λ̂ in the fitted3
variable.
. locreg nearc4, generate(fitted3) bandwidth(0.2 0.5) lambda(0.5 0.8)
> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted )
The Stata Journal (2010)
10, Number 3, pp. 458–481
Translation from narrative text to standard
codes variables with Stata
Federico Belotti
University of Rome Tor Vergata
Rome, Italy
federico.belotti@uniroma2.it
Domenico Depalo
Bank of Italy
Rome, Italy
domenico.depalo@bancaditalia.it
Abstract. In this article, we describe screening, a new Stata command for data
management that can be used to examine the content of complex narrative-text
variables to identify one or more user-defined keywords. The command is useful
when dealing with string data contaminated with abbreviations, typos, or mistakes.
A rich set of options allows a direct translation from the original narrative string
to a user-defined standard coding scheme. Moreover, screening is flexible enough
to facilitate the merging of information from different sources and to extract or
reorganize the content of string variables.
Editor's note. This article refers to undocumented functions of Mata, meaning that
there are no corresponding manual entries. Documentation for these functions is
available only as help files; see help regex.
Keywords: dm0050, screening, keyword matching, narrative-text variables, stan-
dard coding schemes
1 Introduction
Many researchers in varied fields frequently deal with data collected as narrative text,
which are almost useless unless treated. For example,
Electronic patient records (EPRs) are useful for decision making and clinical re-
search only if patient data that are currently documented as narrative text are
coded in standard form (Moorman et al. 1994).
When different sources of data use different spellings to identify the same unit of in-
terest, the information can be exploited only if codes are made uniform (Raciborski
2008).
Because of verbatim responses to open-ended questions, survey data items must
be converted into nominal categories with a fixed coding frame to be useful for
applied research.
These are only three of the many critical examples that motivate an ad hoc command.
Recoding a narrative-text variable into a user-defined standard coding scheme is cur-
rently possible in Stata by combining standard data-management commands (for exam-
ple, generate and replace) with regular expression functions (for example, regexm()).
However, many problems do not yield easily to this approach, especially problems con-
taining complex narrative-text data. Consider, for example, the case when many source
variables can be used to identify a set of keywords; or the case when, looking at different
keywords, one is within a given source variable but not necessarily at the beginning of
that variable, whereas the others are at the beginning, the end, or within that or other
source variables. Because no command jointly handles all possible cases, these cases
can be treated with existing Stata commands only after long and tedious programming,
increasing the possibility of introducing errors. We developed the screening command
to fill this gap, simplifying data-cleaning operations while being flexible enough to cover
a wide range of situations.
In particular, screening checks the content of one or more string variables (sources)
to identify one or more user-defined regular expressions (keywords). Because string vari-
ables are not flexible, to make the command easier and more useful, a set of options
reduces your preparatory burden. You can make the matching task wholly case in-
sensitive or set matching rules aimed at matching keywords at the beginning, the end,
or within one or more sources. If source variables contain periods, commas, dashes,
double blanks, ampersands, parentheses, etc., it is possible to perform the matching by
removing such undesirable content. Moreover, if the matching task becomes more dif-
ficult because of abbreviations or even pure mistakes, screening allows you to specify
the number of letters to screen in a keyword. Finally, the command allows a direct
translation of the original string variables into a user-defined standard coding scheme.
All these features make the command simple, extremely flexible, and fast, minimizing
the possibility of introducing errors. It is worth emphasizing that we find Mata more
convenient to use than Stata, with advantages in terms of execution time.
The article is organized as follows. In section 2, we describe the new screening
command, and we provide some useful tips in section 3. Section 4 illustrates the main
features of the command using EPR data, while section 5 details some critical cases in
which the use of screening may aid your decision to merge data from different sources
or to extract and reorder messy data. In the last section, section 6, we offer a short
summary.
2 The screening command
String variables are useful in many practical circumstances. A drawback is that they
are not so flexible: for example, in EPR data, coding CHOLESTEROL is different from
coding CHOLESTEROL LDL, although the broad pathology is the same. Stata and Mata
offer many built-in functions to handle strings. In particular, screening extensively
uses the Mata regular-expression functions regexm(), regexr(), and regexs().
2.1 Syntax
screening [if] [in], sources(varlist[, sourcesopts]) keys([matching rule] "string"
    [[matching rule] "string" ...]) [letters(#) explore(type) cases(newvar)
    newcode(newvar[, newcodeopts]) recode(recoding rule "user defined code"
    [recoding rule "user defined code" ...]) checksources tabcheck memcheck
    nowarnings save time]
2.2 Options
sources(varlist[, sourcesopts]) specifies one or more string source variables to be
screened. sources() is required.
sourcesopts description
lower perform a case-insensitive match (lowercase)
upper perform a case-insensitive match (uppercase)
trim match keywords by removing leading and trailing blanks
from sources
itrim match keywords by collapsing sources with consecutive
internal blanks to one blank
removeblank match keywords by removing from sources all blanks
removesign match keywords by removing from sources the following
signs: * + ? / \ % ( ) [ ] { } | . ^ - _ # $
keys([matching rule] "string" ...) specifies one or more regular expressions (key-
words) to be matched with source variables. keys() is required.
matching rule description
begin match keywords at beginning of string
end match keywords at end of string
letters(#) specifies the number of letters to be matched in a keyword. The number
of letters can play a critical role: specifying a high number of letters may cause
the number of matched observations to be artificially low because of mistakes or
abbreviations in the source variables; on the other hand, matching a small number
of letters may cause the number of matched observations to be artificially high
because of the inclusion of uninteresting cases containing the too short keyword.
The default is to match keywords as a whole.
explore(type) allows you to explore screening results.
type description
tab tabulate all matched cases for each keyword within each source variable
count display a table of frequency counts of all matched cases for each
keyword within each source variable
cases(newvar) generates a set of categorical variables (as many as the number of key-
words) showing the number of occurrences of each keyword within all specified source
variables.
newcode(newvar[, newcodeopts]) generates a new (numeric) variable that contains the
position of the keywords or the regular expressions in keys(). The coding process
is driven by the order of keywords or regular expressions.
newcodeopts description
replace replace newvar if it already exists
add obtain newvar as a concatenation of subexpressions returned by
regexs(n), which must be specified as a
user defined code in recode
label attach keywords as value labels to newvar
numeric convert newvar from string to numeric; it can be specified only if
the recode() option is specified
recode(recoding rule "user defined code" [recoding rule "user defined code" ...])
recodes the newcode() newvar according to a user-defined coding scheme. recode()
must contain at least one recoding rule followed by one user defined code. When you
specify recode(1 "user defined code"), the "user defined code" will be used to re-
code all matched cases from the first keyword within the list specified via the keys()
option. If recode(2,3 "user defined code") is specified, the "user defined code" will
be used to recode all cases for which second and third keywords are simultaneously
matched, and so on. This option can only be specified if the newcode() option is
specified.
checksources checks whether source variables contain special characters. If a match-
ing rule is specified (begin or end via keys()), checksources checks the sources'
boundaries accordingly.
tabcheck tabulates all cases from checksources. If there are too many cases, the
option does not produce a table.
memcheck performs a preventive memory check. When memcheck is specified, the
command will exit promptly if the allocated memory is insufficient to run screening.
When memory is insufficient and screening is run without memcheck, the command
could run for several minutes or even hours before producing the message no room
to add more variables.
nowarnings suppresses all warning messages.
save saves in r( ) the number of cases detected, matching each source with each key-
word.
time reports elapsed time for execution (seconds).
3 Tips
The low flexibility of string variables is a reason for concern. In this section, we provide
some tips to enhance the usefulness of screening. Some tips are useful to execute the
command, while other tips are useful to check the results.
Most importantly, capitalization matters: this means that screening for KEYWORD is
different from screening for keyword. If source variables contain HEMINGWAY and you are
searching for Hemingway, screening will not identify that keyword. If suboption upper
(lower) is specified in sources(), keywords will be automatically matched in uppercase
(lowercase).
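For example, with a hypothetical source variable author whose entries are stored in uppercase, supplying the keyword in lowercase works once the lower suboption is specified:

. * hypothetical example: author contains entries such as HEMINGWAY ERNEST
. screening, sources(author, lower) keys("hemingway") explore(count)
(output omitted )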
Choose an appropriate matching rule. The screening default is to match keywords
over the entire content of source variables. By specifying the matching rule begin or
end within the keys() option, you can instead restrict the matching to the corresponding string
boundaries. For example, if sources contain HEMINGWAY ERNEST and ERNEST HEMINGWAY
and you are searching begin HEMINGWAY, the screening command will identify the
keyword only in the former case. Whether the two cases are equivalent must be evaluated
case by case.
Another issue is how to choose the optimal number of letters to be screened. For
example, with EPR data, different physicians might use different abbreviations for the
same pathologies, so there is no single right number of letters to look for. As
a rule of thumb, the number of letters should be specified as the minimum number
that uniquely identifies the case of interest. Using many letters can be too exclusive,
while using few letters can be too inclusive. In all cases, but in particular when the
appropriate number of letters is unknown, we find it useful to tabulate all matched cases
through the explore(tab) option. Because it tabulates all possible matches between
all keywords and all source variables, it is the fastest way to explore the data and choose
the best matching strategy (in terms of keywords, matching rule, and letters).
Advanced users can maximize the potential of screening by mixing keywords
with Stata regular-expression operators. Mixing in operators allows you to match more-
complex patterns, as we show later in the article.^1 For more details on regular-expression
syntaxes and operators, see the official documentation at
http://www.stata.com/support/faqs/data/regex.html.
1. The letters() option does not work if a keyword contains regular-expression operators.
screening displays several messages to inform you about the effects of the specified
options. For example, consider the case in which you are searching some keywords con-
taining regular-expression operators. screening will display a message with the correct
syntax to search a keyword containing regular-expression operators. The nowarnings
option allows you to suppress all warning messages.
screening generates several temporary variables (proportional to the number of
keywords you are looking for and to the number of sources you are looking from). So
when you are working with a big dataset and your computer is limited in terms of
RAM, it might be a good idea to perform a preventive memory check. When the
memcheck option is specified and the allocated memory is insufficient, screening will
exit promptly rather than running for several minutes or even hours before producing
the message no room to add more variables.
We conclude this section with an evaluation of the command in terms of execution
time using different Stata flavors and different operating systems. In particular, we
compare the latest version of screening written using Mata regular-expression func-
tions with its beta version written entirely using the Stata counterpart. We built three
datasets of 500,000 (A), 5 million (B), and 50 million (C) observations with an ad hoc
source variable containing 10 different words: HEMINGWAY, FITZGERALD, DOSTOEVSKIJ,
TOLSTOJ, SAINT-EXUPERY, HUGO, CERVANTES, BUKOWSKI, DUMAS, and DESSI. Screening
for HEMINGWAY (50% of total cases) gives the following results (in seconds):
                                           Mata                     Stata
Stata flavor and operating system      A      B       C         A      B       C

Stata/SE 10 (32-bit) and
Mac OS X 10.5.8 (64-bit)*            0.66   6.67     na       0.93   9.24     na

Stata/MP 11 (64-bit) and
Mac OS X 10.5.8 (64-bit)*            0.60   5.66     na       0.85   7.73     na

Stata/MP 11 (64-bit) and
Windows Server 2003 (64-bit)+        0.37   3.70   37.22      0.70   7.06   70.59

* Intel Core 2 Duo 2.2 GHz (dual core) with 4 GB RAM
+ AMD Opteron 2.2 GHz (quad core) with 20 GB RAM
The table speaks for itself!
4 Example
To illustrate the command, we use anonymized patient-level data from the Health Search
database, a nationally representative panel of patients run by the Italian College of
General Practitioners (Italian Society of General Medicine). Our sample contains freely
inputted EPRs concerning the prescription of diagnostic tests.^2 A list of 15 observations
2. The original data are in Italian. Where necessary for comprehension, we translate to English.
from the uppercase source variable diagn_test_descr provides an overview of
cases at hand:
. list diagn_test_descr in 1/15, noobs separator(20)
diagn_test_descr
TRIGLICERIDI
EMOCROMO FORMULA
COLESTEROLO TOTALE
ALTEZZA
PT TEMPO PROTROMBINA
VISITA CARDIOLOGICA CONTROLLO
HCV AB EPATITE C
COMPONENTE MONOCLONALE
ATTIVITA FISICA
PSA ANTIGENE PROSTATICO SPECIFICO
RX CAVIGLIA SN
FAMILIARITA K UTERO
TRIGLICERIDI
URINE ESAME COMPLETO
URINE PESO SPECIFICO
As you can see, this is a rich EPR dataset that is totally useless unless treated. If
data were collected for research purposes, physicians would be given a finite number of
possible options. There is much agreement in the scientific community that the cost to
leave the burden of inputting standard codes directly to physicians at the time of contact
with the patient is higher than the relative benefit: the task is extremely onerous, it is
unrelated to the physician's primary job, and most importantly, it requires extra effort.
Therefore, the common view supports the implementation of data-entry methods that
do not disturb the physician's workflow (Yamazaki and Satomura 2000).
From the above list of observations, it is also clear that free-text data entry provides
physicians with the freedom to determine the order and detail at which they want
to input data. Even if the original free-text data were complete, it would still be
difficult to extract standardized and structured data from this kind of record because
of abbreviations, typos, or mistakes (Moorman et al. 1994). Extracting data in the
presence of abbreviations and typos is exactly what screening allows you to do.
As a practical example, we focus on the identification of different types of cholesterol
tests. In particular, our aim is to create a new variable (diagn_test_code) containing
cholesterol test codes according to the Italian National Health System coding scheme.
Because at least three types of cholesterol test exist, namely, hdl, ldl, and total, our
matching strategy must take into account that a physician can input 1) only the types
of the test, 2) only its broad definition (cholesterol), or 3) both, without considering
abbreviations, typos, mistakes, and further details.
Thus we first explore the data by running screening with the explore(tab) option:
. screening, sources(diagn_test_descr, lower) keys(colesterolo) explore(tab)
Cases of colesterolo found in diagn_test_descr
colesterolo Freq. Percent Cum.
colesterolo totale 2,954 51.86 51.86
hdl colesterolo 1,854 32.55 84.41
ldl colesterolo 617 10.83 95.24
colesterolo hdl 117 2.05 97.30
colesterolo ldl 37 0.65 97.95
colesterolo tot 28 0.49 98.44
colesterolo 24 0.42 98.86
colesterolo hdl sangue 16 0.28 99.14
colesterolo totale sangue 16 0.28 99.42
colesterolo esterificato 4 0.07 99.49
colesterolo tot. 4 0.07 99.56
colesterolo hdl 90.14.1 3 0.05 99.61
colesterolo totale 90143 3 0.05 99.67
colesterolo libero 2 0.04 99.70
colesterolo stick 2 0.04 99.74
colesterolo tot hdl 2 0.04 99.77
colesterolo totale 90.143 2 0.04 99.81
ultima misurazione colesterolo 2 0.04 99.84
colesterolo hdl 1 0.02 99.86
colesterolo ldl 90.14.2 1 0.02 99.88
colesterolo non ldl 1 0.02 99.89
colesterolo t. mg/dl 1 0.02 99.91
colesterolo tot. c 1 0.02 99.93
colesterolo tot. hdl 1 0.02 99.95
colesterolo tot., 1 0.02 99.96
colesterolo totale h 1 0.02 99.98
rich,specialistica colesterolo trigl 1 0.02 100.00
Total 5,696 100.00
Here the lower suboption makes the matching task case insensitive. Apart from the
explore(tab) option, the syntax above is compulsory and performs what we call a
default matching, that is, an exact match of the keyword colesterolo over the entire
content of the source variable diagn_test_descr. The tabulation above (notice the
lowercase) informs you that the keyword colesterolo is encountered in 5,696 cases.
What do these cases contain? Because you did not instruct the command to match a
shorter length of the keyword, the only possible case is the keyword itself; all the cases
contain the keyword colesterolo.
Given the nature of the data, it might be convenient to run screening with a
shorter length of the keyword so as to find possible partial matching in the presence
of abbreviations or mistakes. The letters(#) option instructs screening to perform
the match on a shorter length:
. screening, sources(diagn_test_descr, lower) keys(colesterolo) letters(5)
> explore(tab)
Cases of coles found in diagn_test_descr
coles Freq. Percent Cum.
colesterolo totale 2,954 37.25 37.25
hdl colesterolo 1,854 23.38 60.62
coles ldl 1,343 16.93 77.56
hdl colest 853 10.76 88.31
ldl colesterolo 617 7.78 96.09
colesterolo hdl 117 1.48 97.57
colesterolo ldl 37 0.47 98.03
colesterolo tot 28 0.35 98.39
colesterolo 24 0.30 98.69
colesterolo hdl sangue 16 0.20 98.89
colesterolo totale sangue 16 0.20 99.09
colesterolemia 14 0.18 99.27
hdl colest. 5 0.06 99.33
colest.tot. 4 0.05 99.38
colesterolo esterificato 4 0.05 99.43
colesterolo tot. 4 0.05 99.48
azotemia glicemia colest 3 0.04 99.52
colest. hdl 3 0.04 99.56
colesterolo hdl 90.14.1 3 0.04 99.60
colesterolo totale 90143 3 0.04 99.63
colesterolo libero 2 0.03 99.66
colesterolo stick 2 0.03 99.68
colesterolo tot hdl 2 0.03 99.71
colesterolo totale 90.143 2 0.03 99.74
ldl colest. 2 0.03 99.76
ultima misurazione colesterolo 2 0.03 99.79
colest. ldl 1 0.01 99.80
colest. tot. 1 0.01 99.81
colest.tot 1 0.01 99.82
colester.tot.hdl, 1 0.01 99.84
colesterolo hdl 1 0.01 99.85
colesterolo ldl 90.14.2 1 0.01 99.86
colesterolo non ldl 1 0.01 99.87
colesterolo t. mg/dl 1 0.01 99.89
colesterolo tot. c 1 0.01 99.90
colesterolo tot. hdl 1 0.01 99.91
colesterolo tot., 1 0.01 99.92
colesterolo totale h 1 0.01 99.94
emocromo c. colester 1 0.01 99.95
glicemia colesterolemia- 1 0.01 99.96
got gpt colest / trigli/creat/emocromo 1 0.01 99.97
rich,specialistica colesterolo trigl 1 0.01 99.99
uricemia uricuria colest 1 0.01 100.00
Total 7,931 100.00
By specifying a five-letter partial match, screening detects 2,235 new cases of
cholesterol tests. By further reducing the number of letters, we get the following result:^3
3. Because of space restrictions, we deliberately omit the complete tabulation obtainable with the
explore(tab) option. It is available upon request.
. screening, sources(diagn_test_descr, lower) keys("colesterolo") letters(3)
> explore(tab)
Cases of col found in diagn_test_descr
col Freq. Percent Cum.
colesterolo totale 2,954 23.45 23.45
col tot 2,034 16.15 39.60
hdl colesterolo 1,854 14.72 54.32
coles ldl 1,343 10.66 64.99
hdl colest 853 6.77 71.76
ldl colesterolo 617 4.90 76.66
urinocoltura coltura urina 326 2.59 79.25
v.ginecologica 161 1.28 80.52
eco tiroide eco capo e collo 150 1.19 81.71
colesterolo hdl 117 0.93 82.64
(output omitted )
colesterolo ldl 37 0.29 90.77
calcolo rischio cardiovascolare (iss) 35 0.28 91.04
coprocoltura coltura feci 33 0.26 91.31
colore 32 0.25 91.56
ecocolordoppler arti inf. art. 32 0.25 91.81
urinocoltura 32 0.25 92.07
colposcopia 31 0.25 92.31
colesterolo tot 28 0.22 92.54
reticolociti 28 0.22 92.76
ecodoppler a.inferiori ecocolor venosa 27 0.21 92.97
eco ginecologica 25 0.20 93.17
colesterolo 24 0.19 93.36
rischio cardio vascolare nota 13 23 0.18 93.55
rischio cardiovascolare % a 10 anni 22 0.17 93.72
ecodoppler a.inferiori ecocolor arter. 19 0.15 93.87
(output omitted )
col hdl 3 0.02 97.13
colest. hdl 3 0.02 97.16
colesterolo hdl 90.14.1 3 0.02 97.18
colesterolo totale 90143 3 0.02 97.21
conta batt.,urinocoltura, antibiogramma 3 0.02 97.23
eco cardiaca con doppler e colordoppler 3 0.02 97.25
eco color/doppl.car. ver 3 0.02 97.28
eco(color)dopplergrafia 3 0.02 97.30
ecocardiografia colordoppler 3 0.02 97.32
ecocolordoppler art.aa.inf. 3 0.02 97.35
ecocolordoppler arterioso arti inferior 3 0.02 97.37
ecocolordoppler tronchi sovraortici 3 0.02 97.40
ecocolordopplergrafia cardiaca 3 0.02 97.42
ecografia muscolotendinea 3 0.02 97.44
ecografia tiroide eco capo e collo 3 0.02 97.47
familiarita ev.cerebrovascol.( 72m 74f 3 0.02 97.49
immunocomplessi circolanti 3 0.02 97.51
rx digerente (tenue e colon) 3 0.02 97.54
test broncodilatazione farmacologica 3 0.02 97.56
test cardiovascolare da sforzo con cicl 3 0.02 97.59
test sforzo cardiovascol. pedana mobile 3 0.02 97.61
urinocoltura atb+mic 3 0.02 97.63
urinocoltura con antibiogramma 3 0.02 97.66
urinocoltura identificazione batt.+ ab 3 0.02 97.68
che colinesterasi 2 0.02 97.70
col 2 0.02 97.71
colangio rm 2 0.02 97.73
colesterolo libero 2 0.02 97.75
colesterolo stick 2 0.02 97.76
colesterolo tot hdl 2 0.02 97.78
colesterolo totale 90.143 2 0.02 97.79
(output omitted )
ldl colest. 2 0.02 98.11
(output omitted )
col tot 216 hdl 58 fibri 1 0.01 98.48
col=245ldl=193tr=91 1 0.01 98.48
colangiografia intravenosa 1 0.01 98.49
colecistografia 1 0.01 98.50
colecistografia per os c 1 0.01 98.51
colest. ldl 1 0.01 98.52
colest. tot. 1 0.01 98.52
colest.tot 1 0.01 98.53
colester.tot.hdl, 1 0.01 98.54
colesterolo hdl 1 0.01 98.55
colesterolo ldl 90.14.2 1 0.01 98.55
colesterolo non ldl 1 0.01 98.56
colesterolo t. mg/dl 1 0.01 98.57
colesterolo tot. c 1 0.01 98.58
colesterolo tot. hdl 1 0.01 98.59
colesterolo tot., 1 0.01 98.59
colesterolo totale h 1 0.01 98.60
colloquio psicologico 1 0.01 98.61
(output omitted )
hdl col 1 0.01 99.22
(output omitted )
visita specialistica colonscopia con bi 1 0.01 99.99
yersinia coltura feci 1 0.01 100.00
Total 12,595 100.00
Again screening detects new cases: 2,034 cases characterized by the abbreviation
col tot (that is, total cholesterol) that are impossible to identify without further re-
ducing the number of letters. The problem is that, among all matched cases (12,595),
there are also a number of unwanted cases, that is, cases containing the same spelling
of the keyword but related to another type of diagnostic test. Despite this incorrect
identification, we will show later in the section how to obtain a new recoded variable
by specifying the appropriate recoding rule as an argument of the recode() option.
The number of letters you match plays a critical role: specifying a high number
of letters may cause the number of matched observations to be artificially low due to
mistakes or abbreviations in the source variables; on the other hand, matching a small
number of letters may cause the number of matched observations to be artificially high
due to the inclusion of uninteresting cases containing the too short keyword.
As mentioned above, we are interested in the identification of three types of choles-
terol tests. To achieve this objective, in what follows we focus on a set of four keywords
(totale, colesterolo, ldl, hdl) with three identifying letters. We also specify the
newcode() option to generate a new variable recoding the observations that match the
specified keywords.
At this point, we describe the recoding mechanism of screening in more detail:
If newcode() is specified, a new variable is generated, taking as values the position
of the keywords or regular expressions specified through the keys() option. The
coding process is driven by the order of keywords or regular expressions.
If recode() is specified, the variable created by newcode() is recoded according to
the user-defined coding scheme.
Thus a first recoding of the source variable can be obtained as follows:
. screening, sources(diagn_test_descr, lower)
> keys("totale" "colesterolo" "ldl" "hdl") letters(3 3 3 3) explore(count)
> newcode(tmp_diagn_test_code)
Source Key Freq. Percent
diagn_test_descr tot 7304 29.47
col 12595 50.81
ldl 2015 8.13
hdl 2872 11.59
Total 24786 100.00
. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.
1 7,304 49.15 49.15
2 7,535 50.70 99.85
3 12 0.08 99.93
4 11 0.07 100.00
Total 14,862 100.00
The explore(count) option instructs screening to display a table of frequency
counts of all matched cases. The newcode() option creates tmp_diagn_test_code, which
is a new variable that takes as values the position of the keywords or regular expressions
specified through the keys() option. The coding process is driven by the order of
keywords or regular expressions: the number 1 is associated with the 7,304 observations
matching the first keyword, tot; the number 2 is associated with the 7,535 observations
matching the second keyword, col; and so on. Hence, by specifying keys("totale"
"colesterolo" "ldl" "hdl") together with letters(3 3 3 3), tot takes precedence
over col in the recoding process. This means that if some observations are recoded
according to the first keyword match, they will not be recoded according to the following
keywords in the keys() list, even if they match.
For this reason, the best recoding strategy is to first specify keywords that uniquely
identify the cases of interest. Because keywords hdl and ldl each uniquely identify a
cholesterol test, they must have priority in the recoding process over totale, which is
an extension common to other pathologies.
Indeed, when we reverse the order of the keywords and specify the replace suboption
in the newcode() option, screening produces
. screening, sources(diagn_test_descr, lower)
> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(tmp_diagn_test_code, replace)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.
1 2,872 19.32 19.32
2 2,015 13.56 32.88
3 7,731 52.02 84.90
4 2,244 15.10 100.00
Total 14,862 100.00
where the newcode() variable now identifies all hdl and ldl cases. Notice that here
we followed the correct approach, from specific to general. Moreover, as shown by the
following code, when we specify the newcode() suboption label, screening attaches
the specified keywords as value labels to the newcode() variable.
. screening, sources(diagn_test_descr, lower)
> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(tmp_diagn_test_code, replace label)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
. tabulate tmp_diagn_test_code
tmp_diagn_t
est_code Freq. Percent Cum.
hdl 2,872 19.32 19.32
ldl 2,015 13.56 32.88
colesterolo 7,731 52.02 84.90
totale 2,244 15.10 100.00
Total 14,862 100.00
The last step toward recoding is achieved by using the recode() option. This option
allows you to recode the newcode() variable according to a user-defined coding scheme.
When you specify this option, the coding process is completely under your control.
The recode() option requires a recoding rule followed by a "user defined code" (the
"user defined code" must be enclosed within double quotes).
When we specify recode(1 "90.14.1" ...), the standard code "90.14.1" will
be used to recode all matched cases from the first keyword (hdl); when we specify
recode(... 2 "90.14.2" ...), the standard code "90.14.2" will be used to recode
all matched cases from the second keyword (ldl); and so on. The third and fourth
keywords deserve special attention. totale (which was specified as the fourth keyword,
hence position 4) is a common extension that we want to identify only when it is
matched simultaneously with colesterolo (which was specified as the third keyword,
hence position 3). Thus the appropriate syntax in this case will be recode(... 3,4
"90.14.3" ...). Finally, when we specify recode(... 3 "not class. tests"), the
code "not class. tests" will be used to recode all matched cases from the third
keyword (colesterolo) that are not classified because they do not contain any further
specification.
The final syntax of our example is
. screening, sources(diagn_test_descr, lower)
> keys("hdl" "ldl" "colesterolo" "totale") letters(3 3 3 3)
> newcode(diagn_test_code)
> recode(1 "90.14.1" 2 "90.14.2" 3,4 "90.14.3" 3 "not class. tests")
. tabulate diagn_test_code
diagn_test_code Freq. Percent Cum.
90.14.1 2,872 22.76 22.76
90.14.2 2,015 15.97 38.73
90.14.3 5,055 40.06 78.79
not class. tests 2,676 21.21 100.00
Total 12,618 100.00
As the tabulate command shows, the new variable diagn_test_code is created
according to the user-defined codes. Notice that only 5,055 cases are coded as total
cholesterol (90.14.3). A two-way tabulate command (below) helps to highlight that
2,244 cases have to be considered incorrect identifications, that is, cases containing the
same spelling of the keywords (totale) but related to other types of diagnostic tests,^4
whereas 2,676 are incomplete because they contain only colesterolo without further
specification.
. tabulate diagn_test_code tmp_diagn_test_code if tmp_diagn_test_code !=., m
tmp_diagn_test_code
diagn_test_code hdl ldl colestero totale Total
0 0 0 2,244 2,244
90.14.1 2,872 0 0 0 2,872
90.14.2 0 2,015 0 0 2,015
90.14.3 0 0 5,055 0 5,055
not class. tests 0 0 2,676 0 2,676
Total 2,872 2,015 7,731 2,244 14,862
This example shows that screening is a simple tool to manage complex string vari-
ables. Once you have obtained structured data (in our example, a categorical variable
indicating cholesterol tests), you can finally start your statistical analysis.
4. Because of space restrictions, we deliberately omit the tabulation of such cases. It is available upon
request.
5 Extensions
Although the main utility of screening is the direct translation of complex narrative-
text variables in a user-dened coding scheme, the command is exible enough to cover
a wide range of situations. In section 5.1, we present an example of how to use the
command to facilitate the merging of information from dierent sources, while in sec-
tion 5.2, we show how to use screening to extract or rearrange a portion of a string
variable.
5.1 Merging from different sources
In applied studies, a classic problem comes from trying to merge information from dif-
ferent sources that use different codes for the same units. A recently released command,
kountry (Raciborski 2008), is an important step toward a solution.
The kountry command can be used to facilitate the merging of information from
different sources by recoding a string variable into a standardized form. This recoding is
possible using a custom dictionary created through a helper command.^5 In this section,
we show an alternative way to merge information from different sources by using the
screening command.
As an example, we try to merge two Italian datasets, one provided by the National
Statistical Office (National Institute of Statistics in Italy) and the other provided by the
Italian Ministry of the Interior. The two datasets contain, for each Italian municipality,
the complete name and an alphanumeric code, the latter being different across sources.
In theory, with the (uniquely identified) name of each municipality, it should be easy to
merge the two datasets.
We first proceed by matching the two original datasets:
. use istat, clear
. sort comune
. merge m:m comune using ministero
(output omitted )
. tabulate _merge
_merge Freq. Percent Cum.
master only (1) 288 3.43 3.43
using only (2) 290 3.46 6.89
matched (3) 7,812 93.11 100.00
Total 8,390 100.00
5. See help kountryadd (if kountry is installed).
As you can see, there are 288 inconsistencies.^6 When we tabulate the unmatched
cases, we realize that unconventional expressions, like apostrophes, accents, dou-
ble names, etc., are responsible for this imperfect result:
. preserve
. sort comune
. drop if _merge==3
(7812 observations deleted)
. list comune _merge in 1/20, separator(20) noobs
comune _merge
AGLIE 2
AGLIÈ 1
ALA DEI SARDI 2
ALBISOLA MARINA 2
ALBISOLA SUPERIORE 2
ALBISSOLA MARINA 1
ALBISSOLA SUPERIORE 1
ALI 2
ALI TERME 2
ALLUVIONI CAMBIO 2
ALLUVIONI CAMBIÒ 1
ALME 2
ALMÈ 1
ALÀ DEI SARDI 1
ALÌ 1
ALÌ TERME 1
ANTEY-SAINT-ANDRE 2
ANTEY-SAINT-ANDRÉ 1
APPIANO SULLA STRADA DEL 2
APPIANO SULLA STRADA DEL VINO 1
. restore
If you wish to recover all 288 unmatched municipalities, the proposed command is
a simple and fast solution. Indeed, when you take advantage of the available options,
you can (almost) completely recover unmatched cases with only one command. As an
example, we recover nine cases (it is possible to recover all cases with this procedure),
with a loop running on values of merge equal to 1 or 2, that is, running only on
unmatched cases:
6. The number of unmatched cases is dierent between the master (288) and the using (290) datasets
because of aggregation and separation of municipalities. Solving this kind of problem is beyond
the illustrative scope of this example.
. forvalues i=1/2 {
2. preserve
3. keep if _merge==`i'
4.
. screening, sources(comune) keys("ALBISSOLA" "AQUILA D'ARROSCIA" "BAJARDO"
> "BARCELLONA" "BARZAN" "BRIGNANO" "CADERZONE" "CAVAGLI" "MARINA" "SUPERIORE")
> cases(cases) newcode(comune, replace)
> recode(1,9 "ALBISOLA MARINA" 1,10 "ALBISOLA SUPERIORE" 2 "AQUILA DI ARROSCIA"
> 3 "BAIARDO" 4 "BARCELLONA POZZO DI GOTTO" 5 "BARZANO" 6 "BRIGNANO FRASCATA"
> 7 "CAVAGLIA" 8 "CADERZONE TERME")
5. if `i'==1 drop codice_ente
6. if `i'==2 drop codice
7. keep comune codice
8. sort comune
9. save new_`i', replace
10. restore
11. }
(8102 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_1.dta not found)
file new_1.dta saved
(8100 observations deleted)
WARNING: By specifying -replace- sub-option you are overwriting the -newcode()-
> variable.
(note: file new_2.dta not found)
file new_2.dta saved
. keep if _merge==3
(578 observations deleted)
. save perfect_match, replace
(note: file perfect_match.dta not found)
file perfect_match.dta saved
. use new_1, clear
. merge 1:1 comune using new_2
(output omitted )
. tabulate _merge
_merge Freq. Percent Cum.
master only (1) 279 49.03 49.03
using only (2) 281 49.38 98.42
matched (3) 9 1.58 100.00
Total 569 100.00
. append using perfect_match
. tabulate _merge
_merge Freq. Percent Cum.
master only (1) 279 3.33 3.33
using only (2) 281 3.35 6.68
matched (3) 7,821 93.32 100.00
Total 8,381 100.00
Because we deliberately recovered only nine cases, the number of exact matches
improves by nine, from 7,812 before the execution of screening to 7,821 afterward.
5.2 Extracting a piece of a string variable
In this section, we show through three examples how screening can be used to extract
or rearrange a portion of a string variable.^7
Example 1
Imagine you have the string variable address, and you want to create a new variable
that contains just the zip codes. Here is what the source variable address may look
like:
. list, noobs sep(10)
address
4905 Lakeway Drive, College Station, Texas 77845 USA
673 Jasmine Street, Los Angeles, CA 90024
2376 First street, San Diego, CA 90126
66666 West Central St, Tempe AZ 80068
12345 Main St. Cambridge, MA 01238-1234
12345 Main St Sommerville MA 01239-2345
12345 Main St Watertwon MA 01239 USA
To find the zip code, you have to use screening with specific regular expressions,
allowing it to exactly match all cases in the source variable address. Some examples
of specific regular expressions are the following:
([0-9][0-9][0-9][0-9][0-9]) to find a five-digit number, the zip code
[\-]* to match zero or more dashes, - or - -
[0-9]* to match zero or more numbers, that is, the zip code plus any other
numbers
[ a-zA-Z]* to match zero or more blank spaces and (lowercase or uppercase)
letters
Once the correct regular expression(s) is found, to use screening to create a new
variable containing the zip codes, you have to do the following:
7. The following examples have been taken from the UCLA website resources to help you learn and
use Stata. See http://www.ats.ucla.edu/stat/stata/faq/regex.htm.
1. Use the newcode() option to create the new variable zipcode.
2. Combine the above regular expressions and use them as a unique keyword.
3. Use the regexs(n) function as a "user defined code" in the recode() option.
regexs(n) returns the subexpression n from the respective keyword match, where
0 ≤ n ≤ 10. Stata regular-expression syntaxes use parentheses, (), to denote
a subexpression group. In particular, n = 0 is reserved for the entire string
that satisfied the regular expression (keyword); n = 1 is reserved for the first
subexpression that satisfied the regular expression (keyword); and so on.
Hence, you may code
. screening, sources(address)
> keys("([0-9][0-9][0-9][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")
> cases(c) newcode(zipcode) recode(1 "regexs(1)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesn't work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. tabulate zipcode
zipcode Freq. Percent Cum.
01238 1 14.29 14.29
01239 2 28.57 42.86
77845 1 14.29 57.14
80068 1 14.29 71.43
90024 1 14.29 85.71
90126 1 14.29 100.00
Total 7 100.00
where recode(1 "regexs(1)") indicates that
1. 1 is the recoding rule; that is, the coding process is related to the first (and unique) keyword match.
2. regexs(1) is used to recode. Indeed, it returns the string related to the first (and unique) subexpression match.8
As a result, the new variable zipcode is created by using only one line of code. Notice that screening warns you that you are matching a keyword containing one or more regular-expression operators.
8. Remember that subexpressions are denoted by using (). In the considered syntax, the only subexpression is represented by ([0-9][0-9][0-9][0-9][0-9]). This means that, in this case, you cannot specify n > 1.
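As a cross-check, the same extraction can be written directly with Stata's regexm() and regexs() functions, in the style of example 3 below; this one-line sketch is not part of the screening syntax, and the variable name zipcode2 is ours:
. generate zipcode2 = regexs(1) if regexm(address, "([0-9][0-9][0-9][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")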
Example 2
Suppose you have a variable containing a person's full name. Here is what the variable fullname looks like:
. list, noobs sep(10)
fullname
John Adams
Adam Smiths
Mary Smiths
Charlie Wade
Our goal is to swap first name with last name, separating them by a comma. The regular expression to reach the target is (([a-zA-Z]+)[ ]*([a-zA-Z]+)). It is composed of three parts:
1. ([a-zA-Z]+) to capture a string consisting of letters (lowercase and uppercase), that is, the first name
2. [ ]* to match with a space(s), that is, the blank between first and last name
3. ([a-zA-Z]+) again to capture a string consisting of letters, this time the last name
The following is a way to proceed using screening:
. screening, sources(fullname)
> keys("([a-zA-Z]+)[ ]*([a-zA-Z]+)" "[ ]" "([a-zA-Z]+)[ ]*([a-zA-Z]+)")
> newcode(fullname, add replace) recode(1 "regexs(2)," 2 "regexs(0)"
> 3 "regexs(1)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesnt work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list fullname, noobs sep(10)
fullname
Adams, John
Smiths, Adam
Smiths, Mary
Wade, Charlie
Notice the newcode() suboption add. It can be specified only when a regexs(n) function is specified as a "user defined code" in the recode() option. The add suboption allows for the creation of the newcode() variable as a concatenation of subexpressions returned by regexs(n). In the example above,
1. recode(1 "regexs(2)," ... returns the second subexpression from the first keyword match (the last name) plus a comma.
2. ...2 "regexs(0)" ... returns the blank matched by the second keyword.
3. ...3 "regexs(1)") returns the first subexpression from the third keyword match (the first name).
As a result, the variable fullname is replaced (note the suboption replace) sequentially by the concatenation of subexpressions returned by 1, 2, and 3 above.
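For comparison, the same swap can be written directly with regexm() and regexs(); this one-line sketch (with fullname2 a name we introduce here) is not part of the screening syntax:
. generate fullname2 = regexs(2) + ", " + regexs(1) if regexm(fullname, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")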
Example 3
Imagine that you have the string variable date containing dates:
. list date, noobs sep(20)
date
20jan2007
16June06
06sept1985
21june04
4july90
9jan1999
6aug99
19august2003
The goal is to produce a string variable with the appropriate four-digit year for each
case, which Stata can easily convert into a date. You can achieve the target by coding
something like the following:
. generate day = regexs(0) if regexm(date, "^[0-9]+")
. generate month = regexs(0) if regexm(date, "[a-zA-Z]+")
. generate year = regexs(0) if regexm(date, "[0-9]*$")
. replace year = "20"+regexs(0) if regexm(year, "^[0][0-9]$")
(2 real changes made)
. replace year = "19"+regexs(0) if regexm(year, "^[1-9][0-9]$")
(2 real changes made)
. generate date1 = day+month+year
. list, noobs sep(10)
date day month year date1
20jan2007 20 jan 2007 20jan2007
16June06 16 June 2006 16June2006
06sept1985 06 sept 1985 06sept1985
21june04 21 june 2004 21june2004
4july90 4 july 1990 4july1990
9jan1999 9 jan 1999 9jan1999
6aug99 6 aug 1999 6aug1999
19august2003 19 august 2003 19august2003
Alternatively, you can obtain the same result by using screening:
. screening, sources(date) keys("^[0-9]+" "[a-zA-Z]+" "[0][0-9]$" "[1-9][0-9]$")
> newcode(date1, add)
> recode(1 "regexs(0)" 2 "regexs(0)" 3 "20+regexs(0)" 4 "19+regexs(0)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesnt work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list date date1, noobs sep(10)
date date1
20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003
Also in this case, as in the previous example, we specify the newcode() suboption add
because we need to create the newcode() variable as a concatenation of subexpressions
from keyword matching. The same result can be obtained using the following syntax:
. screening, sources(date)
> keys(begin "[0-9]+" "[a-zA-Z]+" end "[0][0-9]" end "[1-9][0-9]")
> newcode(date1, add)
> recode(1 "regexs(0)" 2 "regexs(0)" 3 "20+regexs(0)" 4 "19+regexs(0)")
WARNING! You are SCREENING some keywords using regular-expression operators
> like ^ . ( ) [ ] ? *
Notice that:
1) Option -letter- doesnt work IF a keyword contains regular-expression operators
2) Unless you are looking for a specific regular-expression, regular-expression
operators must be preceded by a backslash \ to ensure keyword-matching
(e.g. \^ \. )
3) To match a keyword containing $ or \, you have to specify them as [\$] [\\]
. list date date1, noobs sep(10)
date date1
20jan2007 20jan2007
16June06 16June2006
06sept1985 06sept1985
21june04 21june2004
4july90 4july1990
9jan1999 9jan1999
6aug99 6aug1999
19august2003 19august2003
where the only difference is represented by the way in which the matching rule is specified: begin instead of ^ and end instead of $.
6 Summary
In this article, we introduced the new screening command, a data-management tool that helps you examine and treat the content of string variables containing free, possibly complex, narrative text. screening allows you to build new variables, to recode new or existing variables, and to build a set of categorical variables indicating keyword occurrences (a first step toward textual analysis). Considerable efforts were devoted to making the command as flexible as possible; thus screening contains a rich set of options that is intended to cover the most frequently encountered problems and necessities. Because of this flexibility, the command can be used in many different fields, like EPR data, data from different sources, or survey data. The execution of screening is fast, thanks to Mata programming; its syntax is simple and common to many other Stata commands, thus it is useful for all users regardless of their levels of experience in Stata. We especially recommend that you use the explore() option; it makes the command a useful data-mining tool. Nevertheless, expert users can exploit a more complicated syntax that substantially eases the preparatory burden for data cleaning.
Acknowledgments
We would like to thank Alice Cortignani, Rossana D'Amico, Andrea Piano Mortari, and Riccardo Zecchinelli who tested the command, Vincenzo Atella who read an earlier version of the article, Iacopo Cricelli who provided us with EPR data, and Rafal Raciborski for useful discussions. We are also grateful to David Drukker and all participants
at the 2009 Italian Stata Users Group meeting. Finally, the suggestions made by the
referee and the editor were useful to improve the command. We are responsible for any
remaining errors.
About the authors
Federico Belotti is a PhD student in econometrics and empirical economics at the University
of Rome Tor Vergata.
Domenico Depalo is a researcher in the Economic Research Department of the Bank of Italy
in Rome. He received his PhD in econometrics and empirical economics from the University
of Rome Tor Vergata and was enrolled in a Post Doc program at the University of Rome La
Sapienza.
The Stata Journal (2010)
10, Number 3, pp. 482–495
Speaking Stata: The limits of sample skewness
and kurtosis
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
n.j.cox@durham.ac.uk
Abstract. Sample skewness and kurtosis are limited by functions of sample size.
The limits, or approximations to them, have repeatedly been rediscovered over
the last several decades, but nevertheless seem to remain only poorly known. The
limits impart bias to estimation and, in extreme cases, imply that no sample could
bear exact witness to its parent distribution. The main results are explained in a
tutorial review, and it is shown how Stata and Mata may be used to confirm and
explore their consequences.
Keywords: st0204, descriptive statistics, distribution shape, moments, sample size,
skewness, kurtosis, lognormal distribution
1 Introduction
The use of moment-based measures for summarizing univariate distributions is long
established. Although there are yet longer roots, Thorvald Nicolai Thiele (1889) used
mean, standard deviation, variance, skewness, and kurtosis in recognizably modern
form. Appreciation of his work on moments remains limited, for all too understandable
reasons. Thiele wrote mostly in Danish, he did not much repeat himself, and he tended
to assume that his readers were just about as smart as he was. None of these habits
could possibly ensure rapid worldwide dissemination of his ideas. Indeed, it was not
until the 1980s that much of Thiele's work was reviewed in or translated into English
(Hald 1981; Lauritzen 2002).
Thiele did not use all the now-standard terminology. The names standard deviation,
skewness, and kurtosis we owe to Karl Pearson, and the name variance we owe to Ronald
Aylmer Fisher (David 2001). Much of the impact of moments can be traced to these
two statisticians. Pearson was a vigorous proponent of using moments in distribution
curve fitting. His own system of probability distributions pivots on varying skewness, measured relative to the mode. Fisher's advocacy of maximum likelihood as a superior estimation method was combined with his exposition of variance as central to statistical thinking. The many editions of Fisher's 1925 text Statistical Methods for Research
Workers, and of texts that in turn drew upon its approach, have introduced several
generations to the ideas of skewness and kurtosis. Much more detail on this history is
given by Walker (1929), Hald (1998, 2007), and Fiori and Zenga (2009).
© 2010 StataCorp LP st0204
Whatever the history and the terminology, a simple but fundamental point deserves
emphasis. A name like skewness has a very broad interpretation as a vague concept of
distribution symmetry or asymmetry, which can be made precise in a variety of ways
(compare with Mosteller and Tukey [1977]). Kurtosis is even more enigmatic: some
authors write of kurtosis as peakedness and some write of it as tail weight, but the
skeptical interpretation that kurtosis is whatever kurtosis measures is the only totally
safe story. Numerical examples given by Irving Kaplansky (1945) alone suffice to show that kurtosis bears neither interpretation unequivocally.1
To the present, moments have been much disapproved, and even disproved, by mathematical statisticians who show that in principle moments may not even exist, and by data analysts who know that in practice moments may not be robust. Nevertheless in many quarters, they survive, and they even thrive. One of several lively fields making much use of skewness and kurtosis measures is the analysis of financial time series (for example, Taylor [2005]).
In this column, I will publicize one limitation of certain moment-based measures, in a
double sense. Sample skewness and sample kurtosis are necessarily bounded by functions
of sample size, imparting bias to the extent that small samples from skewed distributions
may even deny their own parentage. This limitation has been well established and
discussed in several papers and a few texts, but it still appears less widely known than
it should be. Presumably, it presents a complication too far for most textbook accounts.
The presentation here will include only minor novelties but will bring the key details
together in a coherent story and give examples of the use of Stata and Mata to confirm
and explore for oneself the consequences of a statistical artifact.
2 Deductions
2.1 Limits on skewness and kurtosis
Given a sample of n values y_1, . . . , y_n and sample mean ȳ = Σ_{i=1}^n y_i / n, sample moments measured about the mean are at their simplest defined as averages of powered deviations

m_r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^r}{n}

so that m_2 and s = √m_2 are versions of, respectively, the sample variance and sample standard deviation.
Here sample skewness is defined as

\frac{m_3}{m_2^{3/2}} = \frac{m_3}{s^3} = \sqrt{b_1} = g_1
1. Kaplansky's paper is one of a few that he wrote in the mid-1940s on probability and statistics. He is much better known as a distinguished algebraist (Bass and Lam 2007; Kadison 2008).
while sample kurtosis is defined as

\frac{m_4}{m_2^{2}} = \frac{m_4}{s^4} = b_2 = g_2 + 3
Hence, both of the last two measures are scaled or dimensionless: Whatever units of measurement were used appear raised to the same powers in both numerator and denominator, and so cancel out. The commonly used m, s, b, and g notation corresponds to a longstanding μ, σ, β, and γ notation for the corresponding theoretical or population quantities. If 3 appears to be an arbitrary constant in the last equation, one explanation starts with the fact that normal or Gaussian distributions have β_1 = 0 and β_2 = 3; hence, γ_2 = 0.
Naturally, if y is constant, then m_2 is zero; thus skewness and kurtosis are not defined. This includes the case of n = 1. The stipulations that y is genuinely variable and that n ≥ 2 underlie what follows.
Newcomers to this territory are warned that usages in the statistical literature vary considerably, even among entirely competent authors. This variation means that different formulas may be found for the same terms, skewness and kurtosis, and different terms for the same formulas. To start at the beginning: Although Karl Pearson introduced the term skewness, and also made much use of β_1, he used skewness to refer to (mean − mode)/standard deviation, a quantity that is well defined in his system of distributions. In more recent literature, some differences reflect the use of divisors other than n, usually with the intention of reducing bias, and so resembling in spirit the common use of n − 1 as an alternative divisor for sample variance. Some authors call γ_2 (or g_2) the kurtosis, while yet other variations may be found.
The key results for this column were extensively discussed by Wilkins (1944) and Dalen (1987). Clearly, g_1 may be positive, zero, or negative, reflecting the sign of m_3. Wilkins (1944) showed that there is an upper limit to its absolute value,

|g_1| \leq \frac{n - 2}{\sqrt{n - 1}} \qquad (1)

as was also independently shown by Kirby (1974). In contrast, b_2 must be positive and indeed (as may be shown, for example, using the Cauchy–Schwarz inequality) must be at least 1. More pointedly, Dalen (1987) showed that there is also an upper limit to its value:

b_2 \leq \frac{n^2 - 3n + 3}{n - 1} \qquad (2)

The proofs of these inequalities are a little too long, and not quite interesting enough, to reproduce here.
Both of these inequalities are sharp, meaning attainable. Test cases to explore the precise limits have all values equal to some constant, except for one value that is equal to another constant: n = 2, y_1 = 0, y_2 = 1 will do fine as a concrete example, for which skewness is 0/1 = 0 and kurtosis is (4 − 6 + 3)/1 = 1.
For n = 2, we can rise above a mere example to show quickly that these results are indeed general. The mean of two distinct values is halfway between them so that the two deviations y_i − ȳ have equal magnitude and opposite sign. Thus their cubes have sum 0, and m_3 and b_1 are both identically equal to 0. Alternatively, such values are geometrically two points on the real line, a configuration that is evidently symmetric around the mean in the middle, so skewness can be seen to be zero without any calculations. The squared deviations have an average equal to {(y_1 − y_2)/2}^2, and their fourth powers have an average equal to {(y_1 − y_2)/2}^4, so b_2 is identically equal to 1.
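The same limiting case can be checked in Stata itself; this small sketch (ours) anticipates the fuller Mata confirmation in section 3:
. clear
. set obs 2
. generate y = _n - 1
. summarize y, detail
With these two values, summarize, detail reports skewness 0 and kurtosis 1, the limiting values just derived.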
To see how the upper limit behaves numerically, we can rewrite (1) as

|g_1| \leq \sqrt{n - 1} - \frac{1}{\sqrt{n - 1}}

so that as sample size n increases, first √(n − 1) and then √n become acceptable approximations. Similarly, we can rewrite (2) as

b_2 \leq n - 2 + \frac{1}{n - 1}

from which, in large samples, first n − 2 and then n become acceptable approximations.
As it happens, these limits established by Wilkins and Dalen sharpen up on the results of other workers. Limits of √n and n (the latter when n is greater than 3) were established by Cramer (1946, 357). Limits of √(n − 1) and n were independently established by Johnson and Lowe (1979); Kirby (1981) advertised work earlier than theirs (although not earlier than that of Wilkins or Cramer). Similarly, Stuart and Ord (1994, 121–122) refer to the work of Johnson and Lowe (1979), but overlook the sharper limits.2
There is yet another twist in the tale. Pearson (1916, 440) refers to the limit (2), which he attributes to George Neville Watson, himself later a distinguished contributor to analysis (but not to be confused with the statistician Geoffrey Stuart Watson), and to a limit of n − 1 on b_1, equivalent to a limit of √(n − 1) on g_1. Although Pearson was the author of the first word on this subject, his contribution appears to have been uniformly overlooked by later authors. However, he dismissed these limits as without practical importance, which may have led others to downplay the whole issue.
In practice, we are, at least at first sight, less likely to care much about these limits for large samples. It is the field of small samples in which limits are more likely to cause problems, and sometimes without data analysts even noticing.
2. The treatise of Stuart and Ord is in line of succession, with one offset, from Yule (1911). Despite that distinguished ancestry, it contains some surprising errors as well as the compendious collection of results that makes it so useful. To the statement that mean, median, and mode differ in a skewed distribution (p. 48), counterexamples are 0, 0, 1, 1, 1, 1, 3, and the binomial \binom{10}{k} 0.1^k 0.9^{10-k}, k = 0, . . . , 10. For both of these skewed counterexamples, mean, median, and mode coincide at 1. To the statement that they coincide in a symmetric distribution (p. 108), counterexamples are any symmetric distribution with an even number of modes.
2.2 An aside on coefficient of variation
The literature contains similar limits related to sample size on other sample statistics. For example, the coefficient of variation is the ratio of standard deviation to mean, or s/ȳ. Katsnelson and Kotz (1957) proved that so long as all y_i ≥ 0, then the coefficient of variation cannot exceed √(n − 1), a result mentioned earlier by Longley (1952). Cramer (1946, 357) proved a less sharp result, and Kirby (1974) proved a less general result.
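As a quick check of this bound, not in the original column, the following lines build a sample of n = 10 values with nine zeros and a single one; its coefficient of variation, computed with the n divisor used for the moments above, attains √(n − 1):
. clear
. set obs 10
. generate y = _n == 1
. quietly summarize y
. display r(sd)*sqrt((r(N)-1)/r(N))/r(mean)   // coefficient of variation with the n divisor
. display sqrt(r(N) - 1)                      // Katsnelson-Kotz upper limit
The factor sqrt((r(N)-1)/r(N)) converts Stata's n − 1 divisor standard deviation to the n divisor version used in the definitions above.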
3 Confirmations
[R] summarize confirms that skewness √b_1 and kurtosis b_2 are calculated in Stata precisely as above. There are no corresponding Mata functions at the time of this writing, but readers interested in these questions will want to start Mata to check their own understanding. One example to check is
. sysuse auto, clear
(1978 Automobile Data)
. summarize mpg, detail
Mileage (mpg)
Percentiles Smallest
1% 12 12
5% 14 12
10% 14 14 Obs 74
25% 18 14 Sum of Wgt. 74
50% 20 Mean 21.2973
Largest Std. Dev. 5.785503
75% 25 34
90% 29 35 Variance 33.47205
95% 34 35 Skewness .9487176
99% 41 41 Kurtosis 3.975005
The detail option is needed to get skewness and kurtosis results from summarize.
We will not try to write a bulletproof skewness or kurtosis function in Mata, but we
will illustrate its use calculator-style. After entering Mata, a variable can be read into
a vector. It is helpful to have a vector of deviations from the mean to work on.
. mata :
mata (type end to exit)
: y = st_data(., "mpg")
: dev = y :- mean(y)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
.9487175965
: mean(dev:^4) / (mean(dev:^2)):^2
3.975004596
So those examples at least check out. Those unfamiliar with Mata might note that the colon prefix, as in :- or :^, merely flags an elementwise operation. Thus for example, mean(y) returns a constant, which we wish to subtract from every element of a data vector.
Mata may be used to check simple limiting cases. The minimal dataset (0, 1) may be entered in deviation form. After doing so, we can just repeat earlier lines to calculate √b_1 and b_2:
: dev = (.5 \ -.5)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
0
: mean(dev:^4) / (mean(dev:^2)):^2
1
Mata may also be used to see how the limits of skewness and kurtosis vary with
sample size. We start out with a vector containing some sample sizes. We then calculate
the corresponding upper limits for skewness and kurtosis and tabulate the results. The
results are mapped to strings for tabulation with reasonable numbers of decimal places.
: n = (2::20\50\100\500\1000)
: skew = sqrt(n:-1) :- (1:/(n:-1))
: kurt = n :- 2 + (1:/(n:-1))
: strofreal(n), strofreal((skew, kurt), "%4.3f")
1 2 3
1 2 0.000 1.000
2 3 0.914 1.500
3 4 1.399 2.333
4 5 1.750 3.250
5 6 2.036 4.200
6 7 2.283 5.167
7 8 2.503 6.143
8 9 2.703 7.125
9 10 2.889 8.111
10 11 3.062 9.100
11 12 3.226 10.091
12 13 3.381 11.083
13 14 3.529 12.077
14 15 3.670 13.071
15 16 3.806 14.067
16 17 3.938 15.062
17 18 4.064 16.059
18 19 4.187 17.056
19 20 4.306 18.053
20 50 6.980 48.020
21 100 9.940 98.010
22 500 22.336 498.002
23 1000 31.606 998.001
The second and smaller term in each expression for (1) and (2) is 1/(n − 1). Although
the calculation is, or should be, almost mental arithmetic, we can see how quickly this
term shrinks so much that it can be neglected:
: strofreal(n), strofreal(1 :/ (n :- 1), "%4.3f")
1 2
1 2 1.000
2 3 0.500
3 4 0.333
4 5 0.250
5 6 0.200
6 7 0.167
7 8 0.143
8 9 0.125
9 10 0.111
10 11 0.100
11 12 0.091
12 13 0.083
13 14 0.077
14 15 0.071
15 16 0.067
16 17 0.062
17 18 0.059
18 19 0.056
19 20 0.053
20 50 0.020
21 100 0.010
22 500 0.002
23 1000 0.001
: end
These calculations are equally easy in Stata when you start with a variable containing
sample sizes.
4 Explorations
In statistical science, we use an increasing variety of distributions. Even when closed-
form expressions exist for their moments, which is far from being universal, the need
to estimate parameters from sample data often arises. Thus the behavior of sample
moments and derived measures remains of key interest. Even if you do not customarily
use, for example, summarize, detail to get skewness and kurtosis, these measures may
well underlie your favorite test for normality.
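For instance, Stata's sktest, a skewness-and-kurtosis test of normality, draws directly on these sample quantities; a one-line illustration (ours) with the auto data:
. sysuse auto, clear
. sktest mpg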
The limits on sample skewness and kurtosis impart the possibility of bias whenever
the upper part of their sampling distributions is cut off by algebraic constraints. In
extreme cases, a sample may even deny the distribution that underlies it, because it is
impossible for any sample to reproduce the skewness and kurtosis of its parent.
These questions may be explored by simulation. Lognormal distributions offer simple but striking examples. We call a distribution for y lognormal if ln y is normally distributed. Those who prefer to call normal distributions by some other name (Gaussian, notably) have not noticeably affected this terminology. Similarly, for some people the terminology is backward, because a lognormal distribution is an exponentiated normal distribution. Protest is futile while the term lognormal remains entrenched.
If ln y has mean μ and standard deviation σ, its skewness and kurtosis may be defined in terms of exp(σ²) = ω (Johnson, Kotz, and Balakrishnan 1994, 212):

\sqrt{\beta_1} = \sqrt{\omega - 1}\,(\omega + 2); \qquad \beta_2 = \omega^4 + 2\omega^3 + 3\omega^2 - 3

Differently put, skewness and kurtosis depend on σ alone; μ is a location parameter for the lognormal as well as the normal.
[R] simulate already has a worked example of the simulation of lognormals, which
we can adapt slightly for the present purpose. The program there called lnsim merely
needs to be modified by adding results for skewness and kurtosis. As before, summarize,
detail is now the appropriate call. Before simulation, we (randomly, capriciously, or
otherwise) choose a seed for random-number generation:
. clear all
. program define lnsim, rclass
1. version 11.1
2. syntax [, obs(integer 1) mu(real 0) sigma(real 1)]
3. drop _all
4. set obs `obs'
5. tempvar z
6. gen `z' = exp(rnormal(`mu',`sigma'))
7. summarize `z', detail
8. return scalar mean = r(mean)
9. return scalar var = r(Var)
10. return scalar skew = r(skewness)
11. return scalar kurt = r(kurtosis)
12. end
. set seed 2803
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50) mu(-3) sigma(7)
command: lnsim, obs(50) mu(-3) sigma(7)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
We are copying here the last example from help simulate, a lognormal for which μ = −3, σ = 7. While a lognormal may seem a fairly well-behaved distribution, a quick calculation shows that with these parameter choices, the skewness is about 8 × 10^31 and the kurtosis about 10^85, which no sample result can possibly come near! The previously discussed limits are roughly 7 for skewness and 48 for kurtosis for this sample size. Here are the Mata results:
. mata
mata (type end to exit)
: omega = exp(49)
: sqrt(omega - 1) * (omega + 2)
8.32999e+31
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
1.32348e+85
: n = 50
: sqrt(n:-1) :- (1:/(n:-1)), n :- 2 + (1:/(n:-1))
1 2
1 6.979591837 48.02040816
: end
Sure enough, calculations and a graph (shown as figure 1) show the limits of 7 and
48 are biting hard. Although many graph forms would work well, I here choose qplot
(Cox 2005) for quantile plots.
. summarize
Variable Obs Mean Std. Dev. Min Max
mean 10000 1.13e+09 1.11e+11 1.888205 1.11e+13
var 10000 6.20e+23 6.20e+25 42.43399 6.20e+27
skew 10000 6.118604 .9498364 2.382902 6.857143
kurt 10000 40.23354 10.06829 7.123528 48.02041
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.98)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2
[Figure 1 omitted: quantile plots of skewness and kurtosis against fraction of the data, with horizontal reference lines at the upper limits]
Figure 1. Sampling distributions of skewness and kurtosis for samples of size 50 from a lognormal with μ = −3, σ = 7. Upper limits are shown by horizontal lines.
The natural comment is that the parameter choices in this example are a little extreme, but the same phenomenon occurs to some extent even with milder choices. With the default μ = 0, σ = 1, the skewness and kurtosis are less explosively high, but still very high by many standards. We clear the data and repeat the simulation, but this time we use the default values.
. clear
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50)
command: lnsim, obs(50)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
Within Mata, we can recalculate the theoretical skewness and kurtosis. The limits
to sample skewness and kurtosis remain the same, given the same sample size n = 50.
. mata
mata (type end to exit)
: omega = exp(1)
: sqrt(omega - 1) * (omega + 2)
6.184877139
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
113.9363922
: end
The problem is more insidious with these parameter values. The sampling distributions look distinctly skewed (shown in figure 2) but are not so obviously truncated. Only when the theoretical values for skewness and kurtosis are considered is it obvious that the estimations are seriously biased.
. summarize
Variable Obs Mean Std. Dev. Min Max
mean 10000 1.657829 .3106537 .7871802 4.979507
var 10000 4.755659 7.43333 .3971136 457.0726
skew 10000 2.617803 1.092607 .467871 6.733598
kurt 10000 11.81865 7.996084 1.952879 46.89128
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.98)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2
[Figure 2 omitted: quantile plots of skewness and kurtosis against fraction of the data, with horizontal reference lines at the upper limits]
Figure 2. Sampling distributions of skewness and kurtosis for samples of size 50 from a lognormal with μ = 0, σ = 1. Upper limits are shown by horizontal lines.
Naturally, these are just token simulations, but a way ahead should be clear. If
you are using skewness or kurtosis with small (or even large) samples, simulation with
some parent distributions pertinent to your work is a good idea. The simulations of
Wallis, Matalas, and Slack (1974) in particular pointed to empirical limits to skewness,
which Kirby (1974) then established independently of previous work.3
5 Conclusions
This story, like any other, lies at the intersection of many larger stories. Many statistically minded people make little or no use of skewness or kurtosis, and this paper may have confirmed them in their prejudices. Some readers may prefer to see this as another argument for using quantiles or order statistics for summarization (Gilchrist 2000; David and Nagaraja 2003). Yet others may know that L-moments offer an alternative approach (Hosking 1990; Hosking and Wallis 1997).
Arguably, the art of statistical analysis lies in choosing a model successful enough to ensure that the exact form of the distribution of some response variable, conditional on the predictors, is a matter of secondary importance. For example, in the simplest regression situations, an error term for any really good model is likely to be fairly near normally distributed, and thus not a source of worry. But authorities and critics differ over how far that is a deductive consequence of some flavor of central limit theorem or a naive article of faith that cries out for critical evaluation.
3. Connoisseurs of offbeat or irreverent titles might like to note some other papers by the same team: Mandelbrot and Wallis (1968), Matalas and Wallis (1973), and Slack (1973).
More prosaically, it is a truism, but one worthy of assent, that researchers using statistical methods should know the strengths and weaknesses of the various items in the toolbox. Skewness and kurtosis, over a century old, may yet offer surprises, which a wide range of Stata and Mata commands may help investigate.
6 Acknowledgments
This column benefits from interactions over moments shared with Ian S. Evans and over
L-moments shared with Patrick Royston.
7 References
Bass, H., and T. Y. Lam. 2007. Irving Kaplansky 1917–2006. Notices of the American Mathematical Society 54: 1477–1493.
Cox, N. J. 2005. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.
Cramer, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Dalen, J. 1987. Algebraic bounds on standardized sample moments. Statistics & Probability Letters 5: 329–331.
David, H. A. 2001. First (?) occurrence of common terms in statistics and probability. In Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F. Edwards, 209–246. New York: Springer.
David, H. A., and H. N. Nagaraja. 2003. Order Statistics. Hoboken, NJ: Wiley.
Fiori, A. M., and M. Zenga. 2009. Karl Pearson and the origin of kurtosis. International Statistical Review 77: 40–50.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Gilchrist, W. G. 2000. Statistical Modelling with Quantile Functions. Boca Raton, FL: Chapman & Hall/CRC.
Hald, A. 1981. T. N. Thiele's contribution to statistics. International Statistical Review 49: 1–20.
Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
Hald, A. 2007. A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935. New York: Springer.
Hosking, J. R. M. 1990. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society, Series B 52: 105–124.
Hosking, J. R. M., and J. R. Wallis. 1997. Regional Frequency Analysis: An Approach Based on L-Moments. Cambridge: Cambridge University Press.
Johnson, M. E., and V. W. Lowe. 1979. Bounds on the sample skewness and kurtosis. Technometrics 21: 377–378.
Johnson, N. L., S. Kotz, and N. Balakrishnan. 1994. Continuous Univariate Distributions, Vol. 1. New York: Wiley.
Kadison, R. V. 2008. Irving Kaplansky's role in mid-twentieth century functional analysis. Notices of the American Mathematical Society 55: 216–225.
Kaplansky, I. 1945. A common error concerning kurtosis. Journal of the American Statistical Association 40: 259.
Katsnelson, J., and S. Kotz. 1957. On the upper limits of some measures of variability. Archiv für Meteorologie, Geophysik und Bioklimatologie, Series B 8: 103–107.
Kirby, W. 1974. Algebraic boundedness of sample statistics. Water Resources Research 10: 220–222.
Kirby, W. 1981. Letter to the editor. Technometrics 23: 215.
Lauritzen, S. L. 2002. Thiele: Pioneer in Statistics. Oxford: Oxford University Press.
Longley, R. W. 1952. Measures of the variability of precipitation. Monthly Weather Review 80: 111–117.
Mandelbrot, B. B., and J. R. Wallis. 1968. Noah, Joseph, and operational hydrology. Water Resources Research 4: 909–918.
Matalas, N. C., and J. R. Wallis. 1973. Eureka! It fits a Pearson type 3 distribution. Water Resources Research 9: 281–289.
Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
Pearson, K. 1916. Mathematical contributions to the theory of evolution. XIX: Second supplement to a memoir on skew variation. Philosophical Transactions of the Royal Society of London, Series A 216: 429–457.
Slack, J. R. 1973. I would if I could (self-denial by conditional models). Water Resources Research 9: 247–249.
Stuart, A., and J. K. Ord. 1994. Kendall's Advanced Theory of Statistics. Volume 1: Distribution Theory. 6th ed. London: Arnold.
Taylor, S. J. 2005. Asset Price Dynamics, Volatility, and Prediction. Princeton, NJ: Princeton University Press.
Thiele, T. N. 1889. Forelæsninger over Almindelig Iagttagelseslære: Sandsynlighedsregning og Mindste Kvadraters Methode. Copenhagen: C. A. Reitzel. English translation included in Lauritzen 2002.
Walker, H. M. 1929. Studies in the History of Statistical Method: With Special Reference to Certain Educational Problems. Baltimore: Williams & Wilkins.
Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a moment! Water Resources Research 10: 211–219.
Wilkins, J. E. 1944. A note on skewness and kurtosis. Annals of Mathematical Statistics 15: 333–335.
Yule, G. U. 1911. An Introduction to the Theory of Statistics. London: Griffin.
About the author
Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,
postings, FAQs, and programs to the Stata user community. He has also coauthored 15 commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor
of the Stata Journal.
The Stata Journal (2010)
10, Number 3, pp. 496–499
Stata tip 89: Estimating means and percentiles following
multiple imputation
Peter A. Lachenbruch
Oregon State University
Corvallis, OR
peter.lachenbruch@oregonstate.edu
1 Introduction
In a statistical analysis, I usually want some basic descriptive statistics such as the mean,
standard deviation, extremes, and percentiles. See, for example, Pagano and Gauvreau
(2000). Stata conveniently provides these descriptive statistics with the summarize command's detail option. Alternatively, I can obtain percentiles with the centile
command. For example, with auto.dta, we have
. sysuse auto
(1978 Automobile Data)
. summarize price, detail
Price
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 74
25% 4195 3748 Sum of Wgt. 74
50% 5006.5 Mean 6165.257
Largest Std. Dev. 2949.496
75% 6342 13466
90% 11385 13594 Variance 8699526
95% 13466 14500 Skewness 1.653434
99% 15906 15906 Kurtosis 4.819188
However, if I have missing values, the summarize command is not supported by mi
estimate or by the user-written mim command (Royston 2004, 2005a,b, 2007; Royston,
Carlin, and White 2009).
2 Finding means and percentiles when missing values are
present
For a general multiple-imputation reference, see Stata 11 Multiple-Imputation Reference
Manual (2009). By recognizing that a regression with no independent variables estimates
the mean, I can use mi estimate: regress to get multiply imputed means. If I wish to
get multiply imputed quantiles, I can use mi estimate: qreg or mi estimate: sqreg
for this purpose.
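With complete data, this equivalence is easy to verify; the short check below (ours, not part of the imputation workflow) fits a constant-only regression and compares the constant with the sample mean:
. sysuse auto, clear
. quietly summarize price
. display r(mean)
. regress price, noheader
The coefficient on _cons reproduces the mean, and its standard error is the usual standard error of the mean.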
© 2010 StataCorp LP st0205
I now create a dataset with missing values of price:
. clonevar newprice = price
. set seed 19670221
. replace newprice = . if runiform() < .4
(32 real changes made, 32 to missing)
The following commands were generated from the multiple-imputation dialog box. I
used 20 imputations. Before Stata 11, this could also be done with the user-written commands ice and mim (Royston 2004, 2005a,b, 2007; Royston, Carlin, and White 2009).
. mi set mlong
. mi register imputed newprice
(32 m=0 obs. now marked as incomplete)
. mi register regular mpg trunk weight length
. mi impute regress newprice, add(20) rseed(3252010)
Univariate imputation Imputations = 20
Linear regression added = 20
Imputed: m=1 through m=20 updated = 0
Observations per m
Variable complete incomplete imputed total
newprice 42 32 32 74
(complete + incomplete = total; imputed is the minimum across m
of the number of filled in observations.)
. mi estimate: regress newprice
Multiple-imputation estimates Imputations = 20
Linear regression Number of obs = 74
Average RVI = 1.3880
Complete DF = 73
DF: min = 19.46
avg = 19.46
DF adjustment: Small sample max = 19.46
F( 0, .) = .
Within VCE type: OLS Prob > F = .
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
_cons 5693.489 454.9877 12.51 0.000 4742.721 6644.258
From this output, we see that the estimated mean is 5,693 with a standard error
of 455 (rounded up) compared with the complete data value of 6,165 with a standard
error of 343 (also rounded up). However, we do not have estimates of quantiles. This
could also have been done using mi estimate: mean newprice (the mean command is
near the bottom of the estimation command list for mi estimate).
498 Stata tip 89
We can apply the same principle using qreg. For the 10th percentile, type
. mi estimate: qreg newprice, quantile(10)
Multiple-imputation estimates Imputations = 20
.1 Quantile regression Number of obs = 74
Average RVI = 0.2901
Complete DF = 73
DF: min = 48.05
avg = 48.05
DF adjustment: Small sample max = 48.05
F( 0, .) = .
Prob > F = .
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
_cons 3495.635 708.54 4.93 0.000 2071.058 4920.212
Compare the value of 3,496 with the value of 3,895 from the full data. We can use
the simultaneous estimates command for the full set:
. mi estimate: sqreg newprice, quantiles(10 25 50 75 90) reps(20)
Multiple-imputation estimates Imputations = 20
Simultaneous quantile regression Number of obs = 74
Average RVI = 0.6085
Complete DF = 73
DF adjustment: Small sample DF: min = 23.19
avg = 26.97
max = 31.65
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
q10
_cons 3495.635 533.5129 6.55 0.000 2408.434 4582.836
q25
_cons 4130.037 237.1932 17.41 0.000 3642.459 4617.614
q50
_cons 5200.238 441.294 11.78 0.000 4292.719 6107.757
q75
_cons 6620.232 778.8488 8.50 0.000 5025.49 8214.974
q90
_cons 8901.985 1417.022 6.28 0.000 5971.962 11832.01
3 Comments and cautions
The qreg command does not give the same result as the centile command when
you have complete data. This is because the centile command uses one observation,
while the qreg command uses a weighted combination of the observations. It will have
somewhat shorter confidence intervals, but with large datasets, the difference will be small. A second caution is that comparing two medians can be tricky: the difference of two medians is not the median difference of the distributions. I have found it useful
to use percentiles because there is a one-to-one relationship between percentiles if data
are transformed. In our case, there is plentiful evidence that price is not normally
distributed, so it would be good to look for a transformation and impute those values.
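With complete data, the two commands can be put side by side; the following lines (ours, for illustration) contrast them at the 10th percentile of price:
. sysuse auto, clear
. centile price, centile(10)
. qreg price, quantile(10)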
This method of using regression commands without an independent variable can provide estimates of quantities that otherwise would be difficult to obtain. For example, it is much faster than finding 20 imputed percentiles and then combining them with Rubin's rules, and it is much less onerous and prone to error.
4 Acknowledgment
This work was supported in part by a grant from the Cure JM Foundation.
References
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA:
Duxbury.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.
Royston, P. 2005a. Multiple imputation of missing values: Update. Stata Journal 5: 188–201.
Royston, P. 2005b. Multiple imputation of missing values: Update of ice. Stata Journal 5: 527–536.
Royston, P. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7: 445–464.
Royston, P., J. B. Carlin, and I. R. White. 2009. Multiple imputation of missing values: New features for mim. Stata Journal 9: 252–264.
StataCorp. 2009. Stata 11 Multiple-Imputation Reference Manual. College Station, TX:
Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 500–502
Stata tip 90: Displaying partial results
Martin Weiss
Department of Economics
Tübingen University
Tübingen, Germany
martin.weiss@uni-tuebingen.de
Stata provides several features that allow users to display only part of their results.
If, for instance, you merely wanted to inspect the analysis of variance table returned by
anova or the coefficients returned by regress, you could instruct Stata to omit other
results:
. sysuse auto
(1978 Automobile Data)
. regress weight length price, notable
Source SS df MS Number of obs = 74
F( 2, 71) = 385.80
Model 40378658.3 2 20189329.2 Prob > F = 0.0000
Residual 3715520.06 71 52331.2685 R-squared = 0.9157
Adj R-squared = 0.9134
Total 44094178.4 73 604029.841 Root MSE = 228.76
. regress weight length price, noheader
weight Coef. Std. Err. t P>|t| [95% Conf. Interval]
length 30.60949 1.333171 22.96 0.000 27.95122 33.26776
price .042138 .0100644 4.19 0.000 .0220702 .0622058
_cons -2992.848 232.1722 -12.89 0.000 -3455.786 -2529.91
Other examples of this type can be found in the help files for xtivreg for its first-stage results and for xtmixed for its random-effects and fixed-effects table. Generally, to check whether Stata does provide such options, you would look for them under the heading Reporting in the respective help files.
If you want to further customize output to your own needs, you could use the
estimates table command; see [R] estimates table. It is part of the comprehensive
estimates suite of commands that save and manipulate estimation results in Stata. See
[R] estimates or Baum (2006, sec. 4.4), where user-written alternatives are introduced
as well.
estimates table can provide several benefits to the user. For one, you can restrict output to selected coefficients or equations with its keep() and drop() options.
© 2010 StataCorp LP st0206
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price)
Variable active
turn 35.214901
price .04624804
The original output of the estimation command itself is suppressed with quietly; see [P] quietly. The keep() option also changes the order of the coefficients according to your wishes. Additionally, you can elect to have Stata display results in a specific format, for example, with fewer or more decimal places. The format can differ between the elements that you choose to put into the table. In the case shown below, the coefficients have three decimal places, while the standard error and the p-value have two decimal places:
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price) b(%9.3fc) se(%9.2fc) p(%9.2fc)
Variable active
turn 35.215
11.65
0.00
price 0.046
0.01
0.00
legend: b/se/p
estimates table can also deal with models featuring multiple equations. If you
want to omit the coefficients for weight and the constant from every equation of your
sureg model, you could type
. sysuse auto
(1978 Automobile Data)
. qui sureg (price foreign weight length turn) (mpg foreign weight turn)
. estimates table, drop(weight _cons)
Variable active
price
foreign 3320.6181
length -78.75447
turn -144.37952
mpg
foreign -2.0756325
turn -.23516574
If your interest rests in the entire first equation and, say, the weight coefficient from the second equation, you would prepend coefficients with the equation names and separate the two with a colon. The names of equations and coefficients are more accessible in Stata 11 with the coeflegend option, which is accepted by most estimation commands.
. sureg, coeflegend noheader
Coef. Legend
price
foreign 3320.618 _b[price:foreign]
weight 6.04491 _b[price:weight]
length -78.75447 _b[price:length]
turn -144.3795 _b[price:turn]
_cons 7450.657 _b[price:_cons]
mpg
foreign -2.075632 _b[mpg:foreign]
weight -.0055959 _b[mpg:weight]
turn -.2351657 _b[mpg:turn]
_cons 48.13492 _b[mpg:_cons]
. estimates table, keep(price: mpg:weight)
Variable active
price
foreign 3320.6181
weight 6.0449101
length -78.75447
turn -144.37952
_cons 7450.657
mpg
weight -.00559588
See help estimates table to learn more about the syntax.
Reference
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College
Station, TX: Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 503–504
Stata tip 91: Putting unabbreviated varlists into local
macros
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
n.j.cox@durham.ac.uk
Within interactive sessions, do-files, or programs, Stata users often want to work with varlists, lists of variable names. For convenience, such lists may be stored in local macros. Local macros can be directly defined for later use, as in
. local myx "weight length displacement"
. regress mpg `myx'
However, users frequently want to put longer lists of names into local macros, spelled out one by one so that some later command can loop through the list defined by the macro. Such varlists might be indirectly defined in abbreviations using the wildcard characters * or ?. These characters can be used alone or can be combined to express ranges. For example, specifying * catches all variables, *TX* might define all variables for Texas, and *200? catches the years 2000–2009 used as suffixes.
In such cases, direct definition may not appeal for all the obvious reasons: it is
tedious, time-consuming, and error-prone. It is also natural to wonder if there is a
better method. You may already know that foreach (see [P] foreach) will take such
wildcarded varlists as arguments, which solves many problems.
Many users know that pushing an abbreviated varlist through describe or ds is
one way to produce an unabbreviated varlist. Thus
. describe, varlist
is useful principally for its side effect of leaving all the variable names in r(varlist).
ds is typically used in a similar way, as is the user-written findname command (Cox
2010).
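For example, with auto.dta the describe route might look as follows (a sketch; tvars is our name for the macro):
. sysuse auto, clear
. quietly describe t*, varlist
. local tvars `r(varlist)'
. display "`tvars'"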
However, if the purpose is just to produce a local macro, the method of using describe or ds has some small but definite disadvantages. First, the output of each may not be desired, although it is easily suppressed with a quietly prefix. Second, the modus operandi of both describe and ds is to leave saved results as r-class results. Every now and again, users will be frustrated by this when they unwittingly overwrite r-class results that they wished to use again. Third, there is some inefficiency in using either command for this purpose, although you would usually have to work hard to measure it.
© 2010 StataCorp LP dm0051
The solution here is to use the unab command; see [P] unab. unab has just one
restricted role in life, but that role is the solution here. unab is billed as a programming
command, but nothing stops it from being used interactively as a simple tool in data
management. The simple examples
. unab myvars : *
. unab TX : *TX*
. unab twenty : *200?
show how a local macro, named at birth (here as myvars, TX, and twenty), is defined
as the unabbreviated equivalent of each argument that follows a colon. Note that using
wildcard characters, although common, is certainly not compulsory.
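Once defined, such a macro can drive a loop in the usual way; a small sketch (ours) with auto.dta:
. sysuse auto, clear
. unab myvars : *
. foreach v of local myvars {
  2.     quietly count if missing(`v')
  3.     display "`v': " r(N) " missing value(s)"
  4. }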
The word unabbreviate is undoubtedly ugly. The help and manual entry do also
use the much simpler and more attractive word expand, but the word expand was
clearly not available as a command name, given its use for another purpose.
This tip skates over all the fine details of unab, and only now does it mention the
sibling commands tsunab and fvunab, for use when you are using time-series operators
and factor variables. For more information, see [P] unab.
Reference
Cox, N. J. 2010. Speaking Stata: Finding variables. Stata Journal 10: 281–296.
The Stata Journal (2010)
10, Number 3, p. 505
Software Updates
st0140_2: fuzzy: A program for performing qualitative comparative analyses (QCA) in Stata. K. C. Longest and S. Vaisey. Stata Journal 8: 452; 79–104.
A typo has been fixed in the setgen subcommand. Specifically, the drect extension was not calculating values below the middle anchor correctly because of the typo. This has been fixed and drect is now operating correctly. Note that no other aspects of the setgen command have been altered.
© 2010 StataCorp LP up0029