E\{f(x)\} = \sum_{i=1}^{n} p_i f(x_i)
For concreteness, we consider a die known to have E(x) = 3.5, where x = (1, 2, 3, 4, 5, 6), and we want to determine the associated probabilities. Clearly, there are infinitely many possible solutions, but the obvious one is p_1 = p_2 = \cdots = p_6 = 1/6. The obviousness is based on Laplace's principle of insufficient reason, which states that two events should be assigned equal probability unless there is a reason to think otherwise (Jaynes 1957, 622). This negative reason is not much help if, instead, we know that E(x) = 4.
Jaynes's solution was to tackle this from the point of view of Shannon's information theory. Jaynes wanted a criterion function H(p_1, p_2, \ldots, p_n) that would summarize the uncertainty about the distribution. This is given uniquely by the entropy measure

H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln(p_i)
where p_i \ln(p_i) is defined to be zero if p_i = 0, for some positive constant K. The solution to Jaynes's problem is to pick the distribution (p_1, p_2, \ldots, p_n) that maximizes the entropy, subject only to the constraints
E\{f(x)\} = \sum_i p_i f(x_i)

\sum_i p_i = 1
As Golan, Judge, and Miller (1996, 8-10) show, if our knowledge of E\{f(x)\} is based on the outcome of N (very large) trials, then the distribution function p = (p_1, p_2, \ldots, p_n) that maximizes the entropy measure is the distribution that can give rise to the observed outcomes in the greatest number of ways consistent with what we know. Any other distribution requires more information to justify it. Degenerate distributions, ones where p_i = 1 and p_j = 0 for j \neq i, have entropy of zero. That is to say, they correspond to zero uncertainty and therefore maximal information.
2 Maximum entropy and minimum cross-entropy estimation
More formally, the maximum entropy problem can be represented as

\max_{p} H(p) = -\sum_{i=1}^{n} p_i \ln(p_i)

such that

y_j = \sum_{i=1}^{n} X_{ji} p_i, \qquad j = 1, \ldots, J    (1)

\sum_{i=1}^{n} p_i = 1    (2)
The J constraints given in (1) can be thought of as moment constraints, with y_j being the population mean of the X_j random variable. To solve this problem, we set up the Lagrangian function

L = -p' \ln(p) - \lambda'(X'p - y) - \mu(p'1 - 1)

where X is the n \times J data matrix,¹ \lambda is a vector of Lagrange multipliers, and 1 is a column vector of ones.
The first-order conditions for an interior solution, that is, one in which the vector p is strictly positive, are given by

\partial L/\partial p = -\ln(p) - 1 - X\lambda - \mu 1 = 0    (3)

\partial L/\partial \lambda = y - X'p = 0    (4)

\partial L/\partial \mu = 1 - p'1 = 0    (5)
These equations can be solved for \lambda, and the solution for p is given by

p = \exp(-X\lambda)/\Omega(\lambda)

where

\Omega(\lambda) = \sum_{i=1}^{n} \exp(-x_i \lambda)

and x_i is the ith row vector of the matrix X.
The maximum entropy framework can be extended to incorporate prior information about p. Assuming that we have the prior probability distribution q = (q_1, q_2, \ldots, q_n), the cross-entropy is defined as (Golan, Judge, and Miller 1996, 11)

I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = p' \ln(p) - p' \ln(q)

The cross-entropy can be thought of as a measure of the additional information required to go from the distribution q to the distribution p. The principle of minimum cross-entropy asserts that we should pick the distribution p that meets the moment constraints (1) and the normalization restriction (2) while requiring the least additional information; that is, we should pick the one that is in some sense closest to q. Formally, we minimize I(p, q) subject to those restrictions. Maximum entropy estimation is merely a variant of minimum cross-entropy estimation in which the prior q is the uniform distribution (1/n, 1/n, \ldots, 1/n).
1. In the Golan, Judge, and Miller (1996) book, the constraint is written as y = Xp, where X is J \times n. For the applications considered below, it is more natural to write the data matrix in the form shown here.
The solution of this problem is given by (Golan, Judge, and Miller 1996, 29)

p_i = q_i \exp(-x_i \lambda)/\Omega(\lambda)    (6)

where

\Omega(\lambda) = \sum_{i=1}^{n} q_i \exp(-x_i \lambda)    (7)
The most efficient way to calculate the estimates is, in fact, not by numerical solution of the first-order conditions [along the lines of (3), (4), and (5)] but by the unconstrained maximization of the dual problem, as discussed further in section 3.5.
3 The maxentropy command
3.1 Syntax
The syntax of the maxentropy command is

maxentropy [constraint] varlist [if] [in], generate(varname[, replace]) [prior(varname) log total(#) matrix(matrix)]
The maxentropy command must identify the set of population constraints contained in the y vector. These population constraints can be specified either as constraint or as matrix in the matrix() option. If neither of these optional arguments is specified, it is assumed that varlist is y and then X.
The command requires that a varname be specified in the generate() option, in which the estimated p vector will be returned.
3.2 Description
maxentropy provides minimum cross-entropy or maximum entropy estimates of ill-posed inverse problems, such as the Jaynes dice problem. The command can also be used to calibrate survey datasets to external totals along the lines of the multiplicative method implemented in the SAS CALMAR macro (Deville and Särndal 1992; Deville, Särndal, and Sautory 1993). This is a generalization of iterative raking as implemented, for instance, in Nick Winter's survwgt command, which is available from the Statistical Software Components archive (type net search survwgt).
3.3 Options
generate(varname[, replace]) specifies the variable in which the estimated p vector is returned; replace permits an existing variable to be overwritten.

\max_{\lambda} -\sum_{j=1}^{J} \lambda_j y_j - \ln\{\Omega(\lambda)\} = M(\lambda)    (8)
where \Omega(\lambda) is given by (7). Golan, Judge, and Miller show that this function behaves like a maximum likelihood. In this case,

\partial M(\lambda)/\partial \lambda = -(y - X'p)    (9)

so that the constraint is met at the point where the gradient is zero. Furthermore,

\partial^2 M/\partial \lambda_j^2 = -\left\{ \sum_{i=1}^{n} p_i x_{ji}^2 - \left( \sum_{i=1}^{n} p_i x_{ji} \right)^2 \right\} = -\mathrm{var}(x_j)    (10)
\partial^2 M/\partial \lambda_j \partial \lambda_k = -\left\{ \sum_{i=1}^{n} p_i x_{ji} x_{ki} - \left( \sum_{i=1}^{n} p_i x_{ji} \right)\left( \sum_{i=1}^{n} p_i x_{ki} \right) \right\} = -\mathrm{cov}(x_j, x_k)    (11)
where the variances and covariances are taken with respect to the distribution p. The negative of the Hessian of M is therefore guaranteed to be positive definite, which guarantees a unique solution provided that the constraints are not inconsistent.
Golan, Judge, and Miller (1996, 25) note that the function M can be thought of as an expected log likelihood, given the exponential family p(\lambda) parameterized by \lambda. Along these lines, we use Stata's maximum likelihood routines to estimate \lambda, giving it the dual objective function [(8)], gradient [(9)], and negative Hessian [(10) and (11)]. The routine that calculates these is contained in maxentlambda_d2.ado. Because of the globally concave nature of the objective function, convergence should be relatively quick, provided that there is a feasible solution in the interior of the parameter space. The command checks for some obvious errors; for example, the population means (y_j) must be inside the range of the X_j variables. If any mean is on the boundary of the range, then a degenerate solution is feasible, but the corresponding Lagrange multiplier will be infinite, so the algorithm will not converge.
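That interior-range requirement is easy to verify by hand before estimation; for instance, for a hypothetical variable x and a target mean of 4:

quietly summarize x
assert r(min) < 4 & 4 < r(max)   // the target mean must lie strictly inside the range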
Once the estimates of \lambda have been obtained, estimates of p are derived from (6).
3.6 Saved results
maxentropy saves the following in e():

Macros
    e(cmd)          maxentropy
    e(properties)   b V

Matrices
    e(b)            coefficient estimates
    e(V)            inverse of negative Hessian
    e(constraint)   constraint vector

Functions
    e(sample)       marks estimation sample
3.7 A cautionary note
The estimation routine treats \lambda as though it were estimated by maximum likelihood. This is true only if we can write p as

p \propto \exp(-X\lambda)

Given that assumption, we could test hypotheses on the parameters. Because the estimation routine calculates the inverse of the negative of the Hessian (that is, the asymptotic covariance matrix of \hat{\lambda} under this parametric assumption), it would be possible to implement such tests. For most practical applications, this parametric interpretation of the procedure is likely to be dubious.
4 Sample applications
4.1 Jaynes's die problem

In section 3.4, I showed how to calculate the probability distribution given that y = 4. The following code generates predictions given different values for y:
matrix y=(2)
maxentropy x, matrix(y) generate(p2)
matrix y=(3)
maxentropy x, matrix(y) generate(p3)
matrix y=(3.5)
maxentropy x, matrix(y) generate(p35)
matrix y=(5)
maxentropy x, matrix(y) generate(p5)
list p2 p3 p35 p4 p5, sep(10)
The impact of different prior information on the estimated probabilities is shown in the following table:
. list p2 p3 p35 p4 p5, sep(10)
p2 p3 p35 p4 p5
1. .4781198 .2467824 .1666667 .1030653 .0205324
2. .254752 .2072401 .1666667 .1227305 .0385354
3. .135737 .1740337 .1666667 .146148 .0723234
4. .0723234 .146148 .1666667 .1740337 .135737
5. .0385354 .1227305 .1666667 .2072401 .2547519
6. .0205324 .1030652 .1666667 .2467824 .4781198
Note in particular that when we set y = 3.5, the command returns the uniform discrete distribution with p_i = 1/6.
We can see the impact of adding in a second constraint by considering the same problem given the population moments

y = \begin{pmatrix} 3.5 \\ \sigma^2 \end{pmatrix}

for different values of \sigma^2. By definition in this case, \sigma^2 = \sum_{i=1}^{6} p_i (x_i - 3.5)^2. We can therefore create the values (x_i - 3.5)^2 and consider which probability distribution p = (p_1, p_2, \ldots, p_6) will generate both a mean of 3.5 and a given value of \sigma^2. The code to run this is
generate dev2=(x-3.5)^2
matrix y=(3.5 \ (2.5^2/3+1.5^2/3+0.5^2/3))
maxentropy x dev2, matrix(y) generate(pv)
matrix y=(3.5 \ 1)
maxentropy x dev2, matrix(y) generate(pv1)
matrix y=(3.5 \ 2)
maxentropy x dev2, matrix(y) generate(pv2)
matrix y=(3.5 \ 3)
maxentropy x dev2, matrix(y) generate(pv3)
matrix y=(3.5 \ 4)
maxentropy x dev2, matrix(y) generate(pv4)
matrix y=(3.5 \ 5)
maxentropy x dev2, matrix(y) generate(pv5)
matrix y=(3.5 \ 6)
maxentropy x dev2, matrix(y) generate(pv6)
with the following final result:
. list pv1 pv2 pv pv3 pv4 pv5 pv6, sep(10) noobs
pv1 pv2 pv pv3 pv4 pv5 pv6
.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601
.1316041 .1719114 .1666667 .1651027 .1358892 .0896692 .0234196
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.3497639 .2395591 .1666667 .1607649 .0969072 .0443872 .0052203
.1316041 .1719113 .1666667 .1651026 .1358892 .0896692 .0234196
.018632 .0885296 .1666667 .1741325 .2672036 .3659436 .4713601
The probabilities behave as we would expect: in the case where \sigma^2 = 35/12, we get the uniform distribution. With variances smaller than this, the probability distribution puts more emphasis on the values 3 and 4, while with higher variances the distribution becomes bimodal, with greater probability attached to the extreme values. This output does not reveal that in all cases the \lambda_1 estimate is basically zero. The reason for this is that with a symmetrical distribution of x_i values around the population mean, the mean is no longer informative, and all the information about the distribution of p derives from the second constraint. If we force p_4 = p_5 = 0 so that the distribution is no longer symmetrical, the first constraint becomes informative, as shown in this output:
. maxentropy x dev2 if x!=5&x!=4, matrix(y) generate(p5, replace)
Cross entropy estimates
Variable lambda
x .0119916
dev2 .59568007
p values returned in p5
constraints given in matrix y
. list x p5 if e(sample), noobs
x p5
1 .4578909
2 .0427728
3 .0131515
6 .4861848
This example shows how to overwrite an existing variable and demonstrates that the
command allows if and in qualifiers. It also shows how to use the e(sample) function.
4.2 Calibrating a survey
The basic point of calibration is to adjust the sampling weights so that the marginal
totals in dierent categories correspond to the population totals. Typically, the ad-
justments are made on demographic (for example, age and gender) and spatial vari-
ables. Early approaches included iterative raking procedures (Deming and Stephan
1940). These were generalized in the CALMAR routines described in Deville and S arndal
(1992). The idea of using a minimum information loss criterion for this purpose is not
original (see, for instance, Merz and Stolze [2008]), although it does not seem to have
been appreciated that the procedure leads to identical estimates as iterative raking-ratio
adjustments, if those adjustments are iterated to convergence.
The major advantage of using the cross-entropy approach rather than raking is that it becomes straightforward to incorporate constraints that do not include marginal totals. In many household surveys, for instance, it is plausible that mismatches between the sample and the population arise due to differential success in sampling household types rather than in enumerating individuals within households. Under these conditions, it makes sense to require that all raising weights within a household be identical. I give an example below that shows how cross-entropy estimation with such a constraint can be feasibly implemented.

These capacities also exist within other calibration macros and commands. The advantage of the maxentropy command is that it can do so within Stata, and it is fairly easy and quick to use.
To demonstrate these possibilities, we load example1.dta, which contains a hypothetical survey with a set of prior weights. The sum of these weights by stratum and gender is given in table 1, where we have also indicated the population totals to which the weights should gross.

Table 1. Sum of weights from example1.dta by stratum and gender, and the population totals to which the weights should gross

                        gender
  stratum          0        1    Margin   Required
  0              100      400       500       1600
  1              300      200       500        400
  Margin         400      600      1000
  Required      1200      800                 2000
The weights can be adjusted to these totals by using the downloadable survwgt command. To use the maxentropy command, we need to convert the desired constraints from population totals into population means. That is straightforward because

N = \sum_{i=1}^{n} w_i    (12)

N_{gender=0} = \sum_{i=1}^{n} w_i \, 1(gender = 0)    (13)

where 1(gender = 0) is the indicator function. So dividing everything by N, the population total, we get a set of constraints that look identical to those used earlier:
1 = \sum_{i=1}^{n} \frac{w_i}{N} = \sum_{i=1}^{n} p_i

\Pr(gender = 0) = \sum_{i=1}^{n} \frac{w_i}{N} \, 1(gender = 0) = \sum_{i=1}^{n} p_i \, 1(gender = 0)
We could obviously add a condition for the proportion where gender = 1, but because of the adding-up constraint, that would be redundant. If we have k categories for a particular variable, we can use only k - 1 constraints in our estimation.
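For example, the population proportions implied by the "Required" totals in table 1 can be computed directly:

display 400/2000    // Pr(stratum = 1) = .2
display 800/2000    // Pr(gender = 1)  = .4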
In this particular example, the constraint vector is contained in the constraint
variable. The syntax of the command in this case is
maxentropy constraint stratum gender, generate(wt3) prior(weight) total(2000)
We did not specify a matrix, so the first variable is interpreted as the constraint vector. We did specify a prior weight, and we asked Stata to convert the calculated probabilities to raising weights by multiplying them by 2,000. A comparison with the raked weights confirms them to be identical in this case.
We can check whether the constraints were correctly rendered by retrieving the
constraint matrix used in the estimation:
. matrix C=e(constraint)
. matrix list C
C[2,1]
c1
stratum .2
gender .40000001
We see that E(stratum) = 0.2 and E(gender) = 0.4. Means of dummy variables
are, of course, just population proportions; that is, the proportion in stratum = 1 is
0.2 and the proportion where gender = 1 is 0.4.
4.3 Imposing constant weights within households
In most household surveys, the household is the unit that is sampled and the individuals
are enumerated within it. Consequently, the probability of including an individual
conditional on the household being selected is 1. This suggests that the weight attached
to every individual within a household should be equal. We can impose this restriction
with a fairly simple ploy. We rewrite constraint (12) by rst summing over individuals
within the household (hhsize) and then summing over households as
N =
i
w
ih
=
h
hhsize
h
w
h
that is,
N =
h
w
h
where w
ih
is the weight of individual i within household h, equal to the common weight
w
h
. This constraint can again be written in the form of probabilities as
1 =
h
w
h
N
that is,
1 =
h
p
h
Consider now any other constraint involving individual aggregates [for example, (13)]:

N_x = \sum_{i=1}^{n} w_i x_i = \sum_i w_{ih} x_{ih} = \sum_h w_h \left( \sum_i x_{ih} \right)

\frac{N_x}{N} = \sum_h \frac{w_h \times hhsize_h}{N} \times \frac{\sum_i x_{ih}}{hhsize_h}

Consequently,

E(x) = \sum_h p_h^* m_{xh}    (14)

The term m_{xh} is just the mean of the x variable within household h.
If the prior weight q_h is similarly constant within households (as it should be if it is a design weight), then we similarly create a new variable

q_h^* = hhsize_h \times q_h
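In Stata, these household-level quantities might be constructed along the following lines; this is a sketch, and the variable names hhid, x, and q are hypothetical:

* one record per individual: household id (hhid), variable x, design weight q
bysort hhid: generate hhsize = _N
collapse (mean) m_x = x (first) hhsize q, by(hhid)   // household-level means
generate q_star = hhsize * q                         // adjusted prior q*_h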
We can then write the cross-entropy objective function as

I(p, q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right) = \sum_i p_{ih} \ln\left(\frac{p_{ih}}{q_{ih}}\right)

= \sum_i p_h \ln\left(\frac{p_h \times hhsize_h}{q_h \times hhsize_h}\right) = \sum_i p_h \ln\left(\frac{p_h}{q_h}\right)

= \sum_h hhsize_h \times p_h \ln\left(\frac{p_h}{q_h}\right) = \sum_h p_h^* \ln\left(\frac{p_h^*}{q_h^*}\right)
In short, the objective function evaluated over all individuals, imposing the constraint p_{ih} = p_h for all i, is identical to the objective function evaluated over households where the probabilities have been adjusted to p_h^* and q_h^*. We therefore run the maxentropy command on a household-level file, with the population constraints given by (14). Our cross-entropy estimates can then be retrieved as

p_h = \frac{p_h^*}{hhsize_h}
We can check that the weights obtained in this way do, in fact, obey all the restrictions: they are obviously constant within households, and when added up over the individuals, they reproduce the required totals.
4.4 Calibrating the South African National Income Dynamics Survey
To assess the performance of the maxentropy command on a more realistic problem, we consider the problem of calibrating South Africa's National Income Dynamics Survey. This was a nationally representative sample of around 7,300 households and around 30,000 individuals. From the sampling design, a set of design weights was calculated, but application of these weights to the realized sample led to a severe undercount when compared with the official population estimates.

The calibration was to be done to reproduce the nine provincial population counts and 136 age × sex × race cell totals. One practical difficulty that was immediately encountered was how to treat individuals whose age, sex, or race information was missing, because this category does not exist in the national estimates. It was decided to keep the relative weights of the missing observations constant through the calibration, creating a 137th age, sex, and race category. From each group of dummy variables, one category had to be omitted, creating altogether 144 (or 8 + 136) constraints.
hhcollapsed.dta contains household-level means of all these variables plus the
household design weights. The code to create cross-entropy weights that are constant
within households is given by the following:
use hhcollapsed
maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)
replace hw=hw/hhsize
matrix list e(constraint)
With 144 constraints and 7,305 observations, the command took 18 seconds to cal-
culate the new weights on a standard desktop computer.
In this context, the estimates prove informative. The output of the command is
. maxentropy constraint P1-WFa80, prior(q) generate(hw) total(48687000)
Cross entropy estimates
Variable lambda
P1 -.15945276
P2 .00735986
P3 .14000206
(output omitted )
IMa75 15.402056
IMa80 8.6501559
IFa_0 -7.0753612
IFa_5 2.3584972
(output omitted )
IFa75 -9.2778495
IFa80 14.142518
(output omitted )
WFa70 .05009103
WFa75 .90961156
WFa80 4.6868009
p values returned in hw
constraints given in variable constraints
The huge coefficients for old Indian males and old Indian females suggest that the population constraints affected the weights for these categories substantially. Given the large number of constraints, mistakes are possible. The easiest way to check that the command has worked correctly is to add up the weights within categories and to check that they add up to the intended totals. Listing the constraint matrix used by the command is also a useful check. In this case, the labeling of the rows does help:
. matrix list e(constraint)
e(constraint)[144,1]
c1
P1 .10803039
P2 .13514069
P3 .02320805
P4 .05914972
P5 .20764017
P6 .07044568
P7 .21462312
P8 .07373177
AMa_0 .04486157
AMa_5 .04584822
(output omitted )
WFa75 .0012318
WFa80 .00147087
The first eight constraints are the province proportions, followed by the proportions in the age, sex, and race cells.
5 Conclusion
This article introduced the power of maximum entropy and minimum cross-entropy estimation. The maxentropy command uses Stata's powerful maximum-likelihood estimation routines to provide fast estimates of even complicated problems. I have shown how the command can be used to calibrate a survey to a set of known population totals while imposing restrictions like constant weights within households.
6 References
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sample frequency table when the expected marginal totals are known. Annals of Mathematical Statistics 11: 427–444.

Deville, J.-C., and C.-E. Särndal. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376–382.

Deville, J.-C., C.-E. Särndal, and O. Sautory. 1993. Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88: 1013–1020.

Golan, A., G. G. Judge, and D. Miller. 1996. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Chichester, UK: Wiley.

Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review 106: 620–630.

Merz, J., and H. Stolze. 2008. Representative time use data and new harmonised calibration of the American Heritage Time Use Data 1965–1999. electronic International Journal of Time Use Research 5: 90–126.

Mittelhammer, R. C., G. G. Judge, and D. J. Miller. 2000. Econometric Foundations. Cambridge: Cambridge University Press.
About the author
Martin Wittenberg teaches core econometrics and microeconometrics to graduate students in
the Economics Department at the University of Cape Town.
The Stata Journal (2010)
10, Number 3, pp. 331338
bacon: An effective way to detect outliers in
multivariate data using Stata (and Mata)
Sylvain Weber
University of Geneva
Department of Economics
Geneva, Switzerland
sylvain.weber@unige.ch
Abstract. Identifying outliers in multivariate data is computationally intensive.
The bacon command, presented in this article, allows one to quickly identify out-
liers, even on large datasets of tens of thousands of observations. bacon constitutes
an attractive alternative to hadimvo, the only other command available in Stata
for the detection of outliers.
Keywords: st0197, bacon, hadimvo, outliers detection, multivariate outliers
1 Introduction
The literature on outliers is abundant, as proved by Barnett and Lewis's (1994) bibliography of almost 1,000 articles. Despite this considerable research by the statistical community, knowledge apparently fails to spill over, so proper methods for detecting and handling outliers are seldom used by practitioners in other fields.

The reason is likely that algorithms implemented for the detection of outliers are scarce. Moreover, the few algorithms available are so time-consuming that using them may be discouraging. Until now, hadimvo was the only command available in Stata for identifying outliers. Anyone who has tried to use hadimvo on large datasets, however, knows it may take hours or even days to obtain a mere dummy variable indicating which observations should be considered as outliers.
The new bacon command, presented in this article, provides a more efficient way to detect outliers in multivariate data. It is named for the blocked adaptive computationally efficient outlier nominators (BACON) algorithm proposed by Billor, Hadi, and Velleman (2000). bacon is a simple modification of the methodology proposed by Hadi (1992, 1994) and implemented in hadimvo, but bacon is much less computationally intensive. As a result, bacon runs many times faster than hadimvo, even though both commands end up with similar sets of outliers. Identifying multivariate outliers thus becomes fast and easy in Stata, even with large datasets of tens of thousands of observations.
2 The BACON algorithm
The BACON algorithm was proposed by Billor, Hadi, and Velleman (2000). The reader
who is interested in details is referred to that original article, because only a brief
presentation is provided here.
In step 1, an initial subset of m outlier-free observations has to be identified out of a sample of n observations on p variables. Any of several distance measures could be used as a criterion, and the Mahalanobis distance seems especially well adapted. It possesses the desirable property of being scale invariant, a great advantage when dealing with variables of different magnitudes or with different units. The Mahalanobis distance of a p-dimensional vector x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T from a group of values with mean \bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^T and covariance matrix S is defined as

d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^T S^{-1}(x_i - \bar{x})}, \qquad i = 1, 2, \ldots, n
The initial basic subset is given by the m observations with the smallest Mahalanobis distances from the whole sample. The subset size m is given by the product of the number of variables p and a parameter chosen by the analyst.

Billor, Hadi, and Velleman (2000) also proposed using distances from the medians for this first step. This second version of the algorithm is also implemented in bacon. Distances from the medians are not scale invariant, so they should be used carefully if the variables analyzed are of different magnitudes.
In step 2, Mahalanobis distances from the basic subset are computed:

d_i(\bar{x}_b, S_b) = \sqrt{(x_i - \bar{x}_b)^T S_b^{-1}(x_i - \bar{x}_b)}, \qquad i = 1, 2, \ldots, n    (1)
In step 3, all observations with a distance smaller than some threshold, a corrected percentile of a \chi^2 distribution, are added to the basic subset.

Steps 2 and 3 are iterated until the basic subset no longer changes. Observations excluded from the final basic subset are nominated as outliers, whereas those inside the final basic subset are nonoutliers.
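The core computation of step 2 takes only a few lines of Mata; the following is an illustrative sketch, not the bacon source code:

mata:
// Mahalanobis distances of all rows of X from the current basic subset,
// flagged by the 0/1 column vector b, as in equation (1)
real colvector bacon_dist(real matrix X, real colvector b)
{
    real rowvector xbar
    real matrix    S, Z
    xbar = mean(select(X, b))         // subset mean
    S    = variance(select(X, b))     // subset covariance matrix
    Z    = X :- xbar                  // center every observation
    return(sqrt(rowsum((Z*invsym(S)) :* Z)))
}
end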
The difference from the algorithm proposed by Hadi (1992, 1994) is that observations are added to the basic subset in blocks instead of observation by observation. Thus some time is spared through a reduction of the number of iterations. Nevertheless, it is important to note that the performance of the algorithm is not altered, as Billor, Hadi, and Velleman (2000) and section 5 of this article show.

The reduction in the number of iterations is not the only source of efficiency gain. Another major improvement lies in the way bacon is coded. When hadimvo was implemented, Mata did not exist. Now, though, Mata provides significant speed enhancements to many computationally intensive tasks, like the calculation of Mahalanobis distances. I therefore coded bacon so that it benefits from Mata's power.
3 Why Mata matters for bacon
The bacon command uses Mata, the matrix programming language available in Stata since version 9. I explain here how Mata allows bacon to run very fast. This section draws heavily on Baum (2008), who offers a general overview of Mata's capabilities.

The BACON algorithm requires creating matrices from data, computing the distances using (1), and converting the new matrix containing the distances back into the data. Operations that convert Stata variables into matrices (or vice versa) require at least twice the memory needed for that set of variables, so it stands to reason that using Stata's matrices would consume a lot of memory. On the other hand, Mata's matrices are only views of, not copies of, the data. Hence, using Mata's virtual matrices instead of Stata's matrices in bacon spares memory that can be used to run the computations faster.

Moreover, Stata's matrices are unsuited for holding large amounts of data, their maximal size being 11,000 × 11,000. Using Stata, it would not be possible to create a matrix X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T containing all observations of the database if n were larger than 11,000. One would thus have to cut the X matrix into pieces to compute the distances in (1), which is obviously inconvenient. Mata circumvents the limitations of Stata's traditional matrix commands, thus allowing the creation of virtually infinite matrices (over 2 billion rows and columns). Thanks to Mata, I am thus able to create a single matrix X containing all observations, whatever n may be. I then use the powerful element-by-element operations available to compute the distances.

Mata is indeed efficient at handling element-by-element operations, whereas Stata ado-file code written in the matrix language with explicit subscript references is slow. Because the distances in (1) have to be computed for each individual at each iteration of the algorithm, this feature of Mata provides another important efficiency gain.
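A minimal illustration of the view mechanism (the variable names x1 and x2 are hypothetical; this is not code from bacon):

mata:
X = .                          // container for the view
st_view(X, ., ("x1", "x2"))    // X is a view onto the dataset, not a copy
X[1,1] = 99                    // so assigning to X changes the data directly
end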
4 The bacon command
4.1 Syntax
The syntax of bacon is as follows:

bacon varlist [if] [in], generate(newvar1 [newvar2]) [replace percentile(#) version(1 | 2) c(#)]
4.2 Options
generate(newvar1 [newvar2]) ...

... statistics.
Similarly, some contiguous groups of SNPs, often called haplotype blocks, may exhibit high levels of pairwise linkage disequilibrium (Gabriel et al. 2002; Goldstein 2001). High levels of linkage disequilibrium between two SNPs indicate that much of their statistical information is redundant, so both SNPs are not necessary for association analyses. One of the SNPs, called a tagSNP (Zhang et al. 2004), can be selected using one of several algorithms. A tagSNP can be used in place of the group of redundant SNPs. Typically, there are several tagSNPs in a group of contiguous SNPs found on a chromosome. Haploview (Barrett et al. 2005) is a popular software package used for calculating and visualizing the linkage disequilibrium statistics r^2 and D'.

PhaseOutputFile is the name of the PHASE output file that contains the inferred haplotypes. It will have the file extension .out.
3.2 Options
markers(filename) allows the user to specify an ASCII file that contains the names of the markers included in the haplotype. If the original genotype data were exported to PHASE using the phaseout command, the marker names will be automatically saved to a file named MarkerList.txt. If that is the case, then the option would be markers("MarkerList.txt"). Alternatively, the user can save a space-delimited list of marker names in an ASCII file and use the markers("filename.txt") option.

positions(filename) allows the user to specify an ASCII file that contains the positions of the markers. If the original genotype data were exported to PHASE using the phaseout command, the marker positions will be automatically saved to a file named PositionList.txt. If that is the case, then the option would be positions("PositionList.txt"). Alternatively, the user can save a space-delimited list of marker positions in an ASCII file and use the positions("filename.txt") option.
3.3 Examples
Using the default files created by phaseout:

. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")

Using the files created by the user:

. phasein VEGF.out, markers("UserMarkerList.txt") positions("UserPositionList.txt")
4 The haploviewout command
The haploviewout command exports haplotype data from Stata to a pair of files. The file filename_DataInput.txt contains the marker data for each individual, with the alleles recoded as follows: missing = 0, A = 1, C = 2, G = 3, and T = 4.
D001 D001 2 2 4
D001 D001 2 4 4
D002 D002 2 2 4
D002 D002 4 2 4
The file filename_MarkerInput.txt contains the marker names and positions in two columns:
rs1413711 674
rs3024987 836
rs3024989 1955
4.1 Syntax
haploviewout SNPlist, idvariable(varname) filename(filename) [positions(string) familyid(variable) poslabel]

SNPlist is a list of SNP variables in long format (that is, one row per chromosome). If your data are in wide format, you can convert them to long format by using the reshape command, as sketched below. Haploview will not accept multiallelic markers.
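For the wide-to-long conversion just mentioned, a minimal sketch with hypothetical variable names, assuming two allele variables per SNP in the wide data:

* wide: one row per individual, alleles rs1413711_1 and rs1413711_2
reshape long rs1413711_, i(id) j(chrom)   // two rows per individual
rename rs1413711_ rs1413711               // long format: one row per chromosome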
4.2 Options
idvariable(varname) is required to specify the variable that contains the individual identifiers.

filename(filename) is required to name the two ASCII files that will be created. Those files will have the extensions _DataInput.txt and _MarkerInput.txt appended to filename. For example, the filename("VEGF") option will create a file named VEGF_DataInput.txt and a file named VEGF_MarkerInput.txt. To open the files in Haploview, select File > Open new data and click on the tab labeled Haps Format. Click on the Browse button next to the box labeled Data File and select the file VEGF_DataInput.txt. Next click on the Browse button next to the box labeled Locus Information File and select the file VEGF_MarkerInput.txt.

positions(string) allows the user to specify a space-delimited list of the marker positions.

familyid(variable) allows the user to specify the variable that contains family identifiers if relatives are included in the dataset. If familyid() is omitted, the idvariable() will be automatically substituted for the familyid().

poslabel will automatically extract the SNP positions from the variable label of each SNP if the haplotype data were created using the commands phaseout and phasein. The positions for each marker are stored in the variable label of each SNP.
4.3 Examples
Using the default files created by phaseout:

. phaseout rs1413711 rs3024987 rs3024989, idvariable("id") filename("VEGF.inp")
> missing("X/X 9/9") positions("674 836 1955") separator("/")
. phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")
. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")
> poslabel

Using the files created by the user:

. haploviewout rs1413711 rs3024987 rs3024989, idvariable(id) filename("VEGF")
> positions("674 836 1955")
5 Discussion
Many young and rapidly evolving fields of inquiry, including genetic association studies, use a variety of boutique software packages. While it would be very convenient to have Stata commands that accomplish the same tasks, the time and programming expertise required make this an impractical option. However, a suite of commands that allows easy exporting and importing of data from Stata to other specialized software seems to be an efficient way for Stata users to accomplish specialized analytical tasks.
6 Acknowledgments
This work was supported in part by grant 1 R01 DK073618-02 from the National In-
stitute of Diabetes and Digestive and Kidney Diseases and by grant 2006-35205-16715
from the United States Department of Agriculture. The author would like to thank
Drs. Loren Skow, Krista Fritz, and Candice Brinkmeyer-Langford of the Texas A&M
College of Veterinary Medicine and Roger Newson of the Imperial College London for
their very useful feedback.
7 References
Akey, J., L. Jin, and M. Xiong. 2001. Haplotypes vs single marker linkage disequilibrium tests: What do we gain? European Journal of Human Genetics 9: 291–300.

Barrett, J. C., B. Fry, J. Maller, and M. J. Daly. 2005. Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265.

Cordell, H. J., and D. G. Clayton. 2005. Genetic association studies. Lancet 366: 1121–1131.

Devlin, B., and N. Risch. 1995. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311–322.

Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225–2229.

Goldstein, D. B. 2001. Islands of linkage disequilibrium. Nature Genetics 29: 109–111.

Marchenko, Y. V., R. J. Carroll, D. Y. Lin, C. I. Amos, and R. G. Gutierrez. 2008. Semiparametric analysis of case–control genetic data in the presence of environmental factors. Stata Journal 8: 305–333.

Marchini, J., D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S. Qin, H. M. Munro, G. R. Abecasis, P. Donnelly, and The International HapMap Consortium. 2006. A comparison of phasing algorithms for trios and unrelated individuals. American Journal of Human Genetics 78: 437–450.

SeattleSNPs. 2009. NHLBI Program for Genomic Applications. http://pga.gs.washington.edu.

Stephens, M., and P. Donnelly. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics 73: 1162–1169.

Stephens, M., N. J. Smith, and P. Donnelly. 2001. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68: 978–989.

The International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299–1320.

Zhang, K., Z. S. Qin, J. S. Liu, T. Chen, M. S. Waterman, and F. Sun. 2004. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Research 14: 908–916.
368 Using Stata with PHASE and Haploview
About the author
Chuck Huber is an assistant professor of biostatistics at the Texas A&M Health Science Center
School of Rural Public Health in the Department of Epidemiology and Biostatistics. He works
on projects in a variety of topical areas, but his primary area of interest is statistical genetics.
The Stata Journal (2010)
10, Number 3, pp. 369385
simsum: Analyses of simulation studies
including Monte Carlo error
Ian R. White
MRC Biostatistics Unit
Institute of Public Health
Cambridge, UK
ian.white@mrc-bsu.cam.ac.uk
Abstract. A new Stata command, simsum, analyzes data from simulation studies.
The data may comprise point estimates and standard errors from several analysis
methods, possibly resulting from several different simulation settings. simsum can
report bias, coverage, power, empirical standard error, relative precision, average
model-based standard error, and the relative error of the standard error. Monte
Carlo errors are available for all of these estimated quantities.
Keywords: st0200, simsum, simulation, Monte Carlo error, normal approximation,
sandwich variance
1 Introduction
Simulation studies are an important tool for statistical research (Burton et al. 2006), but
they are often poorly reported. In particular, to understand the role of chance in results
of simulation studies, it is important to estimate the Monte Carlo (MC) error, defined
as the standard deviation of an estimated quantity over repeated simulation studies.
However, this error is often not reported: Koehler, Brown, and Haneuse (2009) found
that of 323 articles reporting the results of a simulation study in Biometrics, Biometrika,
and the Journal of the American Statistical Association in 2007, only 8 articles reported
the MC error.
This article describes a new Stata command, simsum, that facilitates analyses of
simulated data. simsum analyzes simulation studies in which each simulated dataset
yields point estimates by one or more analysis methods. Bias, empirical standard error
(SE), and precision relative to a reference method can be computed for each method. If,
in addition, model-based SEs are available, then simsum can compute the average model-
based SE, the relative error in the model-based SE, the coverage of nominal confidence
intervals, and the power to reject a null hypothesis. MC errors are available for all
estimated quantities.
2 The simsum command
2.1 Syntax
simsum accepts data in wide or long format.

In wide format, data contain one record per simulated dataset, with results from multiple analysis methods stored as different variables. The appropriate syntax is

simsum estvarlist [if] [in] [, true(expression) options]

where estvarlist is a varlist containing point estimates from one or more analysis methods.

In long format, data contain one record per analysis method per simulated dataset, and the appropriate syntax is

simsum estvarname [if] [in] [, true(expression) methodvar(varname) id(varlist) options]
... \hat\theta_i with SE s_i. Define

\bar\theta = \frac{1}{n} \sum_i \hat\theta_i

V_{\hat\theta} = \frac{1}{n-1} \sum_i (\hat\theta_i - \bar\theta)^2

\bar{s^2} = \frac{1}{n} \sum_i s_i^2

V_{s^2} = \frac{1}{n-1} \sum_i \left( s_i^2 - \bar{s^2} \right)^2
Performance of \hat\theta: Bias and empse

Bias is defined as E(\hat\theta_i - \theta) and estimated by

estimated bias = \bar\theta - \theta, \qquad MC error = \sqrt{V_{\hat\theta}/n}    (1)

Precision is measured by the empirical standard deviation SD(\hat\theta_i) and is estimated by

empirical standard deviation = \sqrt{V_{\hat\theta}}, \qquad MC error = \sqrt{V_{\hat\theta}/\{2(n-1)\}}

assuming \hat\theta is normally distributed, as then (n-1)V_{\hat\theta}/\mathrm{var}(\hat\theta) \sim \chi^2_{n-1}.
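These two quantities and their MC errors can be checked by hand outside simsum; a sketch, assuming a hypothetical variable b holding the point estimates from n simulated datasets and a true value of 0.5:

quietly summarize b
display "bias         = " r(mean) - 0.5
display "its MC error = " sqrt(r(Var)/r(N))            // equation (1)
display "empirical SD = " sqrt(r(Var))
display "its MC error = " sqrt(r(Var)/(2*(r(N)-1)))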
Estimation method comparison: relprec

In a small change of notation, consider two estimators \hat\theta_1 and \hat\theta_2 with values \hat\theta_{1i} and \hat\theta_{2i} in the ith simulated dataset. The relative gain in precision for \hat\theta_2 compared with \hat\theta_1 is

relative gain in precision = V_{\hat\theta_1}/V_{\hat\theta_2}, \qquad MC error = \frac{2 V_{\hat\theta_1}}{V_{\hat\theta_2}} \sqrt{\frac{1 - \rho_{12}^2}{n-1}}

where \rho_{12} is the correlation of \hat\theta_1 with \hat\theta_2.
The MC error expression can be proved by observing the following: 1) var\{\log V_{\hat\theta_1}\} = var\{\log V_{\hat\theta_2}\} = 2/(n-1); 2) var\{\log(V_{\hat\theta_1}/V_{\hat\theta_2})\} = 4(1 - \rho_V)/(n-1), where \rho_V = corr(V_{\hat\theta_1}, V_{\hat\theta_2}); and 3) \rho_V = \rho_{12}^2. Result 3 may be derived by observing that V_{\hat\theta_1} \approx (1/n)\sum_i (\hat\theta_{1i} - \bar\theta_1)^2, so that under a bivariate normal assumption for (\hat\theta_1, \hat\theta_2),

n\,cov(V_{\hat\theta_1}, V_{\hat\theta_2}) \approx cov\{(\hat\theta_1 - \bar\theta_1)^2, (\hat\theta_2 - \bar\theta_2)^2\}

= cov[(\hat\theta_1 - \bar\theta_1)^2, E\{(\hat\theta_2 - \bar\theta_2)^2 \mid \hat\theta_1\}]

= cov\{(\hat\theta_1 - \bar\theta_1)^2, \rho_{12}^2 (V_{\hat\theta_2}/V_{\hat\theta_1})(\hat\theta_1 - \bar\theta_1)^2\}

= 2\rho_{12}^2 V_{\hat\theta_1} V_{\hat\theta_2}

where the third step follows because (\hat\theta_2 \mid \hat\theta_1) is normal with mean \rho_{12}\sqrt{V_{\hat\theta_2}/V_{\hat\theta_1}}(\hat\theta_1 - \bar\theta_1) and constant variance.
Performance of model-based SEs s_i: modelse and relerror

The average model-based SE is (by default) computed on the variance scale, because standard theory yields unbiased estimates of the variance, not of the standard deviation:

average model-based SE \bar{s} = \sqrt{\bar{s^2}}, \qquad MC error = \sqrt{V_{s^2}/(4 n \bar{s^2})}

using the Taylor series approximation var(X) \approx var(X^2)/\{4 E(X)^2\}.
We can now compute the relative error in the model-based SE as

relative error = \bar{s}/\sqrt{V_{\hat\theta}} - 1    (2)

MC error = \left( \bar{s}/\sqrt{V_{\hat\theta}} \right) \sqrt{V_{s^2}/(4 n \bar{s}^4) + 1/\{2(n-1)\}}    (3)

assuming that \bar{s} and V_{\hat\theta} are independent. If modelsemethod(mean) is specified, the average model-based SE is instead computed as \bar{s} = (1/n)\sum_i s_i, with

MC error = \sqrt{\frac{1}{n(n-1)} \sum_i (s_i - \bar{s})^2}

with consequent adjustments to equations (2) and (3).
Joint performance of \hat\theta and s_i: Cover and power

Let z_{\alpha/2} be the critical value from the normal distribution, or (if the number of degrees of freedom has been specified) the critical value from the appropriate t distribution. The coverage of a nominal 100(1-\alpha)% confidence interval is

coverage C = \frac{1}{n} \sum_i 1\left\{ |\hat\theta_i - \theta| < z_{\alpha/2} s_i \right\}, \qquad MC error = \sqrt{C(1-C)/n}

where 1(\cdot) is the indicator function. The power of a significance test at the \alpha level is

power P = \frac{1}{n} \sum_i 1\left\{ |\hat\theta_i| \geq z_{\alpha/2} s_i \right\}, \qquad MC error = \sqrt{P(1-P)/n}
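Coverage can likewise be verified by hand; a sketch, assuming hypothetical variables b and se from n simulated datasets, a true value of 0.5, and nominal 95% intervals:

generate byte covered = abs(b - 0.5) < invnormal(.975)*se
quietly summarize covered
display "coverage = " r(mean)
display "MC error = " sqrt(r(mean)*(1 - r(mean))/r(N))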
Robust MC errors

Several of the MC errors presented above require a normality assumption. Alternative approximations can be derived using an estimating-equations method. The empirical standard deviation \sqrt{V_{\hat\theta}} can be written as the solution \hat\beta of

\sum_i \left\{ \frac{n}{n-1}(\hat\theta_i - \bar\theta)^2 - \beta^2 \right\} = 0
The relative precision of \hat\theta_2 compared with \hat\theta_1 can be written as the solution \hat\beta of

\sum_i \left\{ (\hat\theta_{1i} - \bar\theta_1)^2 - (\beta + 1)(\hat\theta_{2i} - \bar\theta_2)^2 \right\} = 0
The relative error in the model-based SE can be written as the solution \hat\beta of

\sum_i \left\{ s_i^2 - (\beta + 1)^2 V_{\hat\theta} \right\} = 0
provided that the modelsemethod(rmse) method is used. (If modelsemethod(mean) is specified, it is ignored in computing robust MC errors.) Ignoring the uncertainty in the sample means \bar\theta, \bar\theta_1, and \bar\theta_2, each estimating equation is of the form

\sum_i \left\{ T_i - f(\hat\beta) B_i \right\} = 0
so the sandwich variance (White 1982) is given by

var\{f(\hat\beta)\} = \frac{\sum_i \{T_i - f(\hat\beta) B_i\}^2}{\left( \sum_i B_i \right)^2}

and using the delta method,

var(\hat\beta) = var\{f(\hat\beta)\} / f'(\hat\beta)^2
Finally, as an attempt to allow for uncertainty in the sample means, we multiply the
sandwich variance by n/(n - 1). A rationale is that this agrees exactly with (1) if the
method is applied to the MC error of the bias. However, most simulation studies are
large enough that this correction is unimportant.
5 Evaluations
Most of the formulas used by simsum to compute MC errors involve approximations, so
I evaluated them in two simulation studies.
5.1 Multiple imputation, revisited
First, I repeated 250 times the simulation study described in section 3. The data have the same format as before, with a new variable, simno, identifying the 250 different simulation studies. I ran simsum twice. In the first run, each quantity and its MC error was computed in each simulation study:
. simsum b, true(0.5) methodvar(method) id(dataset) se(se) mcse by(simno)
> bias empse relprec modelse relerror cover power nolist clear
Reshaping data to wide format ...
Starting to process results ...
Results are now in memory.
The data are now held in memory, with one record for each statistic for each of the 250 simulation studies. The statistics are identified by the values of a newly created numerical variable statnum, and the different simulation studies are still identified by simno. The variables bCC, bMI_LOGT, and bMI_T contain the analysis results for the three methods. MC errors in variables are suffixed with _mcse. In the second run, these values are treated as ordinary output from a simulation study, and the average calculated MC error is compared with the empirical MC error.
. simsum bCC bMI_LOGT bMI_T, sesuffix(_mcse) by(statnum) mcse gen(newstat)
> empse modelse relerror nolist clear
Warning: found 250 observations with missing values
Starting to process results ...
Results are now in memory.
The 250 observations with missing values refer to the relative precisions, which are missing for the reference method (CC). Average calculated MC errors for each statistic are compared in table 1 with empirical MC errors. The calculated MC errors are naturally similar to those reported in the single simulation study above (some values have been multiplied by 1,000 for convenience). Empirical MC errors are close to the model-based values. The only exception is for coverage, where the model-based MC errors appear rather small for methods CC and MI_LOGT. This is likely to be a chance finding, because there is no doubt about the accuracy of the model-based MC formula for this statistic.
Table 1. Simulation study comparing three ways to handle incomplete covariates in a Cox model: Comparison of average calculated MC error (Calc) with empirical MC error (Emp) for various statistics

                      CC method               MI_LOGT method          MI_T method
Statistic(1)     Emp   Calc  %error(2)    Emp   Calc  %error(2)    Emp   Calc  %error(2)

Bias × 1000     4.79   4.74   1.1 (4.4)  4.23   4.17   1.3 (4.4)  4.22   4.18   1.0 (4.4)
EmpSE × 1000    3.37   3.36   0.3 (4.5)  3.11   2.95   5.2 (4.3)  3.11   2.96   4.9 (4.3)
RelPrec            .      .           .  4.21   3.97   5.7 (4.2)  4.13   3.97   4.1 (4.3)
ModSE × 1000    0.52   0.50   3.1 (4.3)  0.59   0.59   0.3 (4.5)  0.60   0.59   3.1 (4.4)
RelErr          2.16   2.22   2.9 (4.6)  2.40   2.34   2.6 (4.4)  2.43   2.33   4.2 (4.3)
Cover           0.62   0.70  13.5 (5.1)  0.61   0.68  11.3 (5.0)  0.67   0.68   1.8 (4.6)
Power           0.74   0.73   1.4 (4.4)  0.59   0.59   0.4 (4.5)  0.60   0.59   2.4 (4.4)

(1) Statistics are abbreviated as follows: Bias, bias in point estimate; EmpSE, empirical SE; RelPrec, % gain in precision relative to method CC; ModSE, RMS model-based SE; RelErr, relative % error in SE; Cover, coverage of nominal 95% confidence interval; Power, power of 5% level test.
(2) Relative % error in average calculated SE, with its MC error in parentheses.
5.2 Nonnormal joint distributions
In a second evaluation, I simulated 100,000 datasets of size n = 100 from the model X \sim N(0, 1), Y \sim Bern(0.5). I then estimated the parameter \beta in the logistic regression model

logit P(Y = 1 \mid X) = \alpha + \beta X    (4)

in two ways: 1) \hat\beta_{LR} was the maximum likelihood estimate from fitting the logistic regression model (4), and 2) \hat\beta_{LDA} was the estimate from linear discriminant analysis (LDA), fitting the linear regression model

X \mid Y \sim N(\gamma + \delta Y, \sigma^2)

and taking \hat\beta_{LDA} = \hat\delta/\hat\sigma^2.
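Per simulated dataset, these two estimates can be computed along the following lines (a sketch with hypothetical variable names y and x):

logit y x                      // 1) maximum likelihood estimate: _b[x]
regress x y                    // 2) LDA route: regress X on Y
display _b[y]/(e(rmse)^2)      //    beta_LDA = delta-hat / sigma-hat^2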
The 100,000 datasets were divided into 100 simulation studies, each of 1,000 simulated datasets. The quantities described above and their SEs were calculated for each simulation study, except that power for testing \beta = 0 was not computed because this null hypothesis was true. Finally, the empirical MC error of each quantity across simulation studies was compared with the average MC error estimated within each simulation study.

Results are shown in table 2. The calculated MC error is adequate for all quantities except for the relative precision of LDA compared with logistic regression, for which the calculated SE is some three times too small. This appears to be due to the nonnormal joint distribution of the parameter estimates shown in figure 1. The robust MC errors perform well in all cases.
Table 2. Simulation study comparing LDA with logistic regression: Comparison of empirical with average calculated MC errors for various statistics

                                                      MC error
                                                           Average calculated
Quantity              Method        Mean   Empirical    Normal    Robust

Bias × 1000           Logistic      0.41       6.79       6.71        .
                      LDA           0.41       6.66       6.57        .
Empirical SE × 1000   Logistic    212.00       4.78       4.74     5.07
                      LDA         207.86       4.69       4.65     4.97
% gain in precision   Logistic         .          .          .        .
                      LDA          4.027      0.124      0.048    0.131
Model SE × 1000       Logistic    207.32       0.51       0.51        .
                      LDA         203.12       0.48       0.47        .
% error in model SE   Logistic      2.16       2.13       2.20     2.26
                      LDA           2.23       2.18       2.20     2.30
% coverage            Logistic     95.36       0.60       0.66        .
                      LDA          94.70       0.64       0.71        .
Figure 1. Scatterplot of the difference \hat\beta_{LDA} - \hat\beta_{LR} against the average (\hat\beta_{LDA} + \hat\beta_{LR})/2 in 2,000 simulated datasets (y axis: Difference, LDA - LR; x axis: Average of LDA and LR)
6 Discussion
I hope that simsum will help statisticians improve the reporting of their simulation studies. In particular, I hope simsum will help them think about and report MC errors. If MC errors are too large to enable the desired conclusions to be drawn, then it is usually straightforward to increase the sample size, a luxury rarely available in applied research.

For three statistics (the empirical SE, and the relative precision and relative error in the model-based SE), I have proposed two approximate MC error methods, one based on a normality assumption and one based on a sandwich estimator. The MC error should only be taken as a guide, so errors of some 10–20% in calculating the MC error are of little importance. In most cases, both MC error methods performed adequately. However, the normality-based MC error was about three times too small when evaluating the relative precision of two estimators with a highly nonnormal joint distribution (figure 1). It is good practice to examine the marginal and joint distributions of parameter estimates in simulation studies, and this practice should be used to guide the choice of MC error method.
Other methods are available for estimating MC errors. Koehler, Brown, and Haneuse (2009) proposed more computationally intensive techniques that are available for implementation in R. Other software (Doornik and Hendry 2009) is available with an econometric focus.
7 Acknowledgment
This work was supported by MRC grant U.1052.00.006.
8 References
Burton, A., D. G. Altman, P. Royston, and R. L. Holder. 2006. The design of simulation studies in medical statistics. Statistics in Medicine 25: 4279–4292.

Doornik, J. A., and D. F. Hendry. 2009. Interactive Monte Carlo Experimentation in Econometrics Using PcNaive 5. London: Timberlake Consultants Press.

Koehler, E., E. Brown, and S. J.-P. A. Haneuse. 2009. On the assessment of Monte Carlo error in simulation-based statistical analyses. American Statistician 63: 155–162.

Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: Wiley.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.

Royston, P. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9: 466–477.

van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.

White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25.

White, I. R., and P. Royston. 2009. Imputing missing covariate values for the Cox model. Statistics in Medicine 28: 1982–1998.
About the author
Ian R. White is a program leader at the MRC Biostatistics Unit in Cambridge, United Kingdom.
His research interests focus on handling missing data, noncompliance, and measurement error
in the analysis of clinical trials, observational studies, and meta-analysis. He frequently uses
simulation studies.
The Stata Journal (2010)
10, Number 3, pp. 386394
Projection of power and events in clinical trials
with a time-to-event outcome
Patrick Royston
Hub for Trials Methodology Research
MRC Clinical Trials Unit and University College London
London, UK
pr@ctu.mrc.ac.uk
Friederike M.-S. Barthel
Oncology Research & Development
GlaxoSmithKline
Uxbridge, UK
FriederikeB@ctu.mrc.ac.uk
Abstract. In 2005, Barthel, Royston, and Babiker presented a menu-driven Stata program under the generic name of ART (assessment of resources for trials) to calculate sample size and power for complex clinical trial designs with a time-to-event or binary outcome. In this article, we describe a Stata tool called ARTPEP, which is intended to project the power and events of a trial with a time-to-event outcome into the future, given patient accrual figures so far and assumptions about event rates and other defining parameters. ARTPEP has been designed to work closely with the ART program and has an associated dialog box. We illustrate the use of ARTPEP with data from a phase III trial in esophageal cancer.
Keywords: st0013_2, artpep, artbin, artsurv, artmenu, randomized controlled trial,
time-to-event outcome, power, number of events, projection, ARTPEP, ART
1 Introduction
Barthel, Royston, and Babiker (2005) presented a menu-driven Stata program under the generic name of ART (assessment of resources for trials) to calculate sample size and power for complex clinical trial designs with a time-to-event or binary outcome. Briefly, the features of ART include multiarm trials, dose–response trends, arbitrary failure-time distributions, nonproportional hazards, nonuniform rates of patient entry, loss to follow-up, and possible changes from allocated treatment. A full report on the methodology and its performance, in particular regarding loss to follow-up, nonproportional hazards, and treatment crossover, is given by Barthel et al. (2006).

In this article, we concentrate on a new tool that addresses a practical issue in trials with a time-to-event outcome. Because of staggered entry of patients and the gradual maturing of the data, the accumulation of events from the date the trial opens is a process that occurs over a relatively long period of time and with a variable course. Trials are planned and their resources are assigned under certain critical assumptions.
If those assumptions are unrealistic, timely completion of the trial may be threatened.
Because the cumulative number of events is the key indicator of trial maturity and is
the parameter targeted in the sample-size calculation, it is of considerable interest and
relevance to monitor and project this number at particular points during the trial.
The new tool is called ARTPEP (ART projection of events and power). ARTPEP
comprises an ado-file (artpep) and an associated dialog box. It works in conjunction
with the ART system, of which the latest update is included with this article.
2 Example: A trial in advanced esophageal cancer
2.1 Sample-size calculation using ART
As an example, we describe sample-size calculation and ARTPEP analysis of a typical
cancer trial. The OE05 trial in advanced esophageal carcinoma is coordinated by the MRC Clinical Trials Unit. The protocol is available online at http://www.ctu.mrc.ac.uk/plugins/StudyDisplay/protocols/OE05%20Protocol%20Version%205%2031st%20July%202008.pdf. The design, which comprises two randomized groups of patients with equal allocation, aims to test the hypothesis that a new chemotherapy regimen, in conjunction with surgery, improves overall survival at 3 years.
According to the protocol, the probability of 3-year survival in this patient group
is 30%, and the trial has 82% power at the 5% two-sided significance level to detect
an improvement in overall survival to 38%. The overall sample size is stated to be 842
patients, and the required number of events is 673. The plan is to recruit patients over
6 years and to follow up with them for a further 2 years before performing the definitive
analysis of the outcome (overall survival).
The description in the protocol provides nearly all the ingredients for an ART sample-
size and power calculation. The only missing item is the target hazard ratio, which is
ln(0.38)/ln(0.30) = 0.80 under proportional hazards of the treatment effect (a standard
assumption). We first use the artsurv command (Barthel, Royston, and Babiker 2005)
to verify the sample-size calculation and to set up some of the parameter values needed
by ARTPEP. We supply the other design features, and then we run the artsurv command
to compute the power and events:
. artsurv, method(l) nperiod(8) ngroups(2) edf0(0.3, 3) hratio(1, 0.80) n(842)
> alpha(0.05) recrt(6)
ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.7, 19 October 2009)
A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,
MRC Clinical Trials Unit, London NW1 2DA, UK.
Type of trial Superiority - time-to-event outcome
Statistical test assumed Unweighted logrank test (local)
Number of groups 2
Allocation ratio Equal group sizes
Total number of periods 8
Length of each period One year
Survival probs per period (group 1) 0.669 0.448 0.300 0.201 0.134 0.090
0.060 0.040
Survival probs per period (group 2) 0.725 0.526 0.382 0.277 0.201 0.146
0.106 0.077
Number of recruitment periods 6
Number of follow-up periods 2
Method of accrual Uniform
Recruitment period-weights 1 1 1 1 1 1 0 0
Hazard ratios as entered (groups 1,2) 1, 0.80
Alpha 0.050 (two-sided)
Power (calculated) 0.824
Total sample size (designed) 842
Expected total number of events 673
Apart from small, unimportant differences, the protocol power (0.82) and the number
of events (673) are consistent with ART's results.
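As a quick check of the arithmetic behind the target hazard ratio, one can type

. display ln(0.38)/ln(0.30)

which returns approximately 0.8037, the value rounded to 0.80 in the design.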
2.2 Analysis with ARTPEP
To run ARTPEP successfully, three preliminary steps are required:
1. You must activate the ART and ARTPEP items on the User menu by typing the
command artmenu on.
2. You must compute the relevant sample size for the trial using either the ART
dialog box or the artsurv command. This automatically sets up a global macro
called $S_ARTPEP whose contents are used by the artpep command. (A slightly
more convenient alternative with the same result is to use the ART Settings...
button on the ARTPEP dialog box to set up the necessary quantities for ART
without having to run ART or artsurv separately.)
3. To set up additional parameters that ARTPEP needs, you must use the ARTPEP
dialog box, either by typing db artpep or by selecting User > ART > Artpep
from the menu.
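Collected in one place, the three steps correspond to the following command sequence
(step 2 reuses the artsurv call from section 2.1; the dialog route is equivalent):

. artmenu on
. artsurv, method(l) nperiod(8) ngroups(2) edf0(0.3, 3) hratio(1, 0.80) n(842)
> alpha(0.05) recrt(6)
. db artpep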
As a worked example, we now imagine that the OE05 trial has been running for
1 year and has accrued 100 patients so far. Assuming the survival distribution to be
correct, when may we expect to complete the trial (that is, obtain the required number
of events)? To answer this question, we complete the three steps described above. The
resulting empty dialog box is shown in figure 1.
Figure 1. Incomplete ARTPEP dialog box
We now explain the various items that the dialog box needs. The name of the
corresponding option for the artpep command is given in square brackets:
ART Settings...: As already mentioned, this button may be used to set up the
parameters of an ART run if that has not been done already. It accesses the ART
dialog box.
Patients recruited in each period so far [pts]: A period here is 1 year, and we
have recruited 100 patients in the first period. We therefore enter 100 for this
item.
Additional patients to be recruited [epts]: To get to the 842 patients (we will
use 850), we hope to recruit about 150 patients per year for the next 5 years,
making a total of 6 years' planned recruitment. We enter 150. The program knows
the period in which recruitment is to cease and, by default, repeats the number
150 over the next 5 periods. If we had expected a differing recruitment rate (say,
accelerating toward the end of the trial), we could have entered a different number
of patients to be recruited in each period.
Number of periods over which to project [eperiods]: Let us say we wish to project
events and power over the next 10 years. We enter 10.
Period in which recruitment ceases [stoprecruit]: Here we enter the number of periods
after which recruitment is to cease. The number must be no smaller than the
number of periods implied by Patients recruited in each period so far [pts]. If
the option is left blank, it is assumed that recruitment continues indefinitely. As
already noted, we wish to stop recruitment at 850 patients, which we will achieve
by the end of period 6. We therefore enter 6 for this item.
Period to start reporting projections [startperiod]: Usually, we want to enter 1
here, signifying the start of the trial. By default, if the item is left blank, the
program assumes that the current period is intended. We enter 1.
Save using filename [using]: The numerical results of the artpep run can be saved
to a .dta file for a permanent record or for plotting. We leave the item blank.
Start date of trial (ddmmmyyyy) [datestart]: If we enter the start date, the output
from artpep is conveniently labeled with the calendar date of the end of each
period. We recommend using this option. We enter 01jan2009.
The completed ARTPEP dialog box is shown in figure 2.
Figure 2. Completed ARTPEP dialog box for the OE05 trial
After submitting the above setup to Stata (version 10 or later), we get the following
result:
. artpep, pts(100) $S_ARTPEP epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009)
Date year #pats #C-events #events Power
31dec2009 1 100 9 17 0.06498
31dec2010 2 250 36 66 0.14480
31dec2011 3 400 79 146 0.26850
31dec2012 4 550 132 247 0.41622
31dec2013 5 700 193 362 0.56360
31dec2014 6 850 258 488 0.69209
31dec2015 7 850 314 597 0.77737
31dec2016 8 850 351 673 0.82423
31dec2017 9 850 375 726 0.85155
31dec2018 10 850 392 763 0.86825
31dec2019 11 850 403 789 0.87882
The program reports the total number of events (#events) and the number of events in
the control arm (#C-events), which are often of interest. The required total number of
events (that is, both arms combined) of 673 is projected to be reached on 31 December
2016, the end of period 8. We expect 351 events in the control arm by that time. The
projection is not surprising because the accrual figures that have been entered more or
less agree with the trial plan. Nevertheless, the output shows us the expected progress
of the number of events and the power over time. The trial may be monitored (and the
ARTPEP analysis updated) to follow its progress.
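For instance, if after 2 years the trial had actually accrued 100 and 140 patients in
periods 1 and 2 (hypothetical figures used only for illustration), the updated projection
over the same horizon could be obtained with

. artpep, pts(100 140) epts(150) eperiods(9) startperiod(1) stoprecruit(6)
> datestart(01jan2009) edf0(0.3, 3) hratio(1, 0.80)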
The dialog box has, as usual, created and run the necessary artpep command line.
The second item in the command is $S_ARTPEP. As already mentioned, it contains
additional information needed by artpep. On displaying its contents, we find
. display "$S_ARTPEP"
alpha(.05) aratios() hratio(1, 0.80) ngroups(2) ni(0) onesided(0) trend(0)
> tunit(1) edf0(0.3, 3) median(0) method(l)
The key pieces of information here are hratio(1, 0.80) and edf0(0.3, 3), which
specify the hazard ratios in groups 1 and 2, and the survival function in group 1,
respectively. All the other items are default values and could be omitted here. The
example could have been run directly from the command line as follows:
. artpep, pts(100) edf0(0.3, 3) epts(150) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
2.3 Sensitivity analysis of the event rate
We have assumed a 30% survival probability 3 years after recruitment. Suppose, in fact,
that the patients do better than that: their 3-year survival is 40% instead. What effect
would that have on the power and events timeline?
We need only change the edf0() option to edf0(0.4, 3):
. artpep, pts(100) epts(150) edf0(0.4, 3) eperiods(10) startperiod(1)
> stoprecruit(6) datestart(01jan2009) hratio(1, 0.8)
Date year #pats #C-events #events Power
31dec2009 1 100 7 13 0.05869
31dec2010 2 250 29 53 0.12410
31dec2011 3 400 65 119 0.22732
31dec2012 4 550 111 205 0.35714
31dec2013 5 700 165 306 0.49586
31dec2014 6 850 224 419 0.62612
31dec2015 7 850 277 522 0.72135
31dec2016 8 850 316 600 0.77974
31dec2017 9 850 345 660 0.81685
31dec2018 10 850 366 705 0.84128
31dec2019 11 850 382 739 0.85784
The time to observe the required number of events has been delayed by more than
1 year: the projected 673 events are now not reached until period 10 (31dec2018).
3 Syntax
Once you have gained a little experience with using the ARTPEP dialog box, you will
find it more natural and efficient to use the command line. The syntax of artpep is as
follows:
artpep [using filename] , pts(numlist) edf0(slist0) [epts(numlist)
    eperiods(#) stoprecruit(#) startperiod(#) datestart(ddmmmyyyy)
    replace artsurv_options]
4 Options
pts(numlist) is required. numlist specifies the number of patients recruited in each
period since the start of the trial, that is, since randomization. See help on artsurv
for the definition of a period. The number of items in numlist defines the number
of periods of recruitment so far. For example, pts(23 12 25) specifies three initial
periods of recruitment, with recruitment of 23 patients in period 1, 12 in period 2,
and 25 in period 3. The current period would be period 3 and would be demarcated
by parallel lines in the output.
edf0(slist0) is required and gives the survival function in the control group (group 1).
This need not be one of the survival distributions to be compared in the trial, unless
hratio() = 1 for at least one of the groups. The format of slist0 is #1 [#2 ... #r,
#1 #2 ... #r]. Thus edf0(p_1 p_2 ... p_r, t_1 t_2 ... t_r) gives the value p_i of the
survival function for the event time at the end of time period t_i, i = 1, ..., r.
Instantaneous event rates (that is, hazards) are assumed constant within time periods;
that is, the distribution of time-to-event is assumed to be piecewise exponential. When
used in a given calculation up to period T, t_r may validly be less than, equal to, or
greater than T. If t_r ≤ T, the rules described in the edf0() option of artsurv are
applied to compute the survival function at all periods ≤ T. If t_r > T, the same
calculation is used, but estimated survival probabilities for periods > T are not used
in the calculation at T, although they may of course be used in calculations (for
example, projections of sample size and events) for periods later than T. Be aware
that use of the median() option (an alternative to edf0()) and the fp() option of
artsurv may modify the effects and interpretation of edf0().
epts(numlist) specifies in numlist the number of additional patients to be recruited in
each period following the recruitment phase defined by the pts() option. For exam-
ple, pts(23 12 25) epts(30 30) would specify three initial periods of recruitment
followed by two further periods. A projection of events and power is required over
the two further periods. The initial recruitment is of 23 patients in period 1, 12 in
period 2, and 25 in period 3; in each of periods 4 and 5, we expect to recruit an
additional 30 patients. If the number of items in (or implied by expanding) numlist
is less than the number of projection periods specified by eperiods(), the final value
in numlist is replicated as necessary to all subsequent periods. If epts() is not given,
the default is that the mean of the numbers of patients specified in pts() is used for
all projections. An example illustrating the replication rule appears after this list of
options.
eperiods(#) specifies the number of future periods over which projection of power and
number of events is to be calculated. The default is eperiods(1).
stoprecruit(#) specifies the number of periods after which recruitment is to cease. #
must be no smaller than the number of periods of recruitment implied by pts(). The
default is stoprecruit(0), meaning to continue recruiting indefinitely (no follow-up
phase).
startperiod(#) specifies # as the period in which to start reporting the projec-
tions of events and power. To report from the beginning of the trial, specify
startperiod(1). Note that startperiod() does not affect the period at which
the calculations are started, only how the results are reported. The default # is the
last period defined by pts().
datestart(ddmmmyyyy) signifies the opening date of the trial (that is, when recruit-
ment started), for example, datestart(14oct2009). The date of the end of each
period is used to label the output and is stored in filename if using is specified.
replace allows filename to be replaced if it already exists.
artsurv_options are any of the options of artsurv except recrt(), nperiod(), power(),
and n().
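To illustrate the replication rule in epts(), the following sketch (with hypothetical
accrual figures and the design parameters of section 2) recruits 23, 12, and 25 patients
in periods 1-3, expands epts(30) to 30 patients in each of periods 4 and 5, stops
recruitment after period 5, and projects events and power through period 7:

. artpep, pts(23 12 25) epts(30) eperiods(4) stoprecruit(5) edf0(0.3, 3)
> hratio(1, 0.80)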
5 Final comments
We have illustrated ARTPEP with a basic example. However, ARTPEP understands
the more complex options of artsurv. Therefore, complex features, including loss to
follow-up, treatment crossover, and nonproportional hazards, can be allowed for in the
projection of power and events.
Sometimes it is desirable to make projections on a finer time scale than 1 year,
for example, in 3- or 6-month periods. This is easily done by adjusting the period
parameters used in ART and ARTPEP.
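For example, to rework the OE05 projection in 6-month periods, one could halve the
per-period accrual and double the period counts (a hypothetical rescaling of the run in
section 2.2; edf0(0.3, 6) now places the 30% survival probability at the end of period
6, that is, at 3 years):

. artpep, pts(50 50) edf0(0.3, 6) epts(75) eperiods(20) startperiod(1)
> stoprecruit(12) datestart(01jan2009) hratio(1, 0.80)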
6 References
Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of
sample size and power for multi-arm survival trials allowing for non-uniform accrual,
non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25:
2521–2542.
Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex
sample size calculation in randomized controlled trials with a survival or a binary
outcome: Update. Stata Journal 5: 123–129.
About the authors
Patrick Royston is a medical statistician with 30 years of experience, with a strong interest in
biostatistical methods and in statistical computing and algorithms. He now works in cancer
clinical trials and related research issues. Currently, he is focusing on problems of model
building and validation with survival data, including prognostic factor studies; on parametric
modeling of survival data; on multiple imputation of missing values; and on novel clinical trial
designs.
Friederike Barthel is a senior statistician in Oncology Research & Development at Glaxo-
SmithKline. Previously, she worked at the MRC Clinical Trials Unit and the Institute of
Psychiatry. Her current research interests include sample-size issues, particularly concerning
multistage, multiarm trials, microarray study analyses, and competing risks. Friederike has
taught undergraduate courses in statistics at the University of Westminster and at Kingston
University.
The Stata Journal (2010)
10, Number 3, pp. 395–407
metaan: Random-effects meta-analysis
Evangelos Kontopantelis
National Primary Care
Research & Development Centre
University of Manchester
Manchester, UK
e.kontopantelis@manchester.ac.uk
David Reeves
Health Sciences Primary Care
Research Group
University of Manchester
Manchester, UK
david.reeves@manchester.ac.uk
Abstract. This article describes the new meta-analysis command metaan, which
can be used to perform fixed- or random-effects meta-analysis. Besides the stan-
dard DerSimonian and Laird approach, metaan offers a wide choice of available
models: maximum likelihood, profile likelihood, restricted maximum likelihood,
and a permutation model. The command reports a variety of heterogeneity mea-
sures, including Cochran's Q, I², H²_M, and the between-studies variance estimate
τ². A forest plot and a graph of the maximum likelihood function can also be
generated.
Keywords: st0201, metaan, meta-analysis, random effect, effect size, maximum
likelihood, profile likelihood, restricted maximum likelihood, REML, permutation
model, forest plot
1 Introduction
Meta-analysis is a statistical methodology that integrates the results of several inde-
pendent clinical trials that are considered by the analyst to be combinable
(Huque 1988). Usually, this is a two-stage process: in the first stage, the appropriate
summary statistic for each study is estimated; then in the second stage, these statis-
tics are combined into a weighted average. Individual patient data (IPD) methods
exist for combining and meta-analyzing data across studies at the individual patient
level. An IPD analysis provides advantages such as standardization (of marker values,
outcome definitions, etc.), follow-up information updating, detailed data-checking, sub-
group analyses, and the ability to include participant-level covariates (Stewart 1995;
Lambert et al. 2002). However, individual observations are rarely available; addition-
ally, if the main interest is in mean effects, then the two-stage and the IPD approaches
can provide equivalent results (Olkin and Sampson 1998).
This article concerns itself with the second stage of the two-stage approach to meta-
analysis. At this stage, researchers can select between two main approaches, the fixed-
effects (FE) or the random-effects model, in their efforts to combine the study-level
summary estimates and calculate an overall average effect. The FE model is simpler
and assumes the true effect to be the same (homogeneous) across studies. However, ho-
mogeneity has been found to be the exception rather than the rule, and some degree of
true effect variability between studies is to be expected (Thompson and Pocock 1991).
Two sorts of between-studies heterogeneity exist: clinical heterogeneity stems from dif-
ferences in populations, interventions, outcomes, or follow-up times, and methodological
heterogeneity stems from differences in trial design and quality (Higgins and Green 2009;
Thompson 1994). The most common approach to modeling the between-studies variance
is the model proposed by DerSimonian and Laird (1986), which is widely used in generic
and specialist meta-analysis statistical packages alike. In Stata, the DerSimonian–Laird
(DL) model is used in the most popular meta-analysis commands: the recently up-
dated metan and the older but still useful meta (Harris et al. 2008). However, the
between-studies variance component can be estimated using more-advanced (and com-
putationally expensive) iterative techniques: maximum likelihood (ML), profile likeli-
hood (PL), and restricted maximum likelihood (REML) (Hardy and Thompson 1996;
Thompson and Sharp 1999). Alternatively, the estimate can be obtained using non-
parametric approaches, such as the permutations (PE) model proposed by Follmann
and Proschan (1999).
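To fix ideas, the DL approach can be sketched in a few lines of Stata (an illustration of
the method, not metaan's internal code; the variable names effsize and se match the
example dataset of section 4):

* sketch of the DerSimonian-Laird computations for k studies in memory
generate double w = 1/se^2                     // inverse-variance (FE) weights
quietly summarize w
scalar sumw = r(sum)
generate double w2 = w^2
quietly summarize w2
scalar sumw2 = r(sum)
generate double wy = w*effsize
quietly summarize wy
scalar mu_fe = r(sum)/sumw                     // FE pooled estimate
generate double qi = w*(effsize - mu_fe)^2
quietly summarize qi
scalar Q = r(sum)                              // Cochran's Q
scalar tau2 = max(0, (Q - (_N-1))/(sumw - sumw2/sumw))   // DL tau^2 estimate
generate double wstar = 1/(se^2 + tau2)        // random-effects weights
generate double wsy = wstar*effsize
quietly summarize wstar
scalar sumws = r(sum)
quietly summarize wsy
display "DL pooled effect = " r(sum)/sumws "   tau^2 = " tau2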
We have implemented these models in metaan, which performs the second stage
of a two-stage meta-analysis and offers alternatives to the DL random-effects model.
The command requires the studies' effect estimates and standard errors as input. We
have also created metaeff, a command that provides support in the first stage of the
two-stage process and complements metaan. The metaeff command calculates for each
study the effect size (standardized mean difference) and its standard error from the
input parameters supplied by the user, using one of the models described in the Cochrane
Handbook for Systematic Reviews of Interventions (Higgins and Green 2006). For more
details, type ssc describe metaeff in Stata or see Kontopantelis and Reeves (2009).
The metaan command does not offer the plethora of options metan does for in-
putting various types of binary or continuous data. Other useful features in metan
(unavailable in metaan) include stratified meta-analysis, user-input study weights, vac-
cine efficacy calculations, the Mantel–Haenszel FE method, L'Abbe plots, and funnel
plots. The REML model, assumed to be the best model for fitting a random-effects
meta-analysis model even though this assumption has not been thoroughly investi-
gated (Thompson and Sharp 1999), has recently been coded in the updated meta-
regression command metareg (Harbord and Higgins 2008) and the new multivariate
random-effects meta-analysis command mvmeta (White 2009). However, the output
and options provided by metaan can be more useful in the univariate meta-analysis
context.
2 The metaan command
2.1 Syntax
metaan varname1 varname2 [if] [in], ...

[...]

The ML and REML approaches maximize, respectively, the log-likelihood functions

    log L(μ, τ²) = −(1/2) Σ_{i=1}^{k} [ log{2π(σ_i² + τ²)} + (y_i − μ)²/(σ_i² + τ²) ],   τ² ≥ 0    (1)

    log L_R(μ, τ²) = −(1/2) [ Σ_{i=1}^{k} log{2π(σ_i² + τ²)} + Σ_{i=1}^{k} (y_i − μ)²/(σ_i² + τ²) ]
                     − (1/2) log{ Σ_{i=1}^{k} 1/(σ_i² + τ²) },   τ² ≥ 0    (2)

where k is the number of studies to be meta-analyzed, y_i and σ_i² are the effect and
variance estimates for study i, and μ is the overall effect estimate.
ML follows the simplest approach, maximizing (1) in a single iteration loop. A criti-
cism of ML is that it takes no account of the loss in degrees of freedom that results from
estimating the overall effect. REML derives the likelihood function in a way that adjusts
for this and removes downward bias in the between-studies variance estimator. A use-
ful description for REML, in the meta-analysis context, has been provided by Normand
(1999). PL uses the same likelihood function as ML but uses nested iterations to take
into account the uncertainty associated with the between-studies variance estimate when
calculating an overall effect. By incorporating this extra factor of uncertainty, PL yields
CIs that are usually wider than for DL and also are asymmetric. PL has been shown to
outperform DL in various scenarios (Brockwell and Gordon 2001).
The PE model (Follmann and Proschan 1999) can be described as follows: First, in
line with a null hypothesis that all true study effects are zero and observed effects are due
to random variation, a dataset of all possible combinations of observed study outcomes
is created by permuting the sign of each observed effect. Next, the DL model is used to
compute an overall effect for each combination. Finally, the resulting distribution of
overall effect sizes is used to derive a CI for the observed overall effect.
Method performance is known to be affected by three factors: the number of studies
in the meta-analysis, the degree of heterogeneity in true effects, and, provided there is
heterogeneity present, the distribution of the true effects (Brockwell and Gordon 2001).
Heterogeneity, which is attributed to clinical or methodological diversity (Higgins and
Green 2006), is a major problem researchers have to face when combining study results
in a meta-analysis. The variability that arises from different interventions, populations,
outcomes, or follow-up times is described by clinical heterogeneity, while differences in
trial design and quality are accounted for by methodological heterogeneity (Thompson
1994). Traditionally, heterogeneity is tested with Cochran's Q, which provides a p-value
for the test of homogeneity when compared with a chi-squared distribution with k − 1
degrees of freedom, where k is the number of studies (Brockwell and Gordon 2001).
However, the test is known to be poor at detecting heterogeneity because its power is
low when the number of studies is small (Hardy and Thompson 1998). An alternative
measure is I², which is thought to be more informative in assessing inconsistency
between studies. I² values of 25%, 50%, and 75% correspond to low, moderate, and high
heterogeneity, respectively (Higgins et al. 2003). Another measure is H²_M, the measure
least affected by the value of k. It takes values in the [0, +∞) range, with 0 indicating
perfect homogeneity (Mittlböck and Heinzl 2006). Obviously, the between-studies
variance estimate τ² can also be informative about the presence or absence of
heterogeneity.
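For reference, the usual moment formulas behind these measures (stated here, as
commonly defined in the literature, to fix notation; with w_i = 1/σ_i² and the FE pooled
estimate μ_FE = Σ_i w_i y_i / Σ_i w_i) are

    Q = Σ_{i=1}^{k} w_i (y_i − μ_FE)²,   I² = max{0, (Q − (k − 1))/Q} × 100%,   H²_M = (Q − (k − 1))/(k − 1)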
The test for heterogeneity is often used as the basis for applying an FE or a random-
effects model. However, the often low power of the Q test makes it unwise to base a
decision on the result of the test alone. Research studies, even on the same topic, can
vary on a large number of factors; hence, homogeneity is often an unlikely assumption
and some degree of variability between studies is to be expected (Thompson and Pocock
1991). Some authors recommend the adoption of a random-effects model unless there
are compelling reasons for doing otherwise, irrespective of the outcome of the test for
heterogeneity (Brockwell and Gordon 2001).
However, even though random-effects methods model heterogeneity, the performance
of the ML models (ML, REML, and PL) in situations where the true effects violate the
assumptions of a normal distribution may not be optimal (Brockwell and Gordon 2001;
Hardy and Thompson 1998; Böhning et al. 2002; Sidik and Jonkman 2007). The num-
ber of studies in the analysis is also an issue, because most meta-analysis models (includ-
ing DL, ML, REML, and PL, but not PE) are only asymptotically correct; that is, they
provide the theoretical 95% coverage only as the number of studies increases (approaches
infinity). Method performance is therefore affected when the number of studies is small,
but the extent depends on the model (some are more susceptible), along with the degree
of heterogeneity and the distribution of the true effects (Brockwell and Gordon 2001).
4 Example
As an example, we apply the metaan command to health-risk outcome data from seven
studies. The information was collected for an unpublished meta-analysis, and the data
are available from the authors. Using the describe and list commands, we provide
details of the dataset and proceed to perform a univariate meta-analysis with metaan.
. use metaan_example
. describe
Contains data from metaan_example.dta
obs: 7
vars: 4 19 Apr 2010 12:19
size: 560 (99.9% of memory free)
storage display value
variable name type format label variable label
study str16 %16s First author and year
outcome str48 %35s Outcome description
effsize float %9.0g effect sizes
se float %9.0g SE of the effect sizes
Sorted by: study outcome
. list study outcome effsize se, noobs clean
study outcome effsize se
Bakx A, 1985 Serum cholesterol (mmol/L) -.3041526 .0958199
Campbell A, 1998 Diet .2124063 .0812414
Cupples, 1994 BMI .0444239 .090661
Eckerlund SBP -.3991309 .12079
Moher, 2001 Cholesterol (mmol/l) -.9374746 .0691572
Woolard A, 1995 Alcohol intake (g/week) -.3098185 .206331
Woolard B, 1995 Alcohol intake (g/week) -.4898825 .2001602
. metaan effsize se, pl label(study) forest
Profile Likelihood method selected
Study Effect [95% Conf. Interval] % Weight
Bakx A, 1985 -0.304 -0.492 -0.116 15.09
Campbell A, 1998 0.212 0.053 0.372 15.40
Cupples, 1994 0.044 -0.133 0.222 15.20
Eckerlund -0.399 -0.636 -0.162 14.49
Moher, 2001 -0.937 -1.073 -0.802 15.62
Woolard A, 1995 -0.310 -0.714 0.095 12.01
Woolard B, 1995 -0.490 -0.882 -0.098 12.19
Overall effect (pl) -0.308 -0.622 0.004 100.00
ML method succesfully converged
PL method succesfully converged for both upper and lower CI limits
Heterogeneity Measures
value df p-value
Cochrane Q 139.81 6 0.000
I^2 (%) 91.96
H^2 11.44
value [95% Conf. Interval]
tau^2 est 0.121 0.000 0.449
Estimate obtained with Maximum likelihood - Profile likelihood provides the CI
PL method succesfully converged for both upper and lower CI limits of the tau^2
> estimate
The PL model used in the example converged successfully, as did ML, whose convergence
is a prerequisite. The overall effect is not found to be significant at the 95% level,
and there is considerable heterogeneity across studies, according to the measures. The
model also displays a 95% CI for the between-studies variance estimate τ² (provided
that convergence is achieved, as is the case in this example). The forest plot created by
the command is displayed in figure 1.
[Forest plot: study-specific effect sizes with CIs and the overall PL effect; original
weights (squares) displayed; largest to smallest weight ratio: 1.30.]
Figure 1. Forest plot displaying PL meta-analysis
When we reexecute the analysis with the plplot(mu) and plplot(tsq) options, we
obtain the log-likelihood function plots shown in figures 2 and 3.
[Plot: log likelihood versus τ values, for μ fixed to the ML/PL estimate.]
Figure 2. Log-likelihood function plot for μ fixed to the model estimate
[Plot: log likelihood versus μ values, for τ fixed to the ML/PL estimate.]
Figure 3. Log-likelihood function plot for τ² fixed to the model estimate
5 Discussion
The metaan command can be a useful meta-analysis tool that includes newer and, in
certain circumstances, better-performing models than the standard DL random-effects
model. Unpublished results exploring model performance in various scenarios are avail-
able from the authors. Future work will involve implementing more models in the
metaan command and embellishing the forest plot.
6 Acknowledgments
We would like to thank the authors of meta and metan for all their work and the
anonymous reviewer whose useful comments improved the article considerably.
7 References
Böhning, D., U. Malzahn, E. Dietz, P. Schlattmann, C. Viwatwongkasem, and A. Big-
geri. 2002. Some general points in estimating heterogeneity variance with the
DerSimonian–Laird estimator. Biostatistics 3: 445–457.
Brockwell, S. E., and I. R. Gordon. 2001. A comparison of statistical methods for
meta-analysis. Statistics in Medicine 20: 825–840.
DerSimonian, R., and N. Laird. 1986. Meta-analysis in clinical trials. Controlled Clinical
Trials 7: 177–188.
Follmann, D. A., and M. A. Proschan. 1999. Valid inference in random effects meta-
analysis. Biometrics 55: 732–737.
Harbord, R. M., and J. P. T. Higgins. 2008. Meta-regression in Stata. Stata Journal 8:
493–519.
Hardy, R. J., and S. G. Thompson. 1996. A likelihood approach to meta-analysis with
random effects. Statistics in Medicine 15: 619–629.
Hardy, R. J., and S. G. Thompson. 1998. Detecting and describing heterogeneity in
meta-analysis. Statistics in Medicine 17: 841–856.
Harris, R. J., M. J. Bradburn, J. J. Deeks, R. M. Harbord, D. G. Altman, and J. A. C.
Sterne. 2008. metan: Fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.
Higgins, J. P. T., and S. Green. 2006. Cochrane Handbook for Systematic Reviews of
Interventions Version 4.2.6.
http://www2.cochrane.org/resources/handbook/Handbook4.2.6Sep2006.pdf.
Higgins, J. P. T., and S. Green. 2009. Cochrane Handbook for Systematic Reviews of
Interventions Version 5.0.2. http://www.cochrane-handbook.org/.
Higgins, J. P. T., S. G. Thompson, J. J. Deeks, and D. G. Altman. 2003. Measuring
inconsistency in meta-analyses. British Medical Journal 327: 557–560.
Huque, M. F. 1988. Experiences with meta-analysis in NDA submissions. Proceedings
of the Biopharmaceutical Section of the American Statistical Association 2: 28–33.
Kontopantelis, E., and D. Reeves. 2009. MetaEasy: A meta-analysis add-in for Microsoft
Excel. Journal of Statistical Software 30: 1–25.
Lambert, P. C., A. J. Sutton, K. R. Abrams, and D. R. Jones. 2002. A comparison
of summary patient-level covariates in meta-regression with individual patient data
meta-analysis. Journal of Clinical Epidemiology 55: 86–94.
Mittlböck, M., and H. Heinzl. 2006. A simulation study comparing properties of het-
erogeneity measures in meta-analyses. Statistics in Medicine 25: 4321–4333.
Normand, S.-L. T. 1999. Meta-analysis: Formulating, evaluating, combining, and re-
porting. Statistics in Medicine 18: 321–359.
Olkin, I., and A. Sampson. 1998. Comparison of meta-analysis versus analysis of variance
of individual patient data. Biometrics 54: 317–322.
Sidik, K., and J. N. Jonkman. 2007. A comparison of heterogeneity variance estimators
in combining results of studies. Statistics in Medicine 26: 1964–1981.
Stewart, L. A. 1995. Practical methodology of meta-analyses (overviews) using updated
individual patient data. Statistics in Medicine 14: 2057–2079.
Thompson, S. G. 1994. Systematic review: Why sources of heterogeneity in meta-
analysis should be investigated. British Medical Journal 309: 1351–1355.
Thompson, S. G., and S. J. Pocock. 1991. Can meta-analyses be trusted? Lancet 338:
1127–1130.
Thompson, S. G., and S. J. Sharp. 1999. Explaining heterogeneity in meta-analysis: A
comparison of methods. Statistics in Medicine 18: 2693–2708.
White, I. R. 2009. Multivariate random-effects meta-analysis. Stata Journal 9: 40–56.
About the authors
Evangelos (Evan) Kontopantelis is a research fellow in statistics at the National Primary Care
Research and Development Centre, University of Manchester, England. His research interests
include statistical methods in health sciences with a focus on meta-analysis, longitudinal data
modeling, and large clinical database management.
David Reeves is a senior research fellow in statistics at the Health Sciences Primary Care
Research Group, University of Manchester, England. David has worked as a statistician in
health services research for nearly three decades, mainly in the fields of learning disability
and primary care. His methodological research interests include the robustness of statistical
methods, the analysis of observational studies, and applications of social network analysis
methods to health systems.
The Stata Journal (2010)
10, Number 3, pp. 408–422
Regression analysis of censored data using
pseudo-observations
Erik T. Parner
University of Aarhus
Aarhus, Denmark
parner@biostat.au.dk
Per K. Andersen
University of Copenhagen
Copenhagen, Denmark
P.K.Andersen@biostat.ku.dk
Abstract. We draw upon a series of articles in which a method based on pseu-
dovalues is proposed for direct regression modeling of the survival function, the
restricted mean, and the cumulative incidence function in competing risks with
right-censored data. The models, once the pseudovalues have been computed, can
be fit using standard generalized estimating equation software. Here we present
Stata procedures for computing these pseudo-observations. An example from a
bone marrow transplantation study is used to illustrate the method.
Keywords: st0202, stpsurv, stpci, stpmean, pseudovalues, time-to-event, survival
analysis
1 Introduction
Statistical methods in survival analysis need to deal with data that are incomplete
because of right-censoring; a host of such methods are available, including the Kaplan
Meier estimator, the log-rank test, and the Cox regression model. If one had complete
data, standard methods for quantitative data could be applied directly for the observed
survival time X, or methods for binary outcomes could be applied by dichotomizing
X as I(X > ) for a suitably chosen . With complete data, one could furthermore
set up regression models for any function f(X) and check such models using standard
graphical methods such as scatterplots or residuals for quantitative or binary outcomes.
One way of achieving these goals with censored survival data and with more-general
event history data (for example, competing-risks data) is to use a technique based on
pseudo-observations, as recently described in a series of articles. Thus the technique
has been studied in modeling of the survival function (Klein et al. 2007), the restricted
mean (Andersen, Hansen, and Klein 2004), and the cumulative incidence function in
competing risks (Andersen, Klein, and Rosthj 2003; Klein and Andersen 2005; Klein
2006; Andersen and Klein 2007).
The basic idea is simple. Suppose a well-behaved estimator θ̂ for the expectation
θ = E{f(X)} is available (for example, the Kaplan–Meier estimator for S(t) =
E{I(X > t)}) based on a sample of size n. The ith pseudo-observation (i = 1, ..., n)
for f(X) is then defined as

    θ̂_i = n θ̂ − (n − 1) θ̂^{−i}

where θ̂^{−i} is the estimator applied to the sample of size n − 1 that is obtained by
eliminating the ith observation from the dataset. The pseudovalues are generated once,
and the idea is to replace the incompletely observed f(X_i) by θ̂_i. That is, θ̂_i may be
used as an outcome variable in a regression model, or it may be used to compute
residuals. θ̂_i also may be used in a scatterplot when assessing model assumptions
(Perme and Andersen 2008; Andersen and Perme 2010). The intuition is that, in the
absence of censoring, θ = E{f(X)} could, obviously, be estimated as (1/n) Σ_i f(X_i),
in which case the ith pseudo-observation is simply the observed value f(X_i). The
pseudovalues are related to the jackknife residuals used in regression diagnostics.
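To make the definition concrete, here is a brute-force sketch (in do-file form) of the
leave-one-out pseudo-observations for S(530); it assumes the data are already stset
as in section 4.1 and that every leave-one-out sample has an observed time at or before
day 530. This is exactly what stpsurv, at(530) automates far more efficiently; all
variable and scalar names are illustrative:

sts generate S_all = s                     // full-sample Kaplan-Meier estimates
summarize S_all if _t <= 530, meanonly
scalar Sfull = r(min)                      // KM estimate of S(530)
drop S_all
local n = _N
generate double pseudo530 = .
forvalues i = 1/`n' {
    preserve
    quietly drop in `i'                    // leave out observation i
    quietly sts generate S_i = s
    summarize S_i if _t <= 530, meanonly
    scalar Si = r(min)                     // leave-one-out estimate of S(530)
    restore
    quietly replace pseudo530 = `n'*Sfull - (`n'-1)*Si in `i'
}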
We present three new Stata commands (stpsurv, stpci, and stpmean) that pro-
vide a new possibility in Stata for analyzing regression models. They generate pseu-
dovalues, respectively, for the survival function (or the cumulative distribution func-
tion, the cumulative incidence) under right-censoring, for the cumulative incidence
in competing risks, and for the restricted mean under right-censoring. Cox regression
models can be fit using the pseudovalue function for survival probabilities in several
time points. Thereby, the pseudovalue method provides an alternative to Cox regres-
sion, for example, in situations where rates are not proportional. As discussed by
Perme and Andersen (2008), residuals for model checking may also be obtained from
the pseudovalues. An example based on bone marrow transplantation data is presented
to illustrate the methodology.
In section 2, we briefly present the general pseudovalue approach to censored data
regression. In section 3, we present the new Stata commands; and in section 4, we show
examples of the use of the commands. Section 5 concludes with some remarks.
2 Some methodological details
2.1 The general approach
In this section, we briefly introduce censored data regression based on pseudo-obser-
vations; see, for example, Andersen, Klein, and Rosthøj (2003) or Andersen and Perme
(2010) for more details. Let X_1, ..., X_n be independent and identically distributed
survival times, and suppose we are interested in a parameter of the form

    θ = E{f(X)}

for some function f(·). This function could be multivariate, for example,

    f(X) = {f_1(X), ..., f_M(X)} = {I(X > τ_1), ..., I(X > τ_M)}

for a series of time points τ_1, ..., τ_M, in which case,

    θ = (θ_1, ..., θ_M) = {S(τ_1), ..., S(τ_M)}

where S(·) is the survival function for X. More examples are provided below. Fur-
thermore, let Z_1, ..., Z_n be independent and identically distributed covariates. Also
suppose we are interested in a regression model of θ = E{f(X_i)} on Z_i, for example,
a generalized linear model of the form

    g[E{f(X_i) | Z_i}] = β^T Z_i
where g(·) is the link function. If right-censoring prevents us from observing all the
X_i's, then it is not simple to analyze this regression model. However, suppose θ̂ is
an approximately unbiased estimator of the marginal mean θ = E{f(X)} that may
be computed from the sample of right-censored observations. If f(X) = I(X > τ),
then θ = S(τ) may be estimated using the Kaplan–Meier estimator. The ith pseudo-
observation is now defined, as suggested in section 1, as

    θ̂_i = n θ̂ − (n − 1) θ̂^{−i}

Here θ̂^{−i} is the leave-one-out estimator for θ based on all observations but the ith:
X_j, j ≠ i. The idea is to replace the possibly incompletely observed f(X_i) by θ̂_i and
to obtain estimates of the βs based on the estimating equation

    Σ_i {∂g^{−1}(β^T Z_i)/∂β}^T V_i^{−1} {θ̂_i − g^{−1}(β^T Z_i)} = Σ_i U_i(β) = U(β) = 0    (1)
In (1), V_i is a working covariance matrix. Graw, Gerds, and Schumacher (2009) showed
that for the examples studied in this article, E{f(X_i) | Z_i} = E(θ̂_i | Z_i), and thereby
(1) is unbiased, provided that censoring is independent of covariates; see also Andersen
and Perme (2010). A sandwich estimator is used to estimate the variance of β̂. Let

    I(β) = Σ_i {∂g^{−1}(β^T Z_i)/∂β}^T V_i^{−1} {∂g^{−1}(β^T Z_i)/∂β}

and

    Var{U(β̂)} = Σ_i U_i(β̂) U_i(β̂)^T

Then

    Var(β̂) = I(β̂)^{−1} Var{U(β̂)} I(β̂)^{−1}

The estimator of β can be shown to be asymptotically normal (Graw, Gerds, and
Schumacher 2009; Liang and Zeger 1986), and the sandwich estimator converges in
probability to the true variance. Once the pseudo-observations have been computed,
the estimators of β can be obtained by using standard software for generalized
estimating equations.
The pseudo-observations may also be used to define residuals after fitting some
standard model (for example, a Cox regression model) for survival data; see Perme and
Andersen (2008) or Andersen and Perme (2010).
2.2 The survival function
Suppose we are interested in the survival function S(τ_j) = Pr(X > τ_j) at a grid of
time points τ_1 < ··· < τ_M, for a survival time X. Hence, θ = (θ_1, ..., θ_M) where
θ_j = S(τ_j). When M = 1, we consider the survival function at a single point in time.
Under right-censoring, the survival function is estimated by the Kaplan–Meier estimator
(Kaplan and Meier 1958),

    Ŝ(t) = Π_{t_j ≤ t} (Y_j − d_j)/Y_j

where t_1 < ··· < t_D are the distinct event times, Y_j is the number at risk, and d_j is
the number of events at time t_j. The cumulative distribution function is then estimated
by F̂(t) = 1 − Ŝ(t). In this case, the link function of interest could be the cloglog
function

    cloglog{F(τ)} = log[−log{1 − F(τ)}]

which is equivalent to a Cox regression model for the survival function evaluated in τ.
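To make this equivalence explicit (a standard argument, supplied here for clarity):
under a proportional hazards model, S(τ | Z) = S_0(τ)^{exp(β^T Z)}, so

    log[−log{S(τ | Z)}] = log[−log{S_0(τ)}] + β^T Z

That is, a generalized linear model for the failure probability F(τ) = 1 − S(τ) with the
cloglog link has an intercept log[−log{S_0(τ)}] and slopes β on the log hazard-ratio
scale; this is exactly the structure fit in section 4.2.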
2.3 The mean survival time
The mean time-to-event is the area under the survival curve:

    μ = ∫_0^∞ S(u) du    (2)

For right-censored data, the estimated survival function (the Kaplan–Meier estimator)
does not always converge down to zero. Then the mean cannot be estimated reliably
by plugging the Kaplan–Meier estimator into (2). An alternative to the mean is the
restricted mean, defined as the area under the survival curve up to a time τ < ∞
(Klein and Moeschberger 2003), which is equal to

    μ_τ = ∫_0^τ S(u) du

An alternative mean is the conditional mean given that the event time is smaller
than τ,

    μ_τ^c = ∫_0^τ {S(u) − S(τ)}/{1 − S(τ)} du

For the restricted and conditional mean, a link function of interest could be the log or
the identity.
2.4 The cumulative incidence
Under competing risks, the cumulative incidence function is estimated in a different
way. Suppose the event of interest has hazard function h_1(t) and the competing risk
has hazard function h_2(t). The cumulative incidence function for the event of interest
is then given as

    F_1(t) = ∫_0^t h_1(u) exp[−∫_0^u {h_1(v) + h_2(v)} dv] du

If t_1 < ··· < t_D are the distinct times of the primary event and the competing risk
combined, Y_j is the number at risk, d_1j is the number of primary events at time t_j,
and d_2j is the number of competing-risk events at time t_j, then the cumulative
incidence function of the primary event is estimated by

    F̂_1(t) = Σ_{t_j ≤ t} (d_1j/Y_j) Π_{t_i < t_j} {Y_i − (d_1i + d_2i)}/Y_i

Again the link function of interest could be cloglog, corresponding to the regression
model for the competing-risks cumulative incidence studied by Fine and Gray (1999).
3 The stpsurv, stpmean, and stpci commands
3.1 Syntax
Pseudovalues for the survival function, the mean survival time, and the cumulative
incidence function for competing risks are generated using the following syntaxes:

    stpsurv [if] [in], at(numlist) [generate(string) failure]

    stpmean [if] [in], at(numlist) [generate(string) conditional]

    stpci varname [if] [in], at(numlist) [generate(string)]
stpsurv, stpmean, and stpci are for use with st data. You must, therefore, stset
your data before issuing these commands. Frequency weights are allowed in the stset
command. In the stpci command for the cumulative incidence function in competing
risks, an indicator variable for the competing risks should always be specified. The
pseudovalues are by default stored in the pseudo variable when one time point is speci-
fied and are stored in variables pseudo1, pseudo2, ... when several time points are
specified. The names of the pseudovariables are changed by the generate() option.
3.2 Options
at(numlist) specifies the time points, in ascending order, at which pseudovalues should
be computed. at() is required.
generate(string) specifies a variable name for the pseudo-observations. The default is
generate(pseudo).
failure generates pseudovalues for the cumulative incidence proportion, which is one
minus the survival function.
conditional specifies that pseudovalues for the conditional mean should be computed
instead of those for the restricted mean.
4 Example data
To illustrate the pseudovalue approach, we use data on sibling-donor bone marrow
transplants matched on human leukocyte antigen (Copelan et al. 1991). The data
are available in Klein and Moeschberger (2003). The data include information on
137 transplant patients on time to death, relapse, or loss to follow-up (tdfs); the
indicators of relapse and death (relapse, trm); the indicator of treatment failure
(dfs = relapse | trm); and three factors that may be related to outcome: disease
[acute lymphocytic leukemia (ALL), low-risk acute myeloid leukemia (AML), and high-
risk AML], the French–American–British (FAB) disease grade for AML (fab = 1 if AML
and grade 4 or 5; 0 otherwise), and recipient age at transplant (age).
4.1 The survival function at a single time point
We will first examine regression models for disease free survival at 530 days based on
the Kaplan–Meier estimator. Disease free survival probabilities for the single prognostic
factor FAB at 530 days (figure 1) can be compared using information obtained using the
Stata sts list command, which evaluates the Kaplan–Meier estimator.
[Plot: Kaplan–Meier estimates of the probability of disease free survival versus time
(days), by FAB group (Fab=1 and Fab=0).]
Figure 1. Disease free survival
Based on the sts list output below, the risk difference (RD) for FAB is computed
as RD = 0.333 − 0.541 = −0.207 [95% confidence interval: −0.379, −0.039], and the
relative risk (RR) for FAB is RR = 0.333/0.541 = 0.616, where FAB = 0 is chosen as the
reference group. The confidence interval of the RD is based on computing the standard
error of the RD as (0.0522² + 0.0703²)^{1/2}. The confidence interval for the RR is not
easily estimated using the information from the sts list command.
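These hand computations are easily reproduced (a sketch using the rounded values
above; the article's limits use unrounded estimates, hence the small discrepancy in the
upper CI limit):

. display 0.3333 - 0.5408                                // RD
. display sqrt(0.0522^2 + 0.0703^2)                      // SE of the RD
. display -0.2075 - 1.96*0.0876, -0.2075 + 1.96*0.0876   // approximate 95% CI
. display 0.3333/0.5408                                  // RR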
. use bmt
. stset tdfs, failure(dfs==1)
failure event: dfs == 1
obs. time interval: (0, tdfs]
exit on or before: failure
137 total obs.
0 exclusions
137 obs. remaining, representing
83 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. sts list, at(0 530) by(fab)
failure _d: dfs == 1
analysis time _t: tdfs
Beg. Survivor Std.
Time Total Fail Function Error [95% Conf. Int.]
fab=0
0 0 0 1.0000 . . .
530 49 42 0.5408 0.0522 0.4334 0.6364
fab=1
0 0 0 1.0000 . . .
530 16 30 0.3333 0.0703 0.2018 0.4704
Note: survivor function is calculated over full data and evaluated at
indicated times; it is not calculated from aggregates shown at left.
Now we turn to the pseudovalues approach. We start by computing the pseudovalues
at 530 days using the stpsurv command. The pseudovalues are stored in the pseudo
variable.
. stpsurv, at(530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
The pseudovalues are analyzed in generalized linear models with an identity link
function and a log link function, respectively.
. glm pseudo i.fab, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -96.989802
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
1.fab -.2080377 .0881073 -2.36 0.018 -.3807248 -.0353506
_cons .5406774 .0522411 10.35 0.000 .4382867 .6430681
. glm pseudo i.fab, link(log) vce(robust) eform noheader
Iteration 0: log pseudolikelihood = -123.14846
Iteration 1: log pseudolikelihood = -101.53512
Iteration 2: log pseudolikelihood = -96.991808
Iteration 3: log pseudolikelihood = -96.989802
Iteration 4: log pseudolikelihood = -96.989802
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
1.fab .6152278 .1440588 -2.07 0.038 .3887968 .9735298
The generalized linear models with an identity link function and a log link function
fit the relations

    p_i = E(X_i) = β_0 + β_1 FAB_i
    log(p_i) = log{E(X_i)} = β_0 + β_1 FAB_i

respectively, where p_i = S_i(530) is the disease free survival probability at 530 days
for individual i. Hence, based on the pseudovalues approach, we estimate the RD for
FAB by RD = −0.208 [95% confidence interval: −0.381, −0.035] and the RR for FAB by
RR = 0.615 [95% confidence interval: 0.389, 0.974]. The results are very similar to the
direct computation from the Kaplan–Meier estimator using the sts list command. We
now obtain the confidence interval for the RR.
Suppose we wish to compute the RR for FAB, adjusting for disease as a categorical
variable and age as a continuous variable. Using the same pseudovalues, we fit the
generalized linear model.
. glm pseudo i.fab i.disease age, link(log) vce(robust) eform noheader
Iteration 0: log pseudolikelihood = -114.83229
Iteration 1: log pseudolikelihood = -93.440112
Iteration 2: log pseudolikelihood = -88.620704
Iteration 3: log pseudolikelihood = -88.601028
Iteration 4: log pseudolikelihood = -88.601013
Iteration 5: log pseudolikelihood = -88.601013
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
1.fab .6322634 .1665066 -1.74 0.082 .3773412 1.059405
disease
2 1.951343 .412121 3.17 0.002 1.289914 2.951931
3 1.005533 .3586364 0.02 0.988 .4998088 2.022965
age .9856265 .0080274 -1.78 0.075 .970018 1.001486
Patients with AML and grade 4 or 5 (FAB = 1) have a 37% reduced disease free
survival probability (RR = 0.632) at 530 days, when adjusting for disease and age.
4.2 The survival function at several time points
In this example, we compute pseudovalues at five time points roughly equally spaced on
the event scale: 50, 105, 170, 280, and 530 days. To fit the model

    log[−log{S(t | Z)}] = log{Λ_0(t)} + βZ

we can use the cloglog link on the pseudovalues for the failure probabilities; that is, we
fit a Cox regression model for the five time points simultaneously.
. stpsurv, at(50 105 170 280 530) failure
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long
Number of obs. 137 -> 685
Number of variables 32 -> 29
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo
. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id) noheader
Iteration 0: log pseudolikelihood = -468.74476
Iteration 1: log pseudolikelihood = -457.41878 (not concave)
Iteration 2: log pseudolikelihood = -406.98781
Iteration 3: log pseudolikelihood = -365.23278
Iteration 4: log pseudolikelihood = -350.7435
Iteration 5: log pseudolikelihood = -349.97156
Iteration 6: log pseudolikelihood = -349.96409
Iteration 7: log pseudolikelihood = -349.96409
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
times
2 1.114256 .3269323 3.41 0.001 .4734805 1.755032
3 1.626173 .3567925 4.56 0.000 .9268721 2.325473
4 2.004267 .3707305 5.41 0.000 1.277649 2.730885
5 2.495327 .3824645 6.52 0.000 1.745711 3.244944
1.fab .7619547 .354821 2.15 0.032 .0665183 1.457391
disease
2 -1.195542 .4601852 -2.60 0.009 -2.097489 -.2935959
3 .0036343 .3791488 0.01 0.992 -.7394838 .7467524
age .0130686 .0146629 0.89 0.373 -.0156702 .0418074
_cons -2.981582 .6066311 -4.91 0.000 -4.170557 -1.792607
The estimated survival function in this model for a patient at time t with a set of
covariates Z is Ŝ(t) = exp{−Λ̂_0(t) e^{βZ}}, where

    Λ̂_0(50)  = exp(−2.9816) = 0.051
    Λ̂_0(105) = exp(−2.9816 + 1.1143) = 0.155
    Λ̂_0(170) = exp(−2.9816 + 1.6262) = 0.258
    Λ̂_0(280) = exp(−2.9816 + 2.0043) = 0.376
    Λ̂_0(530) = exp(−2.9816 + 2.4953) = 0.615
The model shows that patients with AML who are at low risk have better disease
free survival than ALL patients [RR = exp(−1.1955) = 0.30] and that AML patients with
grade 4 or 5 FAB have a lower disease free survival [RR = exp(0.7620) = 2.14].
Without recomputing the pseudovalues, we can examine the effect of FAB over time.
. generate fab50=(fab==1 & times==1)
. generate fab105=(fab==1 & times==2)
. generate fab170=(fab==1 & times==3)
. generate fab280=(fab==1 & times==4)
. generate fab530=(fab==1 & times==5)
. glm pseudo i.times fab50-fab530 i.disease age, link(cloglog) vce(cluster id)
> noheader eform
Iteration 0: log pseudolikelihood = -471.86839
Iteration 1: log pseudolikelihood = -464.24832 (not concave)
Iteration 2: log pseudolikelihood = -406.31257
Iteration 3: log pseudolikelihood = -361.28364
Iteration 4: log pseudolikelihood = -349.90468
Iteration 5: log pseudolikelihood = -349.44613
Iteration 6: log pseudolikelihood = -349.43492
Iteration 7: log pseudolikelihood = -349.43485
Iteration 8: log pseudolikelihood = -349.43485
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
2 3.99608 2.023867 2.74 0.006 1.480921 10.78292
3 8.225489 4.601898 3.77 0.000 2.747526 24.62531
4 11.89654 6.835021 4.31 0.000 3.858093 36.68333
5 19.20116 11.25862 5.04 0.000 6.084498 60.59409
fab50 4.047315 3.227324 1.75 0.080 .8480474 19.31586
fab105 2.866106 1.433666 2.11 0.035 1.07525 7.639677
fab170 2.008426 .795497 1.76 0.078 .9240856 4.365155
fab280 2.022028 .7258472 1.96 0.050 1.000533 4.086419
fab530 2.048864 .7838364 1.87 0.061 .9679838 4.33669
disease
2 .3024683 .1368087 -2.64 0.008 .1246451 .7339808
3 .9993425 .3815547 -0.00 0.999 .4728471 2.112069
age 1.012745 .0148835 0.86 0.389 .9839899 1.04234
. test fab50=fab105=fab170=fab280=fab530
( 1) [pseudo]fab50 - [pseudo]fab105 = 0
( 2) [pseudo]fab50 - [pseudo]fab170 = 0
( 3) [pseudo]fab50 - [pseudo]fab280 = 0
( 4) [pseudo]fab50 - [pseudo]fab530 = 0
chi2( 4) = 1.73
Prob > chi2 = 0.7855
The model shows that there is no statistically significant difference in the FAB effect
over time (p = 0.79); that is, proportional hazards are not contraindicated for FAB.
4.3 The restricted mean
For the restricted mean time to treatment failure, we use the stpmean command. To
illustrate, we look at a regression model for the mean time to treatment failure restricted
to 1,500 days. Here we use the identity link function.
. stpmean, at(1500)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variable: pseudo
. glm pseudo i.fab i.disease age, link(id) vce(robust) noheader
Iteration 0: log pseudolikelihood = -1065.6767
Robust
pseudo Coef. Std. Err. z P>|z| [95% Conf. Interval]
1.fab -352.0442 123.311 -2.85 0.004 -593.7293 -110.359
disease
2 461.1214 134.0932 3.44 0.001 198.3036 723.9391
3 78.00616 158.8357 0.49 0.623 -233.3061 389.3184
age -8.169236 5.060915 -1.61 0.106 -18.08845 1.749976
_cons 895.118 159.1586 5.62 0.000 583.173 1207.063
Here we see that low-risk AML patients have the longest restricted mean life, namely,
461.1 days longer than ALL patients within 1,500 days.
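As a worked reading of the identity-link fit (with illustrative covariate values): for a
30-year-old low-risk AML patient without FAB grade 4/5 disease, the predicted restricted
mean time to treatment failure is 895.1 + 461.1 − 8.169 × 30 ≈ 1111 of the first 1,500
days:

. display 895.118 + 461.121 - 8.169*30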
4.4 Competing risks
For the cumulative incidence function, we use the stpci command to compute the
pseudovalues. To illustrate, we apply the complementary log-log model to the cumulative
incidence of death in remission evaluated at 50, 105, 170, 280, and 530 days. The event
of interest is death in remission; relapse is a competing event.
. stset tdfs, failure(trm==1)
failure event: trm == 1
obs. time interval: (0, tdfs]
exit on or before: failure
137 total obs.
0 exclusions
137 obs. remaining, representing
42 failures in single record/single failure data
107138 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 2640
. generate compet=(trm==0 & relapse==1)
. stpci compet, at(50 105 170 280 530)
Computing pseudo observations (progress dots indicate percent completed)
1 2 3 4 5
.................................................. 50
.................................................. 100
Generated pseudo variables: pseudo1-pseudo5
. generate id=_n
. reshape long pseudo, i(id) j(times)
(note: j = 1 2 3 4 5)
Data wide -> long
Number of obs. 137 -> 685
Number of variables 33 -> 30
j variable (5 values) -> times
xij variables:
pseudo1 pseudo2 ... pseudo5 -> pseudo
. fvset base none times
. glm pseudo i.times i.fab i.disease age, link(cloglog) vce(cluster id)
> noheader noconst eform
Iteration 0: log pseudolikelihood = -462.96735 (not concave)
Iteration 1: log pseudolikelihood = -348.27329
Iteration 2: log pseudolikelihood = -221.69131
Iteration 3: log pseudolikelihood = -198.31467
Iteration 4: log pseudolikelihood = -197.38196
Iteration 5: log pseudolikelihood = -197.37526
Iteration 6: log pseudolikelihood = -197.37524
(Std. Err. adjusted for 137 clusters in id)
Robust
pseudo exp(b) Std. Err. z P>|z| [95% Conf. Interval]
times
1 .0286012 .0292766 -3.47 0.001 .0038467 .21266
2 .0791623 .0547411 -3.67 0.000 .0204131 .306993
3 .1261608 .0823572 -3.17 0.002 .0350965 .4535083
4 .1781601 .1117597 -2.75 0.006 .0521017 .6092124
5 .2383869 .1488814 -2.30 0.022 .0700932 .8107537
1.fab 3.104153 1.52811 2.30 0.021 1.182808 8.146518
disease
2 .1708985 .1154623 -2.61 0.009 .0454622 .6424309
3 .7829133 .466016 -0.41 0.681 .2438093 2.514068
age 1.014382 .0258272 0.56 0.575 .9650037 1.066286
Here we are modeling C(t | Z) = 1 − exp{−Λ_0(t) e^{β′Z}}. Positive values of β for a
covariate suggest a larger cumulative incidence for patients with Z = 1. The model
suggests that the low-risk AML patients have the smallest risk of death in remission
and the AML FAB 4/5 patients have the highest risk of death in remission.
5 Conclusion
The pseudovalue method is a versatile tool for regression analysis of censored
time-to-event data. We have implemented the method for regression analysis of the
survival function under right-censoring, for the cumulative incidence function under
possible competing risks, and for the restricted and conditional mean waiting time.
Similar SAS macros and R functions were presented by Klein et al. (2008).
6 References

Andersen, P. K., M. G. Hansen, and J. P. Klein. 2004. Regression analysis of restricted
mean survival time based on pseudo-observations. Lifetime Data Analysis 10: 335–350.

Andersen, P. K., and J. P. Klein. 2007. Regression analysis for multistate models
based on a pseudo-value approach, with applications to bone marrow transplantation
studies. Scandinavian Journal of Statistics 34: 3–16.

Andersen, P. K., J. P. Klein, and S. Rosthøj. 2003. Generalised linear models for
correlated pseudo-observations, with applications to multi-state models. Biometrika
90: 15–27.

Andersen, P. K., and M. P. Perme. 2010. Pseudo-observations in survival analysis.
Statistical Methods in Medical Research 19: 71–99.

Copelan, E. A., J. C. Biggs, J. M. Thompson, P. Crilley, J. Szer, J. P. Klein, N. Kapoor,
B. R. Avalos, I. Cunningham, K. Atkinson, K. Downs, G. S. Harmon, M. B. Daly,
I. Brodsky, S. I. Bulova, and P. J. Tutschka. 1991. Treatment for acute myelocytic
leukemia with allogeneic bone marrow transplantation following preparation with
BuCy2. Blood 78: 838–843.

Fine, J. P., and R. J. Gray. 1999. A proportional hazards model for the subdistribution
of a competing risk. Journal of the American Statistical Association 94: 496–509.

Graw, F., T. A. Gerds, and M. Schumacher. 2009. On pseudo-values for regression
analysis in competing risks models. Lifetime Data Analysis 15: 241–255.

Kaplan, E. L., and P. Meier. 1958. Nonparametric estimation from incomplete
observations. Journal of the American Statistical Association 53: 457–481.

Klein, J. P. 2006. Modeling competing risks in cancer studies. Statistics in Medicine
25: 1015–1034.

Klein, J. P., and P. K. Andersen. 2005. Regression modeling of competing risks data
based on pseudovalues of the cumulative incidence function. Biometrics 61: 223–229.

Klein, J. P., M. Gerster, P. K. Andersen, S. Tarima, and M. P. Perme. 2008. SAS and R
functions to compute pseudo-values for censored data regression. Computer Methods
and Programs in Biomedicine 89: 289–300.

Klein, J. P., B. Logan, M. Harhoff, and P. K. Andersen. 2007. Analyzing survival curves
at a fixed point in time. Statistics in Medicine 26: 4505–4519.

Klein, J. P., and M. L. Moeschberger. 2003. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd ed. New York: Springer.

Liang, K.-Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear
models. Biometrika 73: 13–22.

Perme, M. P., and P. K. Andersen. 2008. Checking hazard regression models using
pseudo-observations. Statistics in Medicine 27: 5309–5328.
About the authors

Erik T. Parner has a PhD in statistics from the University of Aarhus. He is an associate
professor of biostatistics at the University of Aarhus. His research fields are time-to-event
analysis, statistical methods in epidemiology and genetics, and the etiology and changing
prevalence of autism.

Per K. Andersen has a PhD in statistics and a DrMedSci degree in biostatistics, both from the
University of Copenhagen. He is a professor of biostatistics at the University of Copenhagen.
His main research fields are time-to-event analysis and statistical methods in epidemiology.
The Stata Journal (2010)
10, Number 3, pp. 423–457

Estimation of quantile treatment effects with Stata

Markus Frölich
Universität Mannheim and
Institute for the Study of Labor
Bonn, Germany
froelich@uni-mannheim.de

Blaise Melly
Department of Economics
Brown University
Providence, RI
blaise_melly@brown.edu
Abstract. In this article, we discuss the implementation of various estimators
proposed to estimate quantile treatment effects. We distinguish four cases involving
conditional and unconditional quantile treatment effects with either exogenous
or endogenous treatment variables. The introduced ivqte command covers four
different estimators: the classical quantile regression estimator of Koenker and
Bassett (1978, Econometrica 46: 33–50) extended to heteroskedasticity-consistent
standard errors; the instrumental-variable quantile regression estimator of
Abadie, Angrist, and Imbens (2002, Econometrica 70: 91–117); the estimator for
unconditional quantile treatment effects proposed by Firpo (2007, Econometrica
75: 259–276); and the instrumental-variable estimator for unconditional quantile
treatment effects proposed by Frölich and Melly (2008, IZA discussion paper 3288).
The implemented instrumental-variable procedures estimate the causal effects for
the subpopulation of compliers and are only well suited for binary instruments.
ivqte also provides analytical standard errors and various options for nonparametric
estimation. As a by-product, the locreg command implements local linear
and local logit estimators for mixed data (continuous, ordered discrete, unordered
discrete, and binary regressors).

Keywords: st0203, ivqte, locreg, quantile treatment effects, nonparametric regression,
instrumental variables
1 Introduction

Ninety-five percent of applied econometrics is concerned with mean effects, yet
distributional effects are no less important. The distribution of the dependent variable
may change in many ways that are not revealed, or are only incompletely revealed, by an
examination of averages. For example, the wage distribution can become more compressed,
or the upper-tail inequality may increase while the lower-tail inequality decreases.
Therefore, applied economists and policy makers are increasingly interested in distributional
effects. The estimation of quantile treatment effects (QTEs) is a powerful and intuitive
tool that allows us to discover the effects on the entire distribution. As an alternative
motivation, median regression is often preferred to mean regression to reduce susceptibility
to outliers; the estimators presented below may thus be particularly appealing with noisy
data such as wages or earnings. In this article, we provide a brief survey of recent
developments in this literature and a description of the new ivqte command, which
implements these estimators.

© 2010 StataCorp LP st0203
Depending on the type of endogeneity of the treatment and the definition of the
estimand, we can define four different cases. We distinguish between conditional and
unconditional effects and whether selection is on observables or on unobservables.
Conditional QTEs are defined conditionally on the value of the regressors, whereas
unconditional effects summarize the causal effect of a treatment for the entire population.
Selection on observables is often referred to as a matching assumption or as exogenous
treatment choice (that is, exogenous conditional on X). In contrast, we refer to selection
on unobservables as endogenous treatment choice.

First, if we are interested in conditional QTEs and we assume that the treatment
is exogenous (conditional on X), we can use the quantile regression estimators proposed
by Koenker and Bassett (1978). Second, if we are interested in conditional
QTEs but the treatment is endogenous, the instrumental-variable (IV) estimator of
Abadie, Angrist, and Imbens (2002) may be applied. Third, for estimating unconditional
QTEs with exogenous treatment, various approaches have been suggested, for
example, Firpo (2007), Frölich (2007a), and Melly (2006). Currently, the weighting
estimator of Firpo (2007) is implemented. Finally, unconditional QTEs in the presence
of an endogenous treatment can be estimated with the technique of Frölich and Melly
(2008). The estimators for the unconditional treatment effects do not rely on any
(parametric) functional form assumptions. On the other hand, for the conditional
treatment effects, a √n convergence rate can only be obtained with a parametric
restriction. Because estimators affected by the curse of dimensionality are of less
interest to the applied economist, we will discuss only parametric (linear) estimators
for estimating conditional QTEs.
The implementation of most of these estimators requires the preliminary nonparametric
estimation of some kind of (instrument) propensity score. We use nonparametric
linear and logistic regressions to estimate these propensity scores. As a by-product, we
also offer the locreg command for researchers interested only in these nonparametric
regression estimators. We allow for different types of regressors, including continuous,
ordered discrete, unordered discrete, and binary variables. A cross-validation routine is
implemented for choosing the smoothing parameters.

This article only discusses the implementation of the proposed estimators and the
syntax of the commands. It draws heavily on the more technical discussion in the
original articles, and the reader is referred to those articles for more background on,
and formal derivations of, some of the properties of the estimators described here.
The contributions of this article and the related commands are manifold. We provide
new standardized commands for the estimators proposed in Abadie, Angrist, and Imbens
(2002);¹ Firpo (2007); and Frölich and Melly (2008); and estimators of their analytical
standard errors. For the conditional exogenous case, we provide heteroskedasticity-consistent
standard errors. The estimator of Koenker and Bassett (1978) has already
been implemented in Stata with the qreg command, but its estimated standard errors

1. Joshua Angrist provides codes in Matlab to replicate the empirical results of Abadie, Angrist, and
Imbens (2002). Our codes for this estimator partially build on his codes.
are not consistent in the presence of heteroskedasticity. The ivqte command thus
extends qreg by providing analytical standard errors that are valid under
heteroskedastic errors.

At a higher level, locreg implements nonparametric estimation with both categorical
and continuous regressors as suggested by Racine and Li (2004). Finally, we
incorporate cross-validation procedures to choose the smoothing parameters.
The next section outlines the definition of the estimands, the possible identification
approaches, and the estimators. Section 3 describes the ivqte command and
its various options, and contains simple applications to illustrate how ivqte can be
used. Appendix A describes somewhat more technical aspects of the estimation of the
asymptotic variance matrices. Appendix B describes the nonparametric estimators used
internally by ivqte and the additional locreg command.
2 Framework, assumptions, and estimators

We consider the effect of a binary treatment variable D on a continuous outcome variable
Y. Let Y_i^1 and Y_i^0 be the potential outcomes of individual i. Hence, Y_i^1 would be
realized if individual i were to receive treatment 1, and Y_i^0 would be realized otherwise.
Y_i is the observed outcome, which is Y_i ≡ Y_i^1 D_i + Y_i^0 (1 − D_i).

In this article, we identify and estimate the entire distribution functions of Y^1 and
Y^0.² Because QTEs are an intuitive way to summarize the distributional impact of a
treatment, we focus our attention especially on them.

We often observe not only the outcome and the treatment variables but also some
characteristics X.³ We can therefore define the QTEs either conditionally on the
covariates or unconditionally. In addition, we have to deal with endogenous treatment
choice. We distinguish between the case where selection is only on observables and the
case where selection is also on unobservables.
2.1 Conditional exogenous QTEs

We start with the standard model for linear quantile regression, which is a model for
conditional effects and where one assumes selection on observables. We assume that Y
is a linear function in X and D.

Assumption 1. Linear model for potential outcomes

Y_i^d = X_i′β_τ + dδ_τ + ε_{τi}  and  Q_τ^{ε_τ} = 0

for i = 1, . . . , n and d ∈ {0, 1}. Q_τ^{ε_τ} refers to the τth quantile of the unobserved
random variable ε_{τi}, and δ_τ represents the conditional QTE at quantile τ.

2. In the case with endogenous treatment, we identify the potential outcomes only for compliers, as
defined later.
3. If we do not observe covariates, then conditional and unconditional QTEs are identical and the
estimators simplify accordingly.
Clearly, this linearity assumption is not sufficient for identification of the QTEs because
the observed D_i may be correlated with the error term ε_{τi}. We assume that both D and
X are exogenous.

Assumption 2. Selection on observables with exogenous X

ε_τ ⊥ (D, X)

Assumptions 1 and 2 together imply that Q_τ^{Y|X,D} = X′β_τ + Dδ_τ, such that the
parameters can be estimated by the quantile regression estimator

(β̂_τ, δ̂_τ) = arg min_{β,δ} Σ_i ρ_τ(Y_i − X_i′β − D_i δ)   (1)

where ρ_τ(u) = u{τ − 1(u < 0)} is the check function. Equation (1) can equivalently be
written as the weighted quantile regression

(β̂_τ, δ̂_τ) = arg min_{β,δ} Σ_i W_i^{KB} ρ_τ(Y_i − X_i′β − D_i δ)

where the weights W_i^{KB} are all equal to one.
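Because the weights W_i^{KB} are all equal to one, the point estimates in this first case
are simply those of ordinary quantile regression. A minimal sketch, assuming hypothetical
variables y, x1, x2, and a binary treatment indicator d:

* Conditional exogenous QTE at the median via standard quantile regression
qreg y x1 x2 d, quantile(.5)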
2.2 Conditional endogenous QTEs

In many applications, the treatment D is self-selected and potentially endogenous. We
may not be able to observe all covariates needed to make assumption 2 valid. In this case,
the traditional quantile regression estimator will be biased, and we need to use an IV
identification strategy to recover the true effects. We assume that we observe a binary
instrument Z and can therefore define two potential treatments denoted by D^z.⁴ We
use the following IV assumption as in Abadie, Angrist, and Imbens (2002).⁵

4. If the instrument is nonbinary, it must be transformed into a binary variable. See Frölich and Melly
(2008).
5. An alternative approach is given in Chernozhukov and Hansen (2005), who rely on a monotonicity/rank
invariance assumption in the outcome equation.
Assumption 3. IV

For almost all values of X:

(Y^0, Y^1, D^0, D^1) ⊥ Z | X
0 < Pr(Z = 1 | X) < 1
E(D^1 | X) ≠ E(D^0 | X)
Pr(D^1 ≥ D^0 | X) = 1

This assumption is well known and requires monotonicity (that is, the nonexistence of
defiers) in addition to a conditional independence assumption on the IV. Individuals
with D^1 > D^0 are referred to as compliers, and treatment effects can be identified only
for this group because the always- and never-participants cannot be induced to change
treatment status by hypothetical movements of the instrument.
Abadie, Angrist, and Imbens (2002) (AAI) impose assumption 3. Furthermore, they
require assumption 1 to hold for the compliers (that is, those observations with
D^1 > D^0). They show that the conditional QTE, δ_τ^{IV}, can be estimated by

(β̂_τ^{IV}, δ̂_τ^{IV}) = arg min_{β,δ} Σ_i W_i^{AAI} ρ_τ(Y_i − X_i′β − D_i δ)   (2)

where

W_i^{AAI} = 1 − D_i(1 − Z_i)/{1 − Pr(Z = 1 | X_i)} − (1 − D_i)Z_i/Pr(Z = 1 | X_i)
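In code, these weights follow directly from the formula once an estimate of the
instrument propensity score is available. A minimal sketch, assuming hypothetical
variables d (treatment), z (instrument), and phat (a user-supplied estimate of
Pr(Z = 1 | X)):

* AAI weights; note that they can be negative, which motivates (3) below
generate double waai = 1 - d*(1 - z)/(1 - phat) - (1 - d)*z/phat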
The intuition for these weights can be given in two steps. First, by assumption 3,⁶

(Y^0, Y^1, D^0, D^1) ⊥ Z | X  ⟹  (Y^0, Y^1) ⊥ Z | X, D^1 > D^0  ⟹  (Y^0, Y^1) ⊥ D | X, D^1 > D^0

This means that any observed relationship between D and Y has a causal interpretation
for compliers. To use this result, we have to find the compliers in the population. This is
done in the following average sense by the weights W_i^{AAI}:⁷

E{W_i^{AAI} ρ_τ(Y_i − X_i′β − D_i δ)} = Pr(D^1 > D^0) E{ρ_τ(Y_i − X_i′β − D_i δ) | D^1 > D^0}

Intuitively, this result holds because W_i^{AAI} = 1 for the compliers and because

E(W_i^{AAI} | D_{i,1} = D_{i,0} = 0) = E(W_i^{AAI} | D_{i,1} = D_{i,0} = 1) = 0.
A preliminary estimator for Pr(Z = 1 | X_i) is needed to implement this estimator.
ivqte uses the local logit estimator described in appendix B.⁸ A problem with

6. This is the result of lemma 2.1 in Abadie, Angrist, and Imbens (2002).
7. This is a special case of theorem 3.1.a in Abadie (2003).
8. In their original article, Abadie, Angrist, and Imbens (2002) use a series estimator instead of a
local estimator as in ivqte. Nevertheless, one can also use series estimation or, in fact, any other
method to estimate the propensity score by first generating a variable containing the estimated
propensity score and informing ivqte via the phat() option that the propensity-score estimate is
supplied by the user.
estimator (2) is that the optimization problem is not convex because some of the
weights are negative while others are positive. Therefore, this estimator has not been
implemented. Instead, ivqte implements the AAI estimator with positive weights.
Abadie, Angrist, and Imbens (2002) have shown that, as an alternative to W_i^{AAI}, one
can use the weights

W_i^{AAI+} = E(W^{AAI} | Y_i, D_i, X_i)   (3)

instead, which are always positive. Because these weights are unknown, ivqte uses local
linear regression to estimate W_i^{AAI+}; see appendix B. Some of these estimated weights
might be negative in finite samples, which are then set to zero.⁹
2.3 Unconditional QTEs

The two estimators presented above focused on conditional treatment effects, that is,
effects conditional on a set of variables X. We will now consider unconditional QTEs,
which have some advantages over the conditional effects. The unconditional QTE (for
quantile τ) is given by

Δ_τ = Q_τ^{Y^1} − Q_τ^{Y^0}

First, the definition of the unconditional QTE does not change when we change the
set of covariates X. Although we aim to estimate the unconditional effect, we still
use the covariates X for two reasons. On the one hand, we often need covariates to
make the identification assumptions more plausible. On the other hand, covariates can
increase efficiency. Therefore, covariates X are included in the first-step regression and
then integrated out. However, the definition of the effects is not a function of the
covariates. This is an advantage over the conditional QTE, which changes with the set
of conditioning variables even if the covariates are not needed to satisfy the selection-on-observables
or the IV assumptions.
A very simple example illustrates this advantage. Assume that the treatment D
has been completely randomized and is therefore independent both of the potential
outcomes and of the covariates. A simple comparison of the distributions of Y in
the treated and nontreated populations has a causal interpretation in such a situation.
For efficiency reasons, however, we may wish to include covariates in the estimation. If
we are interested in mean effects, it is well known that including in a linear regression
covariates that are independent of the treatment leaves the estimated treatment effect
asymptotically unchanged. This property is lost for QTEs! Including covariates that
are independent of the treatment can change the limit of the estimated conditional
QTEs. On the other hand, it does not change the unconditional treatment effects if
the assumptions of the model are satisfied for both sets of covariates, which is trivially
the case in our randomized example.

A second advantage of unconditional effects is that they can be estimated consistently
at the √n rate without any parametric restrictions, which is not possible for
conditional effects. For the conditional QTE, we therefore only implemented estimators

9. Again, other estimators may be used with ivqte. The weights are first estimated by the user and
then supplied via the what() option.
with a parametric restriction. The following estimators of the unconditional QTE are
entirely nonparametric, and we will no longer invoke assumption 1. This is an important
advantage because parametric restrictions are often difficult to justify from a theoretical
point of view. In addition, assumption 1 restricts the QTE to be the same independently
of the value of X. Obviously, interaction terms may be included, but the effects in
the entire population are often more interesting than many effects for different covariate
combinations.

The interpretation of the unconditional effects is slightly different from the interpretation
of the conditional effects, even if the conditional QTE is independent of the
value of X. This is because of the definition of the quantile. For instance, if we are
interested in a low quantile, the conditional QTE will summarize the effect for individuals
with a Y that is low relative to their X, even if their absolute level of Y is high. The
unconditional QTE, on the other hand, will summarize the effect for individuals with a
relatively low absolute Y.

Finally, the conditional and unconditional QTEs are trivially the same in the absence
of covariates. They are also the same if the effect is the same independent of the value
of the covariates and of the value of the quantile τ. This is often called the location-shift
model because the treatment affects only the location of the distribution of the
potential outcomes.
2.4 Unconditional endogenous QTEs

We consider first the case of an endogenous treatment with a binary IV Z. This includes
the situation with exogenous treatment as a special case when we use Z = D.
Frölich and Melly (2008) showed that

(α̂_τ^{IV}, Δ̂_τ^{IV}) = arg min_{α,δ} Σ_i W_i^{FM} ρ_τ(Y_i − α − D_i δ)   (4)

where

W_i^{FM} = {Z_i − Pr(Z = 1 | X_i)} / [Pr(Z = 1 | X_i){1 − Pr(Z = 1 | X_i)}] · (2D_i − 1)
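Again, the weights are a one-line transformation of the data. A minimal sketch with
the same hypothetical variables d, z, and phat as before:

* Froelich-Melly weights; the factor (2D-1) flips the sign for D = 0
generate double wfm = (z - phat)/(phat*(1 - phat))*(2*d - 1)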
This is a bivariate quantile regression estimator with weights. One can easily see that
α_τ^{IV} + Δ_τ^{IV} is identified only from the D = 1 observations and that α_τ^{IV} is
identified only from the D = 0 observations. Therefore, this estimator is equivalent to using
two univariate weighted quantile regressions separately for the D = 1 and the D = 0
observations.¹⁰

There are two differences between (4) and (2): The covariates are not included in the
weighted quantile regression in (4), and the weights are different.¹¹

10. The previous expression is numerically identical to α̂_τ^{IV} = arg min_{q_0} Σ_{i: D_i=0} W_i^{FM} ρ_τ(Y_i − q_0)
and α̂_τ^{IV} + Δ̂_τ^{IV} = arg min_{q_1} Σ_{i: D_i=1} W_i^{FM} ρ_τ(Y_i − q_1), from which we thus obtain Δ̂_τ^{IV} via two univariate
quantile regressions.
11. The weights W_i^{FM} were suggested in theorems 3.1.b and 3.1.c of Abadie (2003) for a general purpose.
Frölich and Melly (2008) used these weights to estimate unconditional QTEs.
One might think about running a weighted quantile regression of Y on a constant and D
by using the weights W_i^{AAI}. For that purpose, however, the weights of Abadie, Angrist,
and Imbens (2002) are not correct, as shown in Frölich and Melly (2008). This estimator
would estimate the difference between the quantile of Y^1 for the treated compliers and the
quantile of Y^0 for the nontreated compliers, which is not meaningful in general. However,
the weights W_i^{AAI} could be used to estimate unconditional effects in the special case
when the IV is independent of X such that Pr(Z = 1 | X) is not a function of X.

On the other hand, if one is interested in estimating conditional QTEs using a parametric
specification, the weights W_i^{FM} could be used as well. Hence, although not
developed for this case, the weights W_i^{FM} can be used to identify conditional QTEs. It
is not clear whether W_i^{FM} or W_i^{AAI} will be more efficient. For estimating conditional
effects, both are inefficient anyway because they do not incorporate the conditional
density function of the error term at the quantile.

Intuitively, the difference between the weights W_i^{AAI} and W_i^{FM} can be explained as
follows: They both find the compliers in the average sense discussed above. However,
only W_i^{FM} simultaneously balances the distribution of the covariates between treated
and nontreated compliers. Therefore, W_i^{AAI} can be used only in combination with a
conditional model because there is no need to balance covariates in such a case. It can
also be used without a conditional model when the treated and nontreated compliers
have the same covariate distribution. W_i^{FM}, on the other hand, can be used with or
without a conditional model.
A preliminary estimator for Pr(Z = 1 | X_i) is needed to implement this estimator.
ivqte uses the local logit estimator described in appendix B. The optimization problem
(4) is neither convex nor smooth. However, only two parameters have to be estimated.
In fact, one can easily show that the estimator can be written as two univariate quantile
regressions, which can easily be solved despite the nonsmoothness; see the previous
footnotes. This is the way ivqte proceeds when the positive option is not activated.¹²

An alternative to solving this nonconvex problem consists in using the weights

W_i^{FM+} = E(W^{FM} | Y_i, D_i)   (5)

which are always positive. ivqte estimates these weights by local linear regression if
the positive option has been activated. Again, estimated negative weights will be set
to zero.¹³

12. More precisely, ivqte solves the convex problem for the distribution function, then monotonizes
the estimated distribution function using the method of Chernozhukov, Fernández-Val, and
Galichon (2010), and finally inverts it to obtain the quantiles. The parameters chosen in this way
solve the first-order conditions of the optimization problem, and therefore, the asymptotic results
apply to them.
13. If one is interested in average treatment effects, Frölich (2007b) has proposed an estimator for
average treatment effects based on the same set of assumptions. This estimator has been implemented
in Stata in the command nplate, which can be downloaded from the websites of the authors of
this article.
2.5 Unconditional exogenous QTEs

Finally, we consider the case where the treatment is exogenous, conditional on X. We
assume that X contains all confounding variables, which we denote as the selection-on-observables
assumption. We also have to assume that the support of the covariates is
the same independent of the treatment, because in a nonparametric model, we cannot
extrapolate the conditional distribution outside the support of the covariates.

Assumption 4. Selection on observables and common support

(Y^0, Y^1) ⊥ D | X
0 < Pr(D = 1 | X) < 1

Assumption 4 identifies the unconditional QTE, as shown in Firpo (2007), Frölich
(2007a), and Melly (2006). The estimator of Firpo (2007) is a special case of (4), when
D is used as its own instrument. The weighting estimator for Δ_τ therefore is

(α̂_τ, Δ̂_τ) = arg min_{α,δ} Σ_i W_i^F ρ_τ(Y_i − α − D_i δ)   (6)

where

W_i^F = D_i/Pr(D = 1 | X_i) + (1 − D_i)/{1 − Pr(D = 1 | X_i)}

This is a traditional propensity-score weighting estimator, also known as inverse
probability weighting. A preliminary estimator for Pr(D = 1 | X_i) is needed to implement
this estimator. ivqte uses the local logit estimator described in appendix B.
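The weights themselves are again a one-liner. A minimal sketch, assuming hypothetical
variables d (treatment) and phat (here an estimate of Pr(D = 1 | X)):

* Inverse-probability weights of Firpo (2007)
generate double wf = d/phat + (1 - d)/(1 - phat)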
3 The ivqte command

3.1 Syntax

The syntax of ivqte is as follows:

ivqte depvar [indepvars] (treatment [= instrument]) [if] [in] [,
    quantiles(numlist) continuous(varlist) dummy(varlist) unordered(varlist)
    aai linear mata_opt kernel(kernel) bandwidth(#) lambda(#) trim(#)
    positive pbandwidth(#) plambda(#) pkernel(kernel) variance
    vbandwidth(#) vlambda(#) vkernel(kernel) level(#)
    generate_p(newvarname[, replace]) generate_w(newvarname[, replace])
    phat(varname) what(varname)]
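To fix ideas, a call using this syntax might look as follows; the variable names
(lnwage, training, offer, age, educ, married) are purely hypothetical:

* Unconditional QTEs of a binary treatment with a binary instrument
ivqte lnwage (training = offer), quantiles(.25 .5 .75) ///
    continuous(age educ) dummy(married) variance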
Here are the formulas for the Epanechnikov kernels of order 4 and 6, respectively:

K(z) = (15/8 − 35/8 z²) (3/4)(1 − z²) 1(|z| < 1)

K(z) = (175/64 − 525/32 z² + 5775/320 z⁴) (3/4)(1 − z²) 1(|z| < 1)

And here are the formulas for the Gaussian kernels of order 4, 6, and 8, respectively:

K(z) = (1/2)(3 − z²) φ(z)

K(z) = (1/8)(15 − 10z² + z⁴) φ(z)

K(z) = (1/48)(105 − 105z² + 21z⁴ − z⁶) φ(z)
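These polynomials are straightforward to code. As an illustration, a minimal Mata
sketch of the order-4 Gaussian kernel (the function name k4gauss is ours):

mata:
// Order-4 Gaussian kernel, evaluated elementwise at the points in z
real colvector k4gauss(real colvector z)
{
    return( 0.5:*(3 :- z:^2):*normalden(z) )
}
end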
bandwidth(#) sets the bandwidth h used to smooth over the continuous variables in the
estimation of the propensity score. The continuous regressors are first orthogonalized
such that their covariance matrix is the identity matrix. The bandwidth must be
strictly positive. If the bandwidth h is missing, an infinite bandwidth h = ∞ is
used; the default value is infinity. If the bandwidth h is infinity and the parameter
λ is one, a global model (linear or logit) is estimated without any local smoothing.
The cross-validation procedure implemented in locreg can be used to guide the
choice of the bandwidth. Because the optimal bandwidth converges at a faster rate
than the cross-validated bandwidth, the robustness of the results with respect to a
smaller bandwidth should be examined.

lambda(#) sets the λ used to smooth over the dummy and unordered discrete variables
in the estimation of the propensity score. It must be between 0 and 1. A value
of 0 implies that only observations within the cell defined by all discrete regressors
are used. The default is lambda(1), which corresponds to global smoothing. If
the bandwidth h is infinity and λ = 1, a global model (linear or logit) is estimated
without any local smoothing. The cross-validation procedure implemented in locreg
can be used to guide the choice of lambda. Again, the robustness of the results with
respect to a smaller bandwidth should be examined.
Estimation of the weights

trim(#) controls the amount of trimming. All observations with an estimated propensity
score less than trim() or greater than 1 − trim() are trimmed and not used
further by the estimation procedure. This prevents giving very high weights to single
observations. The default is trim(0.001). This option is not useful for the Koenker
and Bassett (1978) estimator, where no propensity score is estimated.

positive is used only with the Frölich and Melly (2008) estimator. If it is activated, the
positive weights W_i^{FM+} defined in (5) are estimated by the projection of the weights
W^{FM} on the dependent and the treatment variables. Weights W^{FM+} are estimated
by nonparametric regression on Y, separately for the D = 1 and the D = 0 samples.
After the estimation, negative estimated weights in Ŵ_i^{FM+} are set to zero.

pbandwidth(#), plambda(#), and pkernel(kernel) are used to calculate the positive
weights. These options are useful only for the Abadie, Angrist, and Imbens (2002)
estimator, which can be activated via the aai option, and for the Frölich and Melly
(2008) estimator, but only when the positive option has been activated to estimate
W^{FM+}. They are defined similarly to kernel(), bandwidth(), and lambda(). When
pkernel(), pbandwidth(), and plambda() are not specified, the values given in
kernel(), bandwidth(), and lambda() are taken as defaults.

The positive weights are always estimated by local linear regression. After
estimation, negative estimated weights are set to zero. The smoothing parameters
pbandwidth() and plambda() are in principle as important as the other smoothing
parameters bandwidth() and lambda(), and it is worth inspecting the robustness
of the results with respect to these parameters. Cross-validation can also be used to
guide these choices.
Inference

variance activates the estimation of the variance. By default, no standard errors are
estimated, because the estimation of the variance can be computationally demanding.
Except for the classical linear quantile regression estimator, it requires the estimation
of many nonparametric functions. This option should not be activated if you
bootstrap the results, unless you bootstrap t-values to exploit possible asymptotic
refinements.

vbandwidth(#), vlambda(#), and vkernel(kernel) are used to calculate the variance
if the variance option has been selected. They are defined similarly to
bandwidth(), lambda(), and kernel() and are used only to estimate the variance.
A quick-and-dirty estimate of the variance can be obtained by setting
vbandwidth() to infinity and vlambda() to 1, which is much faster than any other
choice. When vkernel(), vbandwidth(), or vlambda() is not specified, the values
given in kernel(), bandwidth(), and lambda() are taken as defaults.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The
default is level(95) or as set by set level.
Saved propensity scores and weights

generate_p(newvarname[, replace])

A.1 Conditional exogenous QTEs

The asymptotic distribution of the quantile regression estimator defined in (1) is
given by²²

√n(β̂_τ − β_τ) → N(0, J_τ^{-1} Σ_τ J_τ^{-1})

where J_τ = E{f_{Y|X}(X′β_τ) XX′} and Σ_τ = τ(1 − τ) E(XX′). The term Σ_τ is
straightforward to estimate by τ(1 − τ) n^{-1} Σ_i X_i X_i′. We estimate J_τ by the kernel
method of Powell (1986),

Ĵ_τ = 1/(nh_n) Σ_i k({Y_i − X_i′β̂_τ}/h_n) X_i X_i′
22. See, for example, Koenker (2005).
where k is a univariate kernel function and h_n is a bandwidth sequence. In the actual
implementation, we use a normal kernel and the bandwidth suggested by Hall and
Sheather (1988),

h_n = n^{-1/3} Φ^{-1}(1 − level/2)^{2/3} [ 1.5 {φ(Φ^{-1}(τ))}² / (2{Φ^{-1}(τ)}² + 1) ]^{1/3}

where level is the level for the intended confidence interval, and φ and Φ are the normal
density and distribution functions, respectively. This estimator of the asymptotic
variance is consistent under heteroskedasticity, which is in contrast to the official Stata
command for quantile regression, qreg. This is important because quantile regression only
becomes interesting when the errors are not independent and identically distributed.
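For a concrete number, the bandwidth formula can be evaluated directly; a minimal
sketch, reading level as the significance level of the intended interval (all scalar names
hypothetical):

* Hall-Sheather bandwidth at tau = .5 with n = 1,000 and alpha = .05
scalar n     = 1000
scalar tau   = .5
scalar alpha = .05
scalar ztau  = invnormal(tau)
scalar hn    = n^(-1/3)*invnormal(1 - alpha/2)^(2/3)* ///
               (1.5*normalden(ztau)^2/(2*ztau^2 + 1))^(1/3)
display hn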
A.2 Conditional endogenous QTEs

The asymptotic distribution of the IV quantile regression estimator defined in (2) is
given by

√n(β̂_τ^{IV} − β_τ^{IV}) → N(0, I_τ^{-1} Σ_τ I_τ^{-1})   (7)

where I_τ = E{f_{Y|X,D^1>D^0}(X′β_τ) XX′ | D^1 > D^0} Pr(D^1 > D^0) and Σ_τ = E(ΨΨ′), with

Ψ = W^{AAI} m_τ(X, Y) + H(X){Z − Pr(Z = 1 | X)}

m_τ(X, Y) = {τ − 1(Y − X′β_τ^{IV} < 0)}X

H(X) = E[ m_τ(X, Y) { (1 − D)Z/Pr(Z = 1 | X)² − D(1 − Z)/{1 − Pr(Z = 1 | X)}² } | X ]
We estimate these elements as

Î_τ = 1/(nh_n) Σ_i Ŵ_i^{AAI+} k({Y_i − X_i′β̂_τ^{IV}}/h_n) X_i X_i′

where Ŵ_i^{AAI+} are estimates of the projected weights. For the kernel function in the
previous regression, we use an Epanechnikov kernel and h_n = n^{-0.2} √Var(Y_i − X_i′β̂_τ^{IV}),
as proposed by Abadie, Angrist, and Imbens (2002).²³
Furthermore, Ĥ(X_i) is estimated by the local linear regression of

{τ − 1(Y_i − X_i′β̂_τ^{IV} < 0)} X_i { (1 − D_i)Z_i/P̂r(Z = 1 | X_i)² − D_i(1 − Z_i)/{1 − P̂r(Z = 1 | X_i)}² }

on X_i. This nonparametric regression is controlled by the options vkernel(), vbandwidth(),
and vlambda() in ivqte. With these ingredients, we calculate

23. In principle, the same kernel and bandwidth as those for quantile regression can be used. These
choices were made to replicate the results of Abadie, Angrist, and Imbens (2002).
Ψ̂_i = Ŵ_i^{AAI} {τ − 1(Y_i − X_i′β̂_τ^{IV} < 0)} X_i + Ĥ(X_i){Z_i − P̂r(Z = 1 | X_i)}

Σ̂_τ = 1/n Σ_i Ψ̂_i Ψ̂_i′

where Ŵ_i^{AAI} are estimates of the weights.
A.3 Unconditional exogenous QTEs

The asymptotic distribution of the estimator defined in (6) is given by

√n(Δ̂_τ − Δ_τ) → N(0, V)

with

V = 1/f_{Y^1}²(Q_τ^{Y^1}) E[ F_{Y|D=1,X}(Q_τ^{Y^1}) {1 − F_{Y|D=1,X}(Q_τ^{Y^1})} / Pr(D = 1 | X) ]
  + 1/f_{Y^0}²(Q_τ^{Y^0}) E[ F_{Y|D=0,X}(Q_τ^{Y^0}) {1 − F_{Y|D=0,X}(Q_τ^{Y^0})} / {1 − Pr(D = 1 | X)} ]
  + E[ {η_1(X) − η_0(X)}² ]

where η_d(x) = F_{Y|D=d,X}(Q_τ^{Y^d}) / f_{Y^d}(Q_τ^{Y^d}). Q_τ^{Y^0} and Q_τ^{Y^1} have already been
estimated by α̂_τ and α̂_τ + Δ̂_τ. The densities f_{Y^d}(Q_τ^{Y^d}) are estimated by the weighted
kernel estimators

f̂_{Y^d}(Q̂_τ^{Y^d}) = 1/(nh_n) Σ_{i: D_i=d} W_i^F k({Y_i − Q̂_τ^{Y^d}}/h_n)

with Epanechnikov kernel function and Silverman (1986) bandwidth choice, and where
F_{Y|D=d,X}(Q_τ^{Y^d}) is estimated by the local logit estimator described in appendix B.
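A minimal sketch of this weighted density estimate at a single point, assuming
hypothetical variables y, d, and wf (the weights computed earlier) and given values for
the estimated quantile q and the bandwidth h:

* Weighted Epanechnikov density estimate of f_{Y^1} at q
scalar q = 2.3
scalar h = .5
quietly count
scalar n = r(N)
tempvar u contrib
generate double `u' = (y - q)/h
generate double `contrib' = wf*0.75*(1 - `u'^2)*(abs(`u') < 1) if d == 1
quietly summarize `contrib'
scalar fhat = r(sum)/(n*h)
display fhat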
A.4 Unconditional endogenous QTEs

Finally, the asymptotic variance of the estimator defined in (4) is the most tedious and
is given by

V = 1/{P_c² f_{Y^1|c}²(Q_τ^{Y^1|c})} E[ π(X, 1)/p(X) · F_{Y|D=1,Z=1,X}(Q_τ^{Y^1|c}) {1 − F_{Y|D=1,Z=1,X}(Q_τ^{Y^1|c})} ]
  + 1/{P_c² f_{Y^1|c}²(Q_τ^{Y^1|c})} E[ π(X, 0)/{1 − p(X)} · F_{Y|D=1,Z=0,X}(Q_τ^{Y^1|c}) {1 − F_{Y|D=1,Z=0,X}(Q_τ^{Y^1|c})} ]
  + 1/{P_c² f_{Y^0|c}²(Q_τ^{Y^0|c})} E[ {1 − π(X, 1)}/p(X) · F_{Y|D=0,Z=1,X}(Q_τ^{Y^0|c}) {1 − F_{Y|D=0,Z=1,X}(Q_τ^{Y^0|c})} ]
  + 1/{P_c² f_{Y^0|c}²(Q_τ^{Y^0|c})} E[ {1 − π(X, 0)}/{1 − p(X)} · F_{Y|D=0,Z=0,X}(Q_τ^{Y^0|c}) {1 − F_{Y|D=0,Z=0,X}(Q_τ^{Y^0|c})} ]
  + E[ ( {π(X, 1)² η_{11}(X) + {1 − π(X, 1)}² η_{01}(X)}/p(X)
       + {π(X, 0)² η_{10}(X) + {1 − π(X, 0)}² η_{00}(X)}/{1 − p(X)} ) p(X){1 − p(X)}² ]

where η_{dz}(x) = F_{Y|D=d,Z=z,X}(Q_τ^{Y^d|c}) / {P_c f_{Y^d|c}(Q_τ^{Y^d|c})}; p(x) = Pr(Z = 1 | X = x);
π(x, z) = Pr(D = 1 | X = x, Z = z); and P_c is the fraction of compliers. Q_τ^{Y^0|c} and
Q_τ^{Y^1|c} have already been estimated by α̂_τ^{IV} and α̂_τ^{IV} + Δ̂_τ^{IV}, respectively. The terms
F_{Y|D=d,Z=z,X}(Q_τ^{Y^d|c}), p(X), and π(x, z) are estimated by the local logit estimator
described in appendix B. Finally, P_c is estimated by 1/n Σ_i {π̂(X_i, 1) − π̂(X_i, 0)}.
To estimate the densities f_{Y^d|c}(Q_τ^{Y^d|c}), we note that²⁴

f_{Y^d|c}(Q_τ^{Y^d|c}) = lim_{h→0} 1/h ∫₀¹ k({Q_θ^{Y^d|c} − Q_τ^{Y^d|c}}/h) dθ

where k is the Epanechnikov kernel function with Silverman (1986) bandwidth. We
therefore estimate f_{Y^d|c} as

f̂_{Y^d|c}(Q̂_τ^{Y^d|c}) = 1/h ∫₀¹ k({Q̂_θ^{Y^d|c} − Q̂_τ^{Y^d|c}}/h) dθ

where we replace the integral by a sum over n uniformly spaced values of θ between 0
and 1.
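In code, the integral is just an average of kernel evaluations over a grid of quantile
levels. A minimal Mata sketch, assuming qhat holds the estimated quantiles Q̂_θ^{Y^d|c}
on a uniform grid of θ values, qtau the estimated τ-quantile, and h the bandwidth:

mata:
// Approximate (1/h) * integral of k((Qhat_theta - Qhat_tau)/h) dtheta
// by the average of Epanechnikov kernel values over the theta grid
real scalar fdens(real colvector qhat, real scalar qtau, real scalar h)
{
    real colvector u
    u = (qhat :- qtau)/h
    return( mean(0.75:*(1 :- u:^2):*(abs(u) :< 1))/h )
}
end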
B Nonparametric regression with mixed data

B.1 Local parametric regression

A key ingredient for the previously introduced estimators (except for the exogenous
conditional quantile regression estimator) is the nonparametric estimation of some weights.
Local linear and local logit estimators have been implemented for this purpose. This
is fully automated in the ivqte command. Nevertheless, some understanding of the
nonparametric estimators facilitates the use of the ivqte command.

In many instances, we need to estimate conditional expected values like
E(Y | X = X_i). We use a local parametric approach throughout; that is, we estimate a
locally weighted version of the parametric model. A complication is that many econometric
applications contain continuous as well as discrete regressors X. Both types of
regressors need to be accommodated in the local parametric model and in the kernel
function defining the local neighborhood. The traditional nonparametric approach consists
of estimating the model within each of the cells defined by the discrete regressors
24. To see this, note that

1/h ∫₀¹ k({Q_θ^{Y^d|c} − Q_τ^{Y^d|c}}/h) dθ = ∫ k(u) f_{Y^d|c}(uh + Q_τ^{Y^d|c}) du

where we used the change of variables uh = Q_θ^{Y^d|c} − Q_τ^{Y^d|c}, which implies that
θ = F_{Y^d|c}(uh + Q_τ^{Y^d|c}) and dθ = f_{Y^d|c}(uh + Q_τ^{Y^d|c}) h du. By the mean value
theorem, f_{Y^d|c}(uh + Q_τ^{Y^d|c}) = f_{Y^d|c}(Q_τ^{Y^d|c}) + uh f′_{Y^d|c}(Q̄), where Q̄ lies between
Q_τ^{Y^d|c} and uh + Q_τ^{Y^d|c}. Hence, the integral equals
f_{Y^d|c}(Q_τ^{Y^d|c}) ∫ k(u) du + O(h) = f_{Y^d|c}(Q_τ^{Y^d|c}) + O(h).
and of smoothing only with respect to the continuous covariates. When the number
of cells in a dataset is large, each cell may not have enough observations to
nonparametrically estimate the relationship among the remaining continuous variables. For
this reason, many applied researchers have treated discrete variables in a parametric
way. We follow an intermediate way and use the hybrid product kernel developed by
Racine and Li (2004). This estimator covers all cases from the fully parametric model
up to the traditional nonparametric estimator.

Overall, we can distinguish four different types of regressors: continuous (for example,
age), ordered discrete (for example, family size), unordered discrete (for example,
regions), and binary variables (for example, gender). We will treat ordered discrete and
continuous variables in the same way and will refer to them as continuous variables in
the following discussion.²⁵

The unordered discrete and the binary variables are handled differently in the kernel
function and in the local parametric model. The binary variables enter into both as
single regressors. The unordered discrete variables, however, enter as a single regressor
in the kernel function and as a vector of dummy variables in the local model. Consider,
for example, a variable called region that takes four different values: north, south, east,
and west. This variable enters as a single variable in the kernel function but is included
in the local model in the form of three dummies: south, east, and west.
The kernel function is defined as follows. Suppose that the variables in X are arranged
such that the first q_1 regressors are continuous (including the ordered discrete
variables) and the remaining Q − q_1 regressors are discrete without natural ordering
(including binary variables). The kernel weights K(X_i − x) are computed as

K_{h,λ}(X_i − x) = ∏_{q=1}^{q_1} κ({X_{q,i} − x_q}/h) × ∏_{q=q_1+1}^{Q} λ^{1(X_{q,i} ≠ x_q)}

where X_{q,i} and x_q denote the qth elements of X_i and x, respectively; 1(·) is the indicator
function; κ is a symmetric univariate weighting function; and h and λ are positive
bandwidth parameters with 0 ≤ λ ≤ 1. This kernel function measures the distance
between X_i and x through two components: The first term is the standard product
kernel for continuous regressors, with h defining the size of the local neighborhood. The
second term measures the mismatch between the unordered discrete (including binary)
regressors. λ defines the penalty for the unordered discrete regressors. For example,
the multiplicative weight contribution of the Qth regressor is 1 if the Qth elements of
X_i and x are identical, and it is λ if they are different. If h = ∞ and λ = 1, then
the nonparametric estimator corresponds to the global parametric estimator, and no
interaction term between the covariates is allowed. On the other hand, if λ is zero
and h is small, then smoothing proceeds only within each of the cells defined by the

25. Racine and Li (2004) suggest using a geometrically declining kernel function for the ordered discrete
regressors. There are no reasons, however, against using quadratically declining kernel weights. In
other words, we can use the same (for example, Epanechnikov) kernel for ordered discrete as for
continuous regressors. We therefore treat ordered discrete regressors in the same way as continuous
regressors in the following discussion.
discrete regressors, and only observations with similar continuous covariates will be used.
Finally, if λ and h are in the intermediate range, observations with similar discrete and
continuous covariates will be weighted more, but further observations will also be used.

In principle, instead of using only two bandwidth values (h, λ) for all regressors, a
different bandwidth could be employed for each regressor, but doing so would substantially
increase the computational burden for bandwidth selection. This approach might also lead
to additional noise due to estimating these bandwidth parameters. Therefore, we prefer
to use only two smoothing parameters. ivqte automatically orthogonalizes the data
matrix of all continuous regressors to create an identity covariance matrix. This greatly
diminishes the appeal of having multiple bandwidths.
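A minimal Mata sketch of this product kernel, assuming xi and x are conformable row
vectors with the q1 continuous regressors placed first and taking κ to be the Epanechnikov
kernel:

mata:
// Racine-Li mixed kernel weight between observation xi and point x
real scalar rlweight(real rowvector xi, real rowvector x,
                     real scalar q1, real scalar h, real scalar lambda)
{
    real scalar w, q, u
    w = 1
    for (q = 1; q <= cols(xi); q++) {
        if (q <= q1) {                        // continuous: Epanechnikov
            u = (xi[q] - x[q])/h
            w = w*0.75*(1 - u^2)*(abs(u) < 1)
        }
        else w = w*lambda^(xi[q] != x[q])     // unordered: mismatch penalty
    }
    return(w)
}
end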
This kernel function, combined with a local model, is used to estimate E(Y | X).
If Y is a continuous variable, then ivqte uses by default a local linear estimator,
estimating E(Y | X = x) by â in

(â, b̂) = arg min_{a,b} Σ_{j=1}^{n} {Y_j − a − b(X_j − x)}² K_{h,λ}(X_j − x)
If Y is bounded from above and below, a local logistic model is usually preferred. We
suppose in the following discussion that Y is bounded within [0, 1].²⁶ This includes the
special case where Y is binary. The local logit estimator guarantees that the fitted
values are always between 0 and 1; it can be selected with the logit option. In this
case, E(Y | X = x) is estimated by Λ(â), where

(â, b̂) = arg max_{a,b} Σ_{j=1}^{n} ( Y_j ln Λ{a + b(X_j − x)}
         + (1 − Y_j) ln[1 − Λ{a + b(X_j − x)}] ) K_{h,λ}(X_j − x)

and Λ(x) = 1/(1 + e^{−x}).
As mentioned before, each of the unordered discrete variables enters in the form of
a dummy variable for each of its support points except for an arbitrary base category;
for example, if the region variable takes four different values, then three dummies are
included.

The ivqte command requires that the values of the smoothing parameters h and
λ are supplied by the user. Before estimating the local linear or local logit model with
these smoothing parameters, ivqte (as well as locreg) first attempts to estimate the
global model (that is, with h = ∞ and λ = 1). If estimation fails due to collinearity or
perfect prediction, the regressors that caused these problems are eliminated.²⁷ Thereafter,
the model is estimated locally with the user-supplied smoothing parameters. If estimation

26. If the lower and upper bounds of Y are different from 0 and 1, Y should be rescaled to the interval
[0, 1].
27. This is done using rmcollright, where ivqte first searches for collinearity among the continuous
regressors and thereafter among all other regressors.
fails locally because of collinearity or perfect prediction, the bandwidths are increased
locally. This is repeated until convergence is achieved.

The locreg command also contains a leave-one-out cross-validation procedure to
choose the smoothing parameters.²⁸ The user provides a grid of values for h and λ, and
the cross-validation criterion is computed for all possible combinations of these values.
The values of the cross-validation criterion are returned in r(cross_valid), and the
combination that minimizes this criterion is chosen. If only one value is given for h and
λ, no grid search is performed.
B.2 The locreg command

Because the code implementing the nonparametric regressions is likely to be of independent
interest in other contexts, we offer a separate command for the local parametric
regressions. This locreg command implements local linear and local logit regression
and chooses the smoothing parameters by leave-one-out cross-validation. The formal
syntax of locreg is as follows:

locreg depvar [if] [in] [weight] [, generate(newvarname[, replace])
    continuous(varlist) dummy(varlist) unordered(varlist) kernel(kernel)
    bandwidth(# [# ...]) lambda(# [# ...]) logit mata_opt
    sample(varname[, replace])]

aweights and pweights are allowed; see [U] 11.1.6 weight for more information
on weights.
B.3 Description

locreg computes the nonparametric estimate of the mean of depvar conditional on
the regressors given in continuous(), dummy(), and unordered(). A mixed kernel is
used to smooth over the continuous and discrete regressors. The fitted values are saved
in the variable newvarname. If a list of values is given in bandwidth() or lambda(), the
smoothing parameters h and λ are estimated via leave-one-out cross-validation. The
values of h and λ minimizing the cross-validation criterion are selected. These values are
then used to predict depvar, and the fitted values are saved in the variable newvarname.

locreg can be used in three different ways. First, if only one value is given in
bandwidth() and one in lambda(), locreg estimates the nonparametric regression
using these values and saves the fitted values in generate(newvarname). Alternatively,
locreg can also be used to estimate the smoothing parameters via leave-one-out
cross-validation: if we do not specify the generate() option but supply a list of values in
the bandwidth() or lambda() option, only the cross-validation is performed. Finally, if
several values are specified in bandwidth() or lambda() when the generate() option is
also specified, locreg estimates the optimal smoothing parameters via cross-validation.
Thereafter, it estimates the conditional means with these smoothing parameters and
returns the fitted values in the variable given in generate(newvarname).

For the nonparametric regression, locreg offers two local models: linear and logistic.
The logistic model is usually preferred if depvar is bounded within [0, 1]. This includes
the case where depvar is binary but also incorporates cases where depvar is nonbinary
but bounded from above and below. If the lower and upper bounds of depvar are different
from 0 and 1, the variable depvar should be rescaled to the interval [0, 1] before using
this command. If depvar is not bounded from above and below, the linear model should
be used.²⁹
B.4 Options

generate(newvarname[, replace])

The following example selects the smoothing parameters h and λ by cross-validation
over the supplied grids and saves the fitted values in the fitted3 variable.

. locreg nearc4, generate(fitted3) bandwidth(0.2 0.5) lambda(0.5 0.8)
> continuous(exper motheduc) dummy(black) unordered(region)
(output omitted)
The Stata Journal (2010)
10, Number 3, pp. 458–481
Translation from narrative text to standard codes variables with Stata

Federico Belotti
University of Rome Tor Vergata
Rome, Italy
federico.belotti@uniroma2.it

Domenico Depalo
Bank of Italy
Rome, Italy
domenico.depalo@bancaditalia.it

Abstract. In this article, we describe screening, a new Stata command for data
management that can be used to examine the content of complex narrative-text
variables to identify one or more user-defined keywords. The command is useful
when dealing with string data contaminated with abbreviations, typos, or mistakes.
A rich set of options allows a direct translation from the original narrative string
to a user-defined standard coding scheme. Moreover, screening is flexible enough
to facilitate the merging of information from different sources and to extract or
reorganize the content of string variables.

Editors' note. This article refers to undocumented functions of Mata, meaning that
there are no corresponding manual entries. Documentation for these functions is
available only as help files; see help regex.

Keywords: dm0050, screening, keyword matching, narrative-text variables, standard
coding schemes
1 Introduction

Many researchers in varied fields frequently deal with data collected as narrative text,
which are almost useless unless treated. For example,

- Electronic patient records (EPRs) are useful for decision making and clinical
  research only if patient data that are currently documented as narrative text are
  coded in standard form (Moorman et al. 1994).
- When different sources of data use different spellings to identify the same unit of
  interest, the information can be exploited only if codes are made uniform (Raciborski
  2008).
- Because of verbatim responses to open-ended questions, survey data items must
  be converted into nominal categories with a fixed coding frame to be useful for
  applied research.

These are only three of the many critical examples that motivate an ad hoc command.
Recoding a narrative-text variable into a user-defined standard coding scheme is currently
possible in Stata by combining standard data-management commands (for example,
generate and replace) with regular expression functions (for example, regexm()).

© 2010 StataCorp LP dm0050
However, many problems do not yield easily to this approach, especially problems
involving complex narrative-text data. Consider, for example, the case when many source
variables can be used to identify a set of keywords; or the case when, looking at different
keywords, one is within a given source variable but not necessarily at the beginning of
that variable, whereas the others are at the beginning, the end, or within that or other
source variables. Because no command jointly handles all possible cases, these cases
can be treated with existing Stata commands only after long and tedious programming,
increasing the possibility of introducing errors. We developed the screening command
to fill this gap, simplifying data-cleaning operations while being flexible enough to cover
a wide range of situations.

In particular, screening checks the content of one or more string variables (sources)
to identify one or more user-defined regular expressions (keywords). Because string
variables are not flexible, to make the command easier and more useful, a set of options
reduces your preparatory burden. You can make the matching task wholly case insensitive
or set matching rules aimed at matching keywords at the beginning, the end,
or within one or more sources. If source variables contain periods, commas, dashes,
double blanks, ampersands, parentheses, etc., it is possible to perform the matching by
removing such undesirable content. Moreover, if the matching task becomes more difficult
because of abbreviations or even pure mistakes, screening allows you to specify
the number of letters to screen in a keyword. Finally, the command allows a direct
translation of the original string variables into a user-defined standard coding scheme.
All these features make the command simple, extremely flexible, and fast, minimizing
the possibility of introducing errors. It is worth emphasizing that we find Mata more
convenient to use than Stata for this task, with advantages in terms of execution time.

The article is organized as follows. In section 2, we describe the new screening
command, and we provide some useful tips in section 3. Section 4 illustrates the main
features of the command using EPR data, while section 5 details some critical cases in
which the use of screening may aid your decision to merge data from different sources
or to extract and reorder messy data. In the last section, section 6, we offer a short
summary.

2 The screening command

String variables are useful in many practical circumstances. A drawback is that they
are not so flexible: for example, in EPR data, coding CHOLESTEROL is different from
coding CHOLESTEROL LDL, although the broad pathology is the same. Stata and Mata
offer many built-in functions to handle strings. In particular, screening extensively
uses the Mata regular-expression functions regexm(), regexr(), and regexs().
2.1 Syntax

screening [if] [in], sources(varlist[, sourcesopts]) keys([matching_rule]
    "string" [...]) [letters(#) explore(type) cases(newvar)
    newcode(newvar[, newcodeopts])]
2.2 Options

sources(varlist[, sourcesopts])

recode() recodes the newcode() newvar according to a user-defined coding scheme.
recode() must contain at least one recoding rule followed by one user-defined code.
When you specify recode(1 "user defined code"), the "user defined code" will be used
to recode all matched cases from the first keyword within the list specified via the keys()
option. If recode(2,3 "user defined code") is specified, the "user defined code" will
be used to recode all cases for which the second and third keywords are simultaneously
matched, and so on. This option can only be specified if the newcode() option is
specified.
checksources checks whether source variables contain special characters. If a matching
rule is specified (begin or end via keys()), checksources checks the source
boundaries accordingly.

tabcheck tabulates all cases flagged by checksources. If there are too many cases, the
option does not produce a table.

memcheck performs a preventive memory check. When memcheck is specified, the
command will exit promptly if the allocated memory is insufficient to run screening.
When memory is insufficient and screening is run without memcheck, the command
could run for several minutes or even hours before producing the message "no room
to add more variables".
nowarnings suppresses all warning messages.

save saves in r() the number of cases detected when matching each source with each
keyword.

time reports the elapsed time for execution (in seconds).
3 Tips
The low exibility of string variables is a reason for concern. In this section, we provide
some tips to enhance the usefulness of screening. Some tips are useful to execute the
command, while other tips are useful to check the results.
Most importantly, capitalization matters: screening for KEYWORD is different from
screening for keyword. If source variables contain HEMINGWAY and you are
searching for Hemingway, screening will not identify the keyword. If suboption upper
(lower) is specified in sources(), keywords will be automatically matched in uppercase
(lowercase).
Choose an appropriate matching rule. The screening default is to match keywords
over the entire content of source variables. By specifying the matching rule begin or
end within the keys() option, you may switch the matching rule to the corresponding
string boundary. For example, if sources contain HEMINGWAY ERNEST and ERNEST HEMINGWAY
and you are searching for begin HEMINGWAY, the screening command will identify the
keyword only in the former case. Whether the two cases are equivalent must be evaluated
case by case.
Another issue is how to choose the optimal number of letters to be screened. For
example, with EPR data, different physicians might use different abbreviations for the
same pathologies, so there is no single right number of letters. As a rule of thumb,
the number of letters should be specified as the minimum number that uniquely
identifies the case of interest: too many letters can be too exclusive, while too few
can be too inclusive. In all cases, but in particular when the appropriate number of
letters is unknown, we find it useful to tabulate all matched cases through the
explore(tab) option. Because it tabulates all possible matches between all keywords
and all source variables, it is the fastest way to explore the data and choose the best
matching strategy (in terms of keywords, matching rule, and letters), as the example
below illustrates.
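For instance, a call along the following lines (hypothetical: the source variable
diagnosis and the keyword are ours) would tabulate all matches of the first six
letters of CHOLESTEROL, uppercased:

. screening, sources(diagnosis, upper) keys("CHOLESTEROL") letters(6) explore(tab)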
Advanced users can exploit the full potential of screening by mixing keywords
with Stata regular-expression operators. Mixing in operators allows you to match more
complex patterns, as we show later in the article.¹ For more details on regular-expression
syntax and operators, see the official documentation at
http://www.stata.com/support/faqs/data/regex.html.
1. The letters() option does not work if a keyword contains regular-expression operators.
screening displays several messages to inform you about the effects of the specified
options. For example, suppose you are searching for keywords containing
regular-expression operators: screening will display a message with the correct
syntax to search for such keywords. The nowarnings option allows you to suppress
all warning messages.
screening generates several temporary variables (proportional to the number of
keywords you are searching for and the number of source variables you are searching
within). So when you are working with a big dataset and your computer is limited in
terms of RAM, it might be a good idea to perform a preventive memory check. When the
memcheck option is specified and the allocated memory is insufficient, screening will
exit promptly rather than running for several minutes or even hours before producing
the message no room to add more variables.
We conclude this section with an evaluation of the command in terms of execution
time using different Stata flavors and different operating systems. In particular, we
compare the latest version of screening, written using Mata regular-expression
functions, with its beta version, written entirely using the Stata counterparts. We built
three datasets of 500,000 (A), 5 million (B), and 50 million (C) observations with an ad hoc
source variable containing 10 different words: HEMINGWAY, FITZGERALD, DOSTOEVSKIJ,
TOLSTOJ, SAINT-EXUPERY, HUGO, CERVANTES, BUKOWSKI, DUMAS, and DESSI. Screening
for HEMINGWAY (50% of total cases) gives the following results (in seconds):
                                              Mata                  Stata
  Stata flavor and operating system       A      B      C       A      B      C

  Stata/SE 10 (32-bit) and
  Mac OS X 10.5.8 (64-bit)              0.66   6.67    na     0.93   9.24    na
m_r = (1/n) Σⁿᵢ₌₁ (yᵢ − ȳ)^r

so that m₂ and s = √m₂ are versions of, respectively, the sample variance and sample
standard deviation.
Here sample skewness is defined as

g₁ = m₃ / m₂^(3/2) = m₃ / s³ = √b₁
1. Kaplansky's paper is one of a few that he wrote in the mid-1940s on probability and statistics. He
is much better known as a distinguished algebraist (Bass and Lam 2007; Kadison 2008).
while sample kurtosis is defined as

b₂ = m₄ / m₂² = m₄ / s⁴ = g₂ + 3
Hence, both of the last two measures are scaled or dimensionless: whatever units
of measurement were used appear raised to the same powers in both numerator and
denominator, and so cancel out. The commonly used m, s, b, and g notation corresponds
to a longstanding μ, σ, β, and γ notation for the corresponding theoretical or population
quantities. If 3 appears to be an arbitrary constant in the last equation, one explanation
starts with the fact that normal or Gaussian distributions have β₁ = 0 and β₂ = 3;
hence, γ₂ = 0.
Naturally, if y is constant, then m₂ is zero; thus skewness and kurtosis are not
defined. This includes the case of n = 1. The stipulations that y is genuinely variable
and that n ≥ 2 underlie what follows.
Newcomers to this territory are warned that usages in the statistical literature vary
considerably, even among entirely competent authors. This variation means that
different formulas may be found for the same terms (skewness and kurtosis) and
different terms for the same formulas. To start at the beginning: although Karl Pearson
introduced the term skewness, and also made much use of β₁, he used skewness to refer
to (mean − mode)/standard deviation, a quantity that is well defined in his system
of distributions. In more recent literature, some differences reflect the use of divisors
other than n, usually with the intention of reducing bias, and so resembling in spirit
the common use of n − 1 as an alternative divisor for sample variance. Some authors
call γ₂ (or g₂) the kurtosis, while yet other variations may be found.
The key results for this column were extensively discussed by Wilkins (1944) and
Dalén (1987). Clearly, g₁ may be positive, zero, or negative, reflecting the sign of m₃.
Wilkins (1944) showed that there is an upper limit to its absolute value,

|g₁| ≤ (n − 2) / √(n − 1)     (1)
as was also independently shown by Kirby (1974). In contrast, b₂ must be positive and
indeed (as may be shown, for example, using the Cauchy-Schwarz inequality) must be
at least 1. More pointedly, Dalén (1987) showed that there is also an upper limit to its
value:

b₂ ≤ (n² − 3n + 3) / (n − 1)     (2)

The proofs of these inequalities are a little too long, and not quite interesting enough,
to reproduce here.
Both of these inequalities are sharp, meaning attainable. Test cases to explore the
precise limits have all values equal to some constant, except for one value that is equal
to another constant: n = 2, y₁ = 0, y₂ = 1 will do fine as a concrete example, for which
the skewness limit is 0/1 = 0 and the kurtosis limit is (4 − 6 + 3)/1 = 1.
For n = 2, we can rise above a mere example to show quickly that these results are
indeed general. The mean of two distinct values is halfway between them, so the
two deviations yᵢ − ȳ have equal magnitude and opposite sign. Thus their cubes have
sum 0, and m₃ and b₁ are both identically equal to 0. Alternatively, such values are
geometrically two points on the real line, a configuration that is evidently symmetric
around the mean in the middle, so skewness can be seen to be zero without any
calculations. The squared deviations have an average equal to {(y₁ − y₂)/2}², and their
fourth powers have an average equal to {(y₁ − y₂)/2}⁴, so b₂ is identically equal to 1.
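In symbols, with d = (y₁ − y₂)/2, we have m₂ = d² and m₄ = d⁴, so that

b₂ = m₄ / m₂² = d⁴ / (d²)² = 1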
To see how the upper limit behaves numerically, we can rewrite (1) as

|g₁| ≤ √(n − 1) − 1/√(n − 1)

so that as sample size n increases, the limit grows essentially as √(n − 1), while the
second term shrinks toward zero.²
2. … 0.1^k 0.9^(10−k), k = 0, . . . , 10. For both of these skewed counterexamples, mean, median,
and mode coincide at 1. To the statement that they coincide in a symmetric distribution (p. 108),
counterexamples are any symmetric distribution with an even number of modes.
2.2 An aside on coefficient of variation

The literature contains similar limits related to sample size on other sample statistics.
For example, the coefficient of variation is the ratio of standard deviation to mean, or
s/ȳ. Katsnelson and Kotz (1957) proved that so long as all yᵢ ≥ 0, the coefficient of
variation cannot exceed √(n − 1), a result mentioned earlier by Longley (1952). Cramér
(1946, 357) proved a less sharp result, and Kirby (1974) proved a less general result.
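A quick Mata check (a sketch, using the same divisor-n conventions as above) confirms
that the bound is attained when all observations but one are zero; for n = 3, the bound
is √(n − 1) = √2:

: y = (0 \ 0 \ 1)
: sqrt(mean((y :- mean(y)):^2)) / mean(y)
  1.414213562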
3 Confirmations

[R] summarize confirms that skewness g₁ and kurtosis b₂ are calculated in Stata
precisely as above. There are no corresponding Mata functions at the time of this writing,
but readers interested in these questions will want to start Mata to check their own
understanding. One example to check is
. sysuse auto, clear
(1978 Automobile Data)
. summarize mpg, detail
Mileage (mpg)
Percentiles Smallest
1% 12 12
5% 14 12
10% 14 14 Obs 74
25% 18 14 Sum of Wgt. 74
50% 20 Mean 21.2973
Largest Std. Dev. 5.785503
75% 25 34
90% 29 35 Variance 33.47205
95% 34 35 Skewness .9487176
99% 41 41 Kurtosis 3.975005
The detail option is needed to get skewness and kurtosis results from summarize.
We will not try to write a bulletproof skewness or kurtosis function in Mata, but we
will illustrate its use calculator-style. After entering Mata, a variable can be read into
a vector. It is helpful to have a vector of deviations from the mean to work on.
. mata :
mata (type end to exit)
: y = st_data(., "mpg")
: dev = y :- mean(y)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
.9487175965
: mean(dev:^4) / (mean(dev:^2)):^2
3.975004596
So those examples at least check out. Those unfamiliar with Mata might note that
the colon prefix, as in :- or :^, merely flags an elementwise operation. Thus, for example,
mean(y) returns a constant, which we wish to subtract from every element of a data
vector.
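For readers who want something reusable all the same, a minimal function along those
lines might look as follows (a sketch: the name sk() is ours, and deliberately no input
checking is done):

real rowvector sk(real colvector y)
{
        real colvector dev
        dev = y :- mean(y)
        // returns (g1, b2): moment-based skewness and kurtosis,
        // with divisor n throughout, as in the definitions above
        return((mean(dev:^3)/mean(dev:^2)^(3/2),
                mean(dev:^4)/mean(dev:^2)^2))
}

Typing sk(st_data(., "mpg")) should then reproduce the two values just shown.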
Mata may be used to check simple limiting cases. The minimal dataset (0, 1) may
be entered in deviation form. After doing so, we can just repeat earlier lines to calculate
g₁ and b₂:
: dev = (.5 \ -.5)
: mean(dev:^3) / (mean(dev:^2)):^(3/2)
0
: mean(dev:^4) / (mean(dev:^2)):^2
1
Mata may also be used to see how the limits of skewness and kurtosis vary with
sample size. We start out with a vector containing some sample sizes. We then calculate
the corresponding upper limits for skewness and kurtosis and tabulate the results. The
results are mapped to strings for tabulation with reasonable numbers of decimal places.
: n = (2::20\50\100\500\1000)
: skew = sqrt(n:-1) :- (1:/sqrt(n:-1))
: kurt = n :- 2 + (1:/(n:-1))
: strofreal(n), strofreal((skew, kurt), "%4.3f")
              1         2         3
   1          2     0.000     1.000
   2          3     0.707     1.500
   3          4     1.155     2.333
   4          5     1.500     3.250
   5          6     1.789     4.200
   6          7     2.041     5.167
   7          8     2.268     6.143
   8          9     2.475     7.125
   9         10     2.667     8.111
  10         11     2.846     9.100
  11         12     3.015    10.091
  12         13     3.175    11.083
  13         14     3.328    12.077
  14         15     3.474    13.071
  15         16     3.615    14.067
  16         17     3.750    15.062
  17         18     3.881    16.059
  18         19     4.007    17.056
  19         20     4.129    18.053
  20         50     6.857    48.020
  21        100     9.849    98.010
  22        500    22.294   498.002
  23       1000    31.575   998.001
The second and smaller term is 1/√(n − 1) in (1) and 1/(n − 1) in (2). Although
the calculation is, or should be, almost mental arithmetic, tabulating 1/(n − 1) shows
how quickly such a term shrinks so much that it can be neglected; 1/√(n − 1) shrinks
more slowly but is likewise soon negligible:
: strofreal(n), strofreal(1 :/ (n :- 1), "%4.3f")
1 2
1 2 1.000
2 3 0.500
3 4 0.333
4 5 0.250
5 6 0.200
6 7 0.167
7 8 0.143
8 9 0.125
9 10 0.111
10 11 0.100
11 12 0.091
12 13 0.083
13 14 0.077
14 15 0.071
15 16 0.067
16 17 0.062
17 18 0.059
18 19 0.056
19 20 0.053
20 50 0.020
21 100 0.010
22 500 0.002
23 1000 0.001
: end
These calculations are equally easy in Stata when you start with a variable containing
sample sizes.
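A minimal sketch (the variable n holding sample sizes, and the new variable names,
are ours):

clear
input n
    2
   50
 1000
end
generate double maxskew = sqrt(n - 1) - 1/sqrt(n - 1)
generate double maxkurt = n - 2 + 1/(n - 1)
list, noobs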
4 Explorations
In statistical science, we use an increasing variety of distributions. Even when closed-
form expressions exist for their moments, which is far from being universal, the need
to estimate parameters from sample data often arises. Thus the behavior of sample
moments and derived measures remains of key interest. Even if you do not customarily
use, for example, summarize, detail to get skewness and kurtosis, these measures may
well underlie your favorite test for normality.
The limits on sample skewness and kurtosis impart the possibility of bias whenever
the upper part of their sampling distributions is cut o by algebraic constraints. In
extreme cases, a sample may even deny the distribution that underlies it, because it is
impossible for any sample to reproduce the skewness and kurtosis of its parent.
These questions may be explored by simulation. Lognormal distributions offer simple
but striking examples. We call a distribution for y lognormal if ln y is normally
distributed. Those who prefer to call normal distributions by some other name (Gaussian,
notably) have not noticeably affected this terminology. Similarly, for some people the
terminology is backward, because a lognormal distribution is an exponentiated normal
distribution. Protest is futile while the term lognormal remains entrenched.
If ln y has mean μ and standard deviation σ, its skewness and kurtosis may be
defined in terms of ω = exp(σ²) (Johnson, Kotz, and Balakrishnan 1994, 212):

γ₁ = √(ω − 1) (ω + 2);   β₂ = ω⁴ + 2ω³ + 3ω² − 3

Differently put, skewness and kurtosis depend on σ alone; μ is a location parameter for
the lognormal as well as the normal.
[R] simulate already has a worked example of the simulation of lognormals, which
we can adapt slightly for the present purpose. The program there called lnsim merely
needs to be modified by adding results for skewness and kurtosis. As before, summarize,
detail is now the appropriate call. Before simulation, we (randomly, capriciously, or
otherwise) choose a seed for random-number generation:
. clear all
. program define lnsim, rclass
  1.         version 11.1
  2.         syntax [, obs(integer 1) mu(real 0) sigma(real 1)]
  3.         drop _all
  4.         set obs `obs'
  5.         tempvar z
  6.         gen `z' = exp(rnormal(`mu',`sigma'))
  7.         summarize `z', detail
  8.         return scalar mean = r(mean)
  9.         return scalar var = r(Var)
 10.         return scalar skew = r(skewness)
 11.         return scalar kurt = r(kurtosis)
 12. end
. set seed 2803
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50) mu(-3) sigma(7)
command: lnsim, obs(50) mu(-3) sigma(7)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
We are copying here the last example from help simulate, a lognormal for which
μ = −3, σ = 7. While a lognormal may seem a fairly well-behaved distribution, a quick
calculation shows that with these parameter choices, the skewness is about 8 × 10³¹ and
the kurtosis about 10⁸⁵, which no sample result can possibly come near! The previously
discussed limits are roughly 7 for skewness and 48 for kurtosis for this sample size. Here
are the Mata results:
. mata
mata (type end to exit)
: omega = exp(49)
: sqrt(omega - 1) * (omega + 2)
8.32999e+31
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
1.32348e+85
: n = 50
: sqrt(n:-1) :- (1:/sqrt(n:-1)), n :- 2 + (1:/(n:-1))
                 1             2
  1    6.857142857   48.02040816
: end
Sure enough, calculations and a graph (shown as figure 1) show that the limits of
roughly 7 and 48 are biting hard. Although many graph forms would work well, I here
choose qplot (Cox 2005) for quantile plots.
. summarize
Variable Obs Mean Std. Dev. Min Max
mean 10000 1.13e+09 1.11e+11 1.888205 1.11e+13
var 10000 6.20e+23 6.20e+25 42.43399 6.20e+27
skew 10000 6.118604 .9498364 2.382902 6.857143
kurt 10000 40.23354 10.06829 7.123528 48.02041
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.86)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2
[Figure 1 about here: quantile plots; y axes show skewness and kurtosis, x axes show
fraction of the data]
Figure 1. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = −3, σ = 7. Upper limits are shown by horizontal lines.
The natural comment is that the parameter choices in this example are a little
extreme, but the same phenomenon occurs to some extent even with milder choices.
With the default μ = 0, σ = 1, the skewness and kurtosis are less explosively high, but
still very high by many standards. We clear the data and repeat the simulation, but
this time we use the default values.
. clear
. simulate mean=r(mean) var=r(var) skew=r(skew) kurt=r(kurt), nodots
> reps(10000): lnsim, obs(50)
command: lnsim, obs(50)
mean: r(mean)
var: r(var)
skew: r(skew)
kurt: r(kurt)
Within Mata, we can recalculate the theoretical skewness and kurtosis. The limits
to sample skewness and kurtosis remain the same, given the same sample size n = 50.
. mata
mata (type end to exit)
: omega = exp(1)
: sqrt(omega - 1) * (omega + 2)
6.184877139
: omega^4 + 2 * omega^3 + 3*omega^2 - 3
113.9363922
: end
The problem is more insidious with these parameter values. The sampling
distributions look distinctly skewed (shown in figure 2) but are not so obviously truncated.
Only when the theoretical values for skewness and kurtosis are considered is it obvious
that the estimations are seriously biased.
. summarize
Variable Obs Mean Std. Dev. Min Max
mean 10000 1.657829 .3106537 .7871802 4.979507
var 10000 4.755659 7.43333 .3971136 457.0726
skew 10000 2.617803 1.092607 .467871 6.733598
kurt 10000 11.81865 7.996084 1.952879 46.89128
. qplot skew, yla(, ang(h)) name(g1, replace) ytitle(skewness) yli(6.86)
. qplot kurt, yla(, ang(h)) name(g2, replace) ytitle(kurtosis) yli(48.02)
. graph combine g1 g2
[Figure 2 about here: quantile plots; y axes show skewness and kurtosis, x axes show
fraction of the data]
Figure 2. Sampling distributions of skewness and kurtosis for samples of size 50 from a
lognormal with μ = 0, σ = 1. Upper limits are shown by horizontal lines.
Naturally, these are just token simulations, but a way ahead should be clear. If
you are using skewness or kurtosis with small (or even large) samples, simulation with
some parent distributions pertinent to your work is a good idea. The simulations of
Wallis, Matalas, and Slack (1974) in particular pointed to empirical limits to skewness,
which Kirby (1974) then established independently of previous work.³
5 Conclusions
This story, like any other, lies at the intersection of many larger stories. Many
statistically minded people make little or no use of skewness or kurtosis, and this paper may
have confirmed them in their prejudices. Some readers may prefer to see this as
another argument for using quantiles or order statistics for summarization (Gilchrist 2000;
David and Nagaraja 2003). Yet others may know that L-moments offer an alternative
approach (Hosking 1990; Hosking and Wallis 1997).
Arguably, the art of statistical analysis lies in choosing a model successful enough
to ensure that the exact form of the distribution of some response variable, conditional
on the predictors, is a matter of secondary importance. For example, in the simplest
regression situations, an error term for any really good model is likely to be fairly near
normally distributed, and thus not a source of worry. But authorities and critics differ
over how far that is a deductive consequence of some flavor of central limit theorem or
a naive article of faith that cries out for critical evaluation.
3. Connoisseurs of offbeat or irreverent titles might like to note some other papers by the same team:
Mandelbrot and Wallis (1968), Matalas and Wallis (1973), and Slack (1973).
More prosaically, it is a truism, but one worthy of assent, that researchers using
statistical methods should know the strengths and weaknesses of the various items in
the toolbox. Skewness and kurtosis, over a century old, may yet offer surprises, which
a wide range of Stata and Mata commands may help investigate.
6 Acknowledgments
This column benefits from interactions over moments shared with Ian S. Evans and over
L-moments shared with Patrick Royston.
7 References
Bass, H., and T. Y. Lam. 2007. Irving Kaplansky 1917–2006. Notices of the American
Mathematical Society 54: 1477–1493.

Cox, N. J. 2005. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.

Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton
University Press.

Dalén, J. 1987. Algebraic bounds on standardized sample moments. Statistics &
Probability Letters 5: 329–331.

David, H. A. 2001. First (?) occurrence of common terms in statistics and probability.
In Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F.
Edwards, 209–246. New York: Springer.

David, H. A., and H. N. Nagaraja. 2003. Order Statistics. Hoboken, NJ: Wiley.

Fiori, A. M., and M. Zenga. 2009. Karl Pearson and the origin of kurtosis. International
Statistical Review 77: 40–50.

Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver &
Boyd.

Gilchrist, W. G. 2000. Statistical Modelling with Quantile Functions. Boca Raton, FL:
Chapman & Hall/CRC.

Hald, A. 1981. T. N. Thiele's contribution to statistics. International Statistical Review
49: 1–20.

———. 1998. A History of Mathematical Statistics from 1750 to 1930. New York:
Wiley.

———. 2007. A History of Parametric Statistical Inference from Bernoulli to Fisher,
1713–1935. New York: Springer.
Hosking, J. R. M. 1990. L-moments: Analysis and estimation of distributions using
linear combinations of order statistics. Journal of the Royal Statistical Society, Series B
52: 105–124.

Hosking, J. R. M., and J. R. Wallis. 1997. Regional Frequency Analysis: An Approach
Based on L-Moments. Cambridge: Cambridge University Press.

Johnson, M. E., and V. W. Lowe. 1979. Bounds on the sample skewness and kurtosis.
Technometrics 21: 377–378.

Johnson, N. L., S. Kotz, and N. Balakrishnan. 1994. Continuous Univariate
Distributions, Vol. 1. New York: Wiley.

Kadison, R. V. 2008. Irving Kaplansky's role in mid-twentieth century functional
analysis. Notices of the American Mathematical Society 55: 216–225.

Kaplansky, I. 1945. A common error concerning kurtosis. Journal of the American
Statistical Association 40: 259.

Katsnelson, J., and S. Kotz. 1957. On the upper limits of some measures of variability.
Archiv für Meteorologie, Geophysik und Bioklimatologie, Series B 8: 103–107.

Kirby, W. 1974. Algebraic boundedness of sample statistics. Water Resources Research
10: 220–222.

———. 1981. Letter to the editor. Technometrics 23: 215.

Lauritzen, S. L. 2002. Thiele: Pioneer in Statistics. Oxford: Oxford University Press.

Longley, R. W. 1952. Measures of the variability of precipitation. Monthly Weather
Review 80: 111–117.

Mandelbrot, B. B., and J. R. Wallis. 1968. Noah, Joseph, and operational hydrology.
Water Resources Research 4: 909–918.

Matalas, N. C., and J. R. Wallis. 1973. Eureka! It fits a Pearson type 3 distribution.
Water Resources Research 9: 281–289.

Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course
in Statistics. Reading, MA: Addison-Wesley.

Pearson, K. 1916. Mathematical contributions to the theory of evolution. XIX: Second
supplement to a memoir on skew variation. Philosophical Transactions of the Royal
Society of London, Series A 216: 429–457.

Slack, J. R. 1973. I would if I could (self-denial by conditional models). Water Resources
Research 9: 247–249.

Stuart, A., and J. K. Ord. 1994. Kendall's Advanced Theory of Statistics. Volume 1:
Distribution Theory. 6th ed. London: Arnold.
Taylor, S. J. 2005. Asset Price Dynamics, Volatility, and Prediction. Princeton, NJ:
Princeton University Press.

Thiele, T. N. 1889. Forelæsninger over Almindelig Iagttagelseslære: Sandsynlighedsregning
og Mindste Kvadraters Methode. Copenhagen: C. A. Reitzel. English translation
included in Lauritzen 2002.

Walker, H. M. 1929. Studies in the History of Statistical Method: With Special
Reference to Certain Educational Problems. Baltimore: Williams & Wilkins.

Wallis, J. R., N. C. Matalas, and J. R. Slack. 1974. Just a moment! Water Resources
Research 10: 211–219.

Wilkins, J. E. 1944. A note on skewness and kurtosis. Annals of Mathematical Statistics
15: 333–335.

Yule, G. U. 1911. An Introduction to the Theory of Statistics. London: Griffin.
About the author
Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,
postings, FAQs, and programs to the Stata user community. He has also coauthored 15
commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor
of the Stata Journal.
The Stata Journal (2010)
10, Number 3, pp. 496–499
Stata tip 89: Estimating means and percentiles following
multiple imputation
Peter A. Lachenbruch
Oregon State University
Corvallis, OR
peter.lachenbruch@oregonstate.edu
1 Introduction
In a statistical analysis, I usually want some basic descriptive statistics such as the mean,
standard deviation, extremes, and percentiles. See, for example, Pagano and Gauvreau
(2000). Stata conveniently provides these descriptive statistics with the summarize
commands detail option. Alternatively, I can obtain percentiles with the centile
command. For example, with auto.dta, we have
. sysuse auto
(1978 Automobile Data)
. summarize price, detail
Price
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 74
25% 4195 3748 Sum of Wgt. 74
50% 5006.5 Mean 6165.257
Largest Std. Dev. 2949.496
75% 6342 13466
90% 11385 13594 Variance 8699526
95% 13466 14500 Skewness 1.653434
99% 15906 15906 Kurtosis 4.819188
However, if I have missing values, the summarize command is not supported by mi
estimate or by the user-written mim command (Royston 2004, 2005a,b, 2007; Royston,
Carlin, and White 2009).
2 Finding means and percentiles when missing values are
present
For a general multiple-imputation reference, see Stata 11 Multiple-Imputation Reference
Manual (2009). By recognizing that a regression with no independent variables estimates
the mean, I can use mi estimate: regress to get multiply imputed means. If I wish to
get multiply imputed quantiles, I can use mi estimate: qreg or mi estimate: sqreg
for this purpose.
© 2010 StataCorp LP st0205
I now create a dataset with missing values of price:
. clonevar newprice = price
. set seed 19670221
. replace newprice = . if runiform() < .4
(32 real changes made, 32 to missing)
The following commands were generated from the multiple-imputation dialog box. I
used 20 imputations. Before Stata 11, this could also be done with the user-written com-
mands ice and mim (Royston 2004, 2005a,b, 2007; Royston, Carlin, and White 2009).
. mi set mlong
. mi register imputed newprice
(32 m=0 obs. now marked as incomplete)
. mi register regular mpg trunk weight length
. mi impute regress newprice, add(20) rseed(3252010)
Univariate imputation Imputations = 20
Linear regression added = 20
Imputed: m=1 through m=20 updated = 0
Observations per m
Variable complete incomplete imputed total
newprice 42 32 32 74
(complete + incomplete = total; imputed is the minimum across m
of the number of filled in observations.)
. mi estimate: regress newprice
Multiple-imputation estimates Imputations = 20
Linear regression Number of obs = 74
Average RVI = 1.3880
Complete DF = 73
DF: min = 19.46
avg = 19.46
DF adjustment: Small sample max = 19.46
F( 0, .) = .
Within VCE type: OLS Prob > F = .
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
_cons 5693.489 454.9877 12.51 0.000 4742.721 6644.258
From this output, we see that the estimated mean is 5,693 with a standard error
of 455 (rounded up) compared with the complete data value of 6,165 with a standard
error of 343 (also rounded up). However, we do not have estimates of quantiles. This
could also have been done using mi estimate: mean newprice (the mean command is
near the bottom of the estimation command list for mi estimate).
We can apply the same principle using qreg. For the 10th percentile, type
. mi estimate: qreg newprice, quantile(10)
Multiple-imputation estimates Imputations = 20
.1 Quantile regression Number of obs = 74
Average RVI = 0.2901
Complete DF = 73
DF: min = 48.05
avg = 48.05
DF adjustment: Small sample max = 48.05
F( 0, .) = .
Prob > F = .
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
_cons 3495.635 708.54 4.93 0.000 2071.058 4920.212
Compare the value of 3,496 with the value of 3,895 from the full data. We can use
the simultaneous estimates command for the full set:
. mi estimate: sqreg newprice, quantiles(10 25 50 75 90) reps(20)
Multiple-imputation estimates Imputations = 20
Simultaneous quantile regression Number of obs = 74
Average RVI = 0.6085
Complete DF = 73
DF adjustment: Small sample DF: min = 23.19
avg = 26.97
max = 31.65
newprice Coef. Std. Err. t P>|t| [95% Conf. Interval]
q10
_cons 3495.635 533.5129 6.55 0.000 2408.434 4582.836
q25
_cons 4130.037 237.1932 17.41 0.000 3642.459 4617.614
q50
_cons 5200.238 441.294 11.78 0.000 4292.719 6107.757
q75
_cons 6620.232 778.8488 8.50 0.000 5025.49 8214.974
q90
_cons 8901.985 1417.022 6.28 0.000 5971.962 11832.01
3 Comments and cautions
The qreg command does not give the same result as the centile command when
you have complete data. This is because the centile command uses one observation,
while the qreg command uses a weighted combination of the observations. qreg will
have somewhat shorter confidence intervals, but with large datasets, the difference will
be small. A second caution is that comparing two medians can be tricky: the difference
of two medians is not the median difference of the distributions. I have found it useful
to use percentiles because there is a one-to-one relationship between percentiles if data
are transformed. In our case, there is plentiful evidence that price is not normally
distributed, so it would be good to look for a transformation and impute those values,
as sketched below.
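A sketch of that strategy, starting afresh from the unimputed data (the variable names
are mine, and the log transformation is only an assumption to be checked):

generate lnnewprice = ln(newprice)
mi set mlong
mi register imputed lnnewprice
mi impute regress lnnewprice, add(20)
mi estimate: regress lnnewprice

Percentiles estimated on the log scale can be exponentiated back, precisely because of
the one-to-one relationship just mentioned; means cannot be back-transformed so simply.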
This method of using regression commands without an independent variable can
provide estimates of quantities that otherwise would be difficult to obtain. It is much
faster than finding percentiles in each of the 20 imputed datasets and then combining
them with Rubin's rules, and it is much less onerous and error-prone.
4 Acknowledgment
This work was supported in part by a grant from the Cure JM Foundation.
References
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA:
Duxbury.
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.

———. 2005a. Multiple imputation of missing values: Update. Stata Journal 5:
188–201.

———. 2005b. Multiple imputation of missing values: Update of ice. Stata Journal 5:
527–536.

———. 2007. Multiple imputation of missing values: Further update of ice, with an
emphasis on interval censoring. Stata Journal 7: 445–464.

Royston, P., J. B. Carlin, and I. R. White. 2009. Multiple imputation of missing values:
New features for mim. Stata Journal 9: 252–264.
StataCorp. 2009. Stata 11 Multiple-Imputation Reference Manual. College Station, TX:
Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 500–502
Stata tip 90: Displaying partial results
Martin Weiss
Department of Economics
Tübingen University
Tübingen, Germany
martin.weiss@uni-tuebingen.de
Stata provides several features that allow users to display only part of their results.
If, for instance, you merely wanted to inspect the analysis of variance table returned by
anova or the coecients returned by regress, you could instruct Stata to omit other
results:
. sysuse auto
(1978 Automobile Data)
. regress weight length price, notable
Source SS df MS Number of obs = 74
F( 2, 71) = 385.80
Model 40378658.3 2 20189329.2 Prob > F = 0.0000
Residual 3715520.06 71 52331.2685 R-squared = 0.9157
Adj R-squared = 0.9134
Total 44094178.4 73 604029.841 Root MSE = 228.76
. regress weight length price, noheader
weight Coef. Std. Err. t P>|t| [95% Conf. Interval]
length 30.60949 1.333171 22.96 0.000 27.95122 33.26776
price .042138 .0100644 4.19 0.000 .0220702 .0622058
_cons -2992.848 232.1722 -12.89 0.000 -3455.786 -2529.91
Other examples of this type can be found in the help files for xtivreg for its
first-stage results and for xtmixed for its random-effects and fixed-effects tables. Generally,
to check whether Stata does provide such options, you would look for them under the
heading Reporting in the respective help files.
If you want to further customize output to your own needs, you could use the
estimates table command; see [R] estimates table. It is part of the comprehensive
estimates suite of commands that save and manipulate estimation results in Stata. See
[R] estimates or Baum (2006, sec. 4.4), where user-written alternatives are introduced
as well.
estimates table can provide several benefits to the user. For one, you can restrict
output to selected coefficients or equations with its keep() and drop() options.
© 2010 StataCorp LP st0206
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price)
Variable active
turn 35.214901
price .04624804
The original output of the estimation command itself is suppressed with quietly;
see [P] quietly. The keep() option also changes the order of the coefficients according
to your wishes. Additionally, you can elect to have Stata display results in a specific
format, for example, with fewer or more decimal places. The format can differ between
the elements that you choose to put into the table. In the case shown below, the
coefficients have three decimal places, while the standard error and the p-value have
two decimal places:
. sysuse auto
(1978 Automobile Data)
. quietly regress weight length price trunk turn
. estimates table, keep(turn price) b(%9.3fc) se(%9.2fc) p(%9.2fc)
Variable active
turn 35.215
11.65
0.00
price 0.046
0.01
0.00
legend: b/se/p
estimates table can also deal with models featuring multiple equations. If you
want to omit the coefficients for weight and the constant from every equation of your
sureg model, you could type
. sysuse auto
(1978 Automobile Data)
. qui sureg (price foreign weight length turn) (mpg foreign weight turn)
. estimates table, drop(weight _cons)
Variable active
price
foreign 3320.6181
length -78.75447
turn -144.37952
mpg
foreign -2.0756325
turn -.23516574
If your interest rests in the entire first equation and the constant from the second
equation, you would prepend coefficients with the equation names and separate the two
with a colon. The names of equations and coefficients are more accessible in Stata 11
with the coeflegend option, which is accepted by most estimation commands.
. sureg, coeflegend noheader
Coef. Legend
price
foreign 3320.618 _b[price:foreign]
weight 6.04491 _b[price:weight]
length -78.75447 _b[price:length]
turn -144.3795 _b[price:turn]
_cons 7450.657 _b[price:_cons]
mpg
foreign -2.075632 _b[mpg:foreign]
weight -.0055959 _b[mpg:weight]
turn -.2351657 _b[mpg:turn]
_cons 48.13492 _b[mpg:_cons]
. estimates table, keep(price: mpg:weight)
Variable active
price
foreign 3320.6181
weight 6.0449101
length -78.75447
turn -144.37952
_cons 7450.657
mpg
weight -.00559588
See help estimates table to learn more about the syntax.
Reference
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College
Station, TX: Stata Press.
The Stata Journal (2010)
10, Number 3, pp. 503–504
Stata tip 91: Putting unabbreviated varlists into local
macros
Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
n.j.cox@durham.ac.uk
Within interactive sessions, do-files, or programs, Stata users often want to work
with varlists, lists of variable names. For convenience, such lists may be stored in local
macros. Local macros can be directly defined for later use, as in

. local myx "weight length displacement"
. regress mpg `myx'
However, users frequently want to put longer lists of names into local macros, spelled
out one by one so that some later command can loop through the list defined by the
macro. Such varlists might be indirectly defined in abbreviations using the wildcard
characters * or ?. These characters can be used alone or can be combined to express
ranges. For example, specifying * catches all variables, *TX* might define all variables
for Texas, and *200? catches the years 2000–2009 used as suffixes.
In such cases, direct definition may not appeal, for all the obvious reasons: it is
tedious, time-consuming, and error-prone. It is also natural to wonder whether there is a
better method. You may already know that foreach (see [P] foreach) will take such
wildcarded varlists as arguments, which solves many problems; see the sketch below.
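For instance (a sketch, with an invented stub of variable names ending in 2000–2009):

foreach v of varlist pop200? {
        summarize `v'
}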
Many users know that pushing an abbreviated varlist through describe or ds is
one way to produce an unabbreviated varlist. Thus

. describe, varlist

is useful principally for its side effect of leaving all the variable names in r(varlist).
ds is typically used in a similar way, as is the user-written findname command (Cox
2010).
However, if the purpose is just to produce a local macro, the method of using
describe or ds has some small but definite disadvantages. First, the output of each
may not be desired, although it is easily suppressed with a quietly prefix. Second,
the modus operandi of both describe and ds is to leave saved results as r-class results.
Every now and again, users will be frustrated by this when they unwittingly overwrite
r-class results that they wished to use again. Third, there is some inefficiency in using
either command for this purpose, although you would usually have to work hard to
measure it.
© 2010 StataCorp LP dm0051
The solution here is to use the unab command; see [P] unab. unab has just one
restricted role in life, but that role is the solution here. unab is billed as a programming
command, but nothing stops it from being used interactively as a simple tool in data
management. The simple examples

. unab myvars : *
. unab TX : *TX*
. unab twenty : *200?

show how a local macro, named at birth (here as myvars, TX, and twenty), is defined
as the unabbreviated equivalent of each argument that follows a colon. Note that using
wildcard characters, although common, is certainly not compulsory.
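Once defined, such a macro can drive later commands; for example (a sketch):

unab myvars : *
foreach v of local myvars {
        quietly count if missing(`v')
        display "`v': " r(N) " missing value(s)"
}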
The word unabbreviate is undoubtedly ugly. The help and manual entry do also
use the much simpler and more attractive word expand, but the word expand was
clearly not available as a command name, given its use for another purpose.

This tip skates over all the fine details of unab, and only now does it mention the
sibling commands tsunab and fvunab, for use when you are using time-series operators
and factor variables. For more information, see [P] unab.
Reference
Cox, N. J. 2010. Speaking Stata: Finding variables. Stata Journal 10: 281–296.
The Stata Journal (2010)
10, Number 3, p. 505
Software Updates
st0140_2: fuzzy: A program for performing qualitative comparative analyses (QCA) in
Stata. K. C. Longest and S. Vaisey. Stata Journal 8: 452; 79–104.

A typo has been fixed in the setgen subcommand. Specifically, the drect extension
was not calculating values below the middle anchor correctly because of the typo.
This has been fixed and drect is now operating correctly. Note that no other aspects
of the setgen command have been altered.
© 2010 StataCorp LP up0029