Andrew Gelman[2]
7 Aug 2010
As a statistician, I was trained to think of randomized experimentation as representing the gold
standard of knowledge in the social sciences, and, despite having seen occasional arguments to
the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978): "To find out what happens when you change something, it is necessary to change it."[3]
At the same time, in my capacity as a social scientist, I've published many applied research papers, almost none of which have used experimental data.[4]
In the present article, I'll address the following questions:
1. Why do I agree with the consensus characterization of randomized experimentation as a
gold standard?
2. Given point 1 above, why does almost all my research use observational data?
In confronting these issues, we must consider some general issues in the strategy of social
science research. We also take from the psychology methods literature a more nuanced
perspective that considers several different aspects of research design and goes beyond the
simple division into randomized experiments, observational studies, and formal theory.
My practical advice is that we should be doing more field experiments but that simple
comparisons and regressions are, and should be, here to stay. We can always interpret such
analyses descriptively, and description is an important part of social science, both in its own right
and in providing foundations upon which to build formal models. Observational studies can also
be interpreted causally when attached to assumptions which can be attacked or defended on their
own terms.
Beyond this, the best statistical methods for experiments and observational studies are not so
different. There are historical and even statistical reasons why experimentalists have focused on
[1] For Field Experiments and their Critics, ed. Dawn Teele. Yale University Press.
[2] Department of Statistics and Department of Political Science, Columbia University, New York, gelman@stat.columbia.edu, http://www.stat.columbia.edu/gelman/ We thank Dawn Teele for soliciting this article and the Institute for Education Sciences for grant R305D090006-09A.
[3] Box, Hunter, and Hunter refer to changing something (that is, experimentation) without reference to randomization, a point to which we return later in this article.
[4] I will restrict my discussion to social science examples. Social scientists are often tempted to illustrate their ideas with examples from medical research. When it comes to medicine, though, we are, with rare exceptions, at best ignorant laypersons (in my case, not even reaching that level), and it is my impression that by reaching for medical analogies we are implicitly trying to borrow some of the scientific and cultural authority of that field for our own purposes. Evidence-based medicine is the subject of a large literature of its own (see, for example, Lau, Ioannidis, and Schmid, 1998).
implemented in particular places. A large part of social science involves performing individual
studies to learn useful bits of information, and another important aspect of social science is the
synthesis of these stylized facts into inferences about larger questions of understanding and
policy. Much of my own social science work has gone into trying to discover and quantify some
stylized facts about American politics, with various indirect policy implications which we
generally leave to others to explore.
Much of the discussion in the present volume is centered on the first of the approaches described
in the paragraph above, with the question framed as: "What is the best design for estimating a particular causal effect or causal structure of interest?" Typical quantitative research, though,
goes in the other direction, giving us estimates of the effects of incumbency on elections, or the
effects of some distribution plan on attitudes, or the effect of a particular intervention on political
behavior, and so forth. Randomized experiments are the gold standard for this second kind of
study, but additional model-based quantitative analysis (as well as historical understanding,
theory, and qualitative analysis) is needed to get to the larger questions.
It would be tempting to split the difference in the present debate and say something like the following: "Randomized experiments give you accurate estimates of things you don't care about; observational studies give biased estimates of things that actually matter." The difficulty with
this formulation is that inferences from observational studies also have to be extrapolated to
correspond to the ultimate policy goals. Observational studies can be applied in many more
settings than experiments but they address the same sort of specific micro-questions. For all the
reasons given by Gerber, Green, and Kaplan, I think experiments really are a better choice when
we can do them, and I applaud the expansion in recent years of field experiments in a wide
variety of areas in political science, economics, sociology, and psychology.
I recommend we learn some lessons from the experience of educational researchers, who have
been running large experiments for decades and realize that, first, experiments give you a degree
of confidence that you can rarely get from an observational analysis; and, second, that the
mapping from any research finding (experimental or observational) is in effect an ongoing
conversation among models, data, and analysis.
My immediate practical message here is that, before considering larger implications, it can be
useful to think of the direct and specific implications of any study. This is clear for simple
studies: any estimated treatment effect can also be considered as a descriptive finding, the
difference between averages in treatment and control groups, among items that are otherwise
similar (as defined by the protocols of the study).
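As a minimal sketch of this descriptive reading (using simulated data and invented variable names, not results from any actual study), the estimated average treatment effect is simply the difference between the two group means:

import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, size=n)                     # hypothetical 0/1 treatment indicator
y = 1.0 + 0.5 * treat + rng.normal(0.0, 1.0, size=n)   # hypothetical outcome

# Read descriptively: the difference between the mean outcomes
# of the treatment and control groups.
diff_in_means = y[treat == 1].mean() - y[treat == 0].mean()

# A conventional standard error for a difference of two independent means.
se = np.sqrt(y[treat == 1].var(ddof=1) / (treat == 1).sum()
             + y[treat == 0].var(ddof=1) / (treat == 0).sum())
print(f"difference in means: {diff_in_means:.2f} (standard error {se:.2f})")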
Direct summaries are also possible for more complicated designs. Consider, for example, the
Levitt (1997) study of policing and crime rates, which can be viewed as an estimate of the causal
effect of police on crime, using political cycles as an instrument, or, more directly, as an estimate
of the different outcomes that flow from the political cycle. Levitt found that, in cities with
mayoral election years, the number of police on the street goes up (compared to comparable city-years without an election) and the crime rate goes down. This descriptive summary gives me
a sense of how the findings might generalize to other potential interventions. In particular, there
is always some political pressure to keep crime rates down, so the question might arise as to how
one might translate that pressure into putting police on the street even in non-election years.
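To make the two readings of such a design concrete, here is a rough sketch of the reduced-form (descriptive) comparison and of a two-stage least squares calculation, coded directly with numpy; the city-year data, variable names, and coefficients below are invented for illustration and are not taken from Levitt's study:

import numpy as np

rng = np.random.default_rng(1)
n = 500                                          # hypothetical city-years

# Invented data-generating process, loosely in the spirit of the design:
# election years shift police levels, and an unobserved factor u affects
# both police hiring and crime.
election = rng.integers(0, 2, size=n)            # 1 = mayoral election year
u = rng.normal(0.0, 1.0, size=n)
police = 10.0 + 2.0 * election + u + rng.normal(0.0, 1.0, size=n)
crime = 50.0 - 1.5 * police + 3.0 * u + rng.normal(0.0, 1.0, size=n)

# Descriptive (reduced-form) reading: outcomes that flow from the electoral cycle.
print("crime, election minus non-election years:",
      crime[election == 1].mean() - crime[election == 0].mean())

# Causal (instrumental-variables) reading via two-stage least squares.
X1 = np.column_stack([np.ones(n), election])
police_hat = X1 @ np.linalg.lstsq(X1, police, rcond=None)[0]   # first stage
X2 = np.column_stack([np.ones(n), police_hat])
beta = np.linalg.lstsq(X2, crime, rcond=None)[0]               # second stage
print("2SLS estimate of the effect of police on crime:", beta[1])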
For another example, historical evidence reveals that when the death penalty has been
implemented in the United States, crime rates have typically gone down. Studies have found this
at the national and the state levels. However, it is difficult to confidently attribute such declines
to the death penalty itself, as capital punishment is typically implemented in conjunction with
other crime-fighting measures such as increased police presence and longer prison sentences
(Donohue and Wolfers, 2006).
In many ways I find it helpful to focus on descriptive data summaries, which can reveal the
limits of unstated model assumptions. Much of the discussion of the death penalty in the popular
press as well as in the scholarly literature (not in the Donohue and Wolfers paper, but elsewhere)
seems to go to the incentives of potential murderers. But the death penalty also affects the
incentives of judges, juries, prosecutors, and so forth. One of the arguments in favor of the death
penalty is that it sends a message that the justice system is serious about prosecuting murders.
This message is sent to the population at large, I think, not just to deter potential murderers but to
make clear that the system works. Conversely, one argument against the death penalty is that it
motivates prosecutors to go after innocent people, and to hide or deny exculpatory evidence.
Lots of incentives out there. One of the advantages of thinking like a statistician (looking at what the data say) is that it gives you more flexibility later to think like a social scientist and
consider the big picture. With a narrow focus on causal inference, you can lose this.
Research designs
I welcome the present exchange on the pluses and minuses of social science experimentation but
I worry that the discussion is focused on a limited perspective on the possibilities of statistical
design and analysis.
In particular, I am concerned that "experiment" is taken to be synonymous with "randomized experiment." Here are some well-known designs which have some aspect of randomization or
experimentation without being full randomized experiments:
Natural experiments. From the Vietnam draft lottery to zillions of regression discontinuity
studies, we have many examples where a treatment was assigned not by a researcher (thus, not an experiment under the usual statistical definition of the term) but by some rule-based
process that can be mathematically modeled.
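As one hedged sketch of the rule-based-assignment idea (simulated data with an invented running variable, cutoff, and bandwidth), a basic regression discontinuity analysis fits separate local regressions on each side of the threshold and compares the two fitted values at the cutoff:

import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Invented assignment rule: units with score >= 0 receive the treatment.
score = rng.uniform(-1.0, 1.0, size=n)
treated = (score >= 0).astype(float)
y = 2.0 + 1.0 * score + 0.8 * treated + rng.normal(0.0, 0.5, size=n)

def value_at_cutoff(x, yv):
    # Local linear fit; the intercept is the fitted value at score = 0.
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, yv, rcond=None)[0][0]

h = 0.25                                         # illustrative bandwidth
left = (score < 0) & (score > -h)
right = (score >= 0) & (score < h)
rd_estimate = value_at_cutoff(score[right], y[right]) - value_at_cutoff(score[left], y[left])
print("regression discontinuity estimate at the cutoff:", rd_estimate)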
Non-randomized experiments. From the other direction, if a researcher assigns treatments
deterministically, it is still an experiment even if not randomized. What is relevant in the subsequent statistical analysis is the set of factors influencing the selection (for a Bayesian treatment
of this issue, see Gelman et al., 2003, chapter 7).
Sampling. Textbook presentations often imply that the goal of causal inference is to learn about
the units who happen to be in the study. Invariably, though, these are a sample from a larger
population of interest. Even when the study appears to include the entire population, for
Conclusions
In social science as in science in general, formal experiments (treatment assignment plus
measurement) teach us things that we could never observe from passive observation or informal
experimentation. I applaud the increasing spread of field experiments and recommend that
modern statistical methods be used in their design and analysis. It is an inefficiency we cannot
afford, and which shows insufficient respect for the participants in our surveys and experiments,
to use the simplest statistical methods just because we can. Even in the unlikely setting that
treatments have been assigned randomly according to plan and that there are no measurement
problems, there is no need to limit ourselves to simple comparisons and estimates of average
treatment effects.
In areas of design, measurement, and analysis, field experimenters can learn much from
researchers in sample surveys (for the problem of extending from sample to population which is
often brought up as a concern with experiments) and from researchers in observational studies
(for the problem of modeling complex interactions and response surfaces). And observational
researchers (that would be most empirical social scientists, including me) should try our best to
model biases and to connect our work to solid experimental research wherever possible.
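In the spirit of that survey-sampling lesson (extending an estimate from sample to population), here is a minimal poststratification sketch; the demographic cells, per-respondent effects, and population shares are invented purely for illustration:

import pandas as pd

# Invented example: the sample over-represents one demographic cell.
sample = pd.DataFrame({
    "cell":   ["young", "young", "young", "old", "old"],
    "effect": [0.9, 1.1, 1.0, 0.3, 0.5],     # hypothetical per-respondent estimates
})

# Hypothetical population shares for the same cells.
population_share = {"young": 0.4, "old": 0.6}

# Cell-level means from the sample, reweighted to the population composition.
cell_means = sample.groupby("cell")["effect"].mean()
poststratified = sum(cell_means[c] * w for c, w in population_share.items())

print("unweighted sample mean:", sample["effect"].mean())
print("poststratified estimate:", poststratified)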
References
Altman, L. K. (1986). Who Goes First: The Story of Self-Experimentation in Medicine.
Berkeley: University of California Press.
Barnard, J., Frangakis, C. E., Hill, J. L., and Rubin, D. B. (2003). Principal stratification
approach to broken randomized experiments: A case study of school choice vouchers in New
York City. Journal of the American Statistical Association 98, 299-323.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. New York:
Wiley.
Dehejia, R. (2005). Does matching overcome LaLonde's critique of nonexperimental
estimators? A postscript. http://www.nber.org/%7Erdehejia/papers/postscript.pdf
Dehejia, R., and Wahba, S. (1999). Causal effects in non-experimental studies: Re-evaluating
the evaluation of training programs. Journal of the American Statistical Association 94, 1053-1062.
Donohue, J. J., and Wolfers, J. (2006). Uses and abuses of empirical evidence in the death
penalty debate. Stanford Law Review 58, 791-845.
Gelman, A. (2010). Causality and statistical learning. American Journal of Sociology, to appear.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis, second
edition. London: Chapman and Hall.
Gelman, A., and Little, T. C. (1997). Poststratification into many categories using hierarchical
logistic regression. Survey Methodology 23, 127-135.
Gelman, A., and Roberts, S. (2007). Weight loss, self-experimentation, and web trials: A
conversation. Chance 20 (4), 57-61.
Gelman, A., Stevens, M., and Chan, V. (2003). Regression models for decision making: A cost-benefit analysis of incentives in telephone surveys. Journal of Business and Economic Statistics 21, 213-225.
Gerber, A. S., Green, D. P., and Kaplan, E. H. (2010). The illusion of learning from
observational research. In the present volume.
Greenland, S. (2005). Multiple-bias modeling for analysis of observational data (with
discussion). Journal of the Royal Statistical Society A 168, 267-306.
Imai, K., King, G., and Stuart, E. A. (2010). Misunderstandings between experimentalists and
observationalists about causal inference. In the present volume.
Lau, J., Ioannidis, J. P. A., and Schmid, C. H. (1998). Summing up evidence: one answer is not
always enough. Lancet 351, 123-127.
Lax, J., and Phillips, J. (2009). How should we estimate public opinion in the states? American
Journal of Political Science 53.
Levitt, S. D. (1997). Using electoral cycles in police hiring to estimate the effect of police on
crime. American Economic Review 87, 270-290.
Roberts, S. (2004). Self-experimentation as a source of new ideas: Ten examples about sleep,
mood, health, and weight (with discussion). Behavioral and Brain Sciences 27, 227-288.
Rosenbaum, P. R. (2010). Design of Observational Studies. New York: Springer-Verlag.
Rubin, D. B. (1989). A new perspective on meta-analysis. In The Future of Meta-Analysis, eds.
K. W. Wachter and M. L. Straf. New York: Russell Sage Foundation.
Singer, E., Van Hoewyk, J., Gebler, N., Raghunathan, T., and McGonagle, K. (1999). The
effects of incentives on response rates in interviewer-mediated surveys. Journal of Official
Statistics 15.
Sloman, S. (2005). Causal Models: How People Think About the World and Its Alternatives.
Oxford University Press.
Smith, J., and Todd, P. E. (2005). Does matching overcome LaLonde's critique of
nonexperimental estimators? (with discussion). Journal of Econometrics 120, 305-375.