Invited Review

STATISTICAL ERRORS

Address for reprint requests and other correspondence: T. H. Holmes, Stanford Univ. School of Medicine, Dept. of Health Research and Policy, Division of Biostatistics, HRP Redwood Bldg., Rm. T160C, Stanford, CA 94305-5405 (E-mail: tholmes@stanford.edu). Copyright © 2004 the American Physiological Society (http://www.ajpendo.org).

STATISTICS OFFERS a widening range of powerful tools to help medical researchers attain full and accurate understanding of biological structure within their data. As research methodologies continue to advance in sophistication, these data are becoming increasingly rich and complex, thereby requiring increasingly thoughtful analysis. In response to these developments, funding sources, regulatory agencies, and journal editors are tightening their scrutiny of studies' statistical designs, analyses, and reporting. Compared with the past, statistical errors of any magnitude carry greater weight today in the competition among scientists for research grants and in the publication of research findings.

This essay presents a framework to help researchers and reviewers identify ten categories of statistical errors. The framework has two axes. The first axis recognizes two canonical types of statistical error: bias and imprecision. The second axis distinguishes five fundamental sources of statistical error: sampling, measurement, estimation, hypothesis testing, and reporting.

Bias is error of consistent tendency in direction. For example, an assay that consistently tends to underestimate concentrations of a metabolite is a biased assay. In contrast, imprecision is nondirectional noise. An imprecise assay may give true readings on average, but those readings vary in value among repeats.

These two types of error arise from at least five sources. Sampling error is error introduced by the process of selecting units (e.g., human subjects, mice, and so forth) from a population of interest. Once a unit has been drawn into the sample, error can arise in measurements made on that unit. The researcher often wishes to use these measured data to estimate parameters (e.g., mean, median) or test hypotheses pertaining to the population from which units were drawn. In doing so, statistical methods should be chosen to minimize statistical error in these estimates and tests. Finally, with the study completed, results need to be reported in a manner that ensures that readers are as fully and accurately informed as possible. These five sources are conceptually distinguishable but interrelated in practice. As will be illustrated, discussion of one source often has implications for others.

The objective of this essay is to illustrate each of these ten categories of statistical error and also to provide some constructive suggestions as to how these errors can be minimized or corrected. This essay is not intended to provide detailed training in statistical methods; rather, its purpose is to alert researchers to potential statistical pitfalls. Details on statistical methods that are pertinent to a researcher's particular application can be found in texts (some of which are cited in this essay) or from consultation with a biostatistician. In this essay, formal mathematical presentations have been kept to a minimum. Illustrative examples are drawn from review of a variety of original-research articles published in this journal, American Journal of Physiology-Endocrinology and Metabolism, and in Endocrinology. Specific articles in the literature are not used as examples, because the purpose of this essay is not to critique particular authors but instead to identify where improvement may be possible in this literature as a whole.

CATEGORY I: SAMPLING BIAS

Sampling begins by identifying a specific population of interest for study. Key to minimizing sampling bias is specificity, because declaring interest in a broad study population (e.g., humans with type 2 diabetes) essentially guarantees sampling bias. To illustrate, if one truly wished to characterize average peripheral-blood concentrations of some metabolite in humans with type 2 diabetes, then all such humans would need to be available for sampling and each assigned a known probability of being selected for study. Not only is this impractical, because such a list could never be compiled, but it is also unethical, because human subjects must be allowed to volunteer for potential entry into research rather than being
randomly selected. In fact, because voluntary participation is the foundation of ethical research in human subjects, research on human subjects invariably contains self-selection bias. Although self-selection bias may not be eliminated, it can be minimized. During study planning, the researcher simply declares interest in subjects of that type who are likely to volunteer and meet entry criteria; and, during study implementation, the researcher keeps dropout (and noncompliance) to a minimum. In contrast to research in human subjects, with animal models samples are often drawn from bred populations, so that care must be taken, and apparently often is in the endocrinology and metabolism (E and M) literature, to designate during study planning the specific breeding line from which a sampling will be obtained.

Sampling need not be from one population. A researcher may wish to study several populations simultaneously. This requires planning to avoid errors in parameter estimation and hypothesis testing. In particular, if more than one population is of interest, sampling should be stratified by design (13). Here we consider two potentially problematic types of mixed-population sampling that are found in the E and M literature: sampling from extremes and changes in protocol.

The E and M literature contains examples of deliberate sampling from the extremes. For example, a study in mice may compare the 10% lightest and the 10% heaviest in a batch. The limitation of this form of stratified sampling is that results obtained and conclusions drawn apply only to those extremes. Any interpolation may be biased if the extremes behave differently from intermediate levels, which can arise if, say, the lightest mice possess medical complications that are uncharacteristic of the remainder of the weight range. A better basic design would be to sample at equal intervals over the entire weight range. Atop this, one may concentrate sampling around any suspected within-range change points (e.g., a weight at which the slope on weight changes sharply or, say, average response shifts abruptly up or down).

Mixed-population samples may also result from in-course changes in study protocol. For example, an injection schedule may be changed between subjects enrolled early during a study and those enrolled later. Post hoc statistical analysis (e.g., regression analysis using covariates) may be attempted to disentangle the effects of the protocol change from any treatment differences, time trends, and the like that were of primary interest in the original design of the study. However, no guarantee can be made that statistical "correction" for in-course changes in protocol will be fully effective, especially when in-course protocol changes are strongly confounded with any effects of interest. The best strategy is to avoid in-course protocol changes altogether. Any adjustments in protocol should be made before the conduct of the full study, perhaps on the basis of results of preliminary studies.

Sampling bias can also arise in experimental studies in the form of an "inadequate control." This is an example of bias due to sampling from the wrong population. An ideal control possesses all characteristics of the treatment condition that impact response except for the putative mechanism under study. Such ideals can be very difficult to obtain in practice (e.g., with assays in the presence of cross-reactivity or in attempts to find "matching" controls in nonrandomized studies). One important safeguard against an inadequate control, which is not always followed in the E and M literature, is to observe treatment and control samples contemporaneously, not in sequence. For example, in a two-arm randomized trial, both arms should be running between the same start and end dates, rather than running one arm first and the other arm later. Contemporaneous implementation can control for a number of time-varying factors that may impact the outcome of interest, such as turnover in lab technicians and drift in instrument calibration.

Inadequate randomizations are a form of sampling bias, because the goal of randomization is to generate two (or more) samples at baseline that are, on average, as homogeneous as possible on all factors that may influence outcome. Simple randomizations alone do not guarantee this homogenization, especially when sample sizes are small, as is common in the E and M literature (see Category II, for example). The best corrective action is avoidance through use of more sophisticated randomization methods, such as stratified or adaptive designs (14). These include "permuted block" designs, in which the order of treatment assignments is randomly permuted within small to moderately sized blocks of subjects of shared baseline characteristics. A much less desirable correction, because it is remedial and not preventive, is to "statistically adjust" for baseline differences during parameter estimation by using, for example, appropriate multiple-regression methods. The formulation of such regression models should be specified in advance of any data review to avoid "data snooping" (Category VII). Despite their ease of use, these mathematical adjustments to reduce estimation bias can never fully substitute for starting all arms of an experiment at approximately equivalent compositions.

Sampling error of consistent tendency in direction not only can lead to biased estimates of "location" parameters (e.g., means) but also can result in bias in parameters of any type, including parameters of "dispersion" such as the variance. Sampling bias in dispersion can mislead understanding of how heterogeneous a population truly is. Sampling bias in the variance from a pilot study can yield sample-size estimates that are too large or too small for the confirmatory study. Overestimates of variances result in unnecessary losses in power, whereas underestimates increase the chances that one will falsely conclude that a treatment is actually effective, that a mean response truly increases over time, and the like. A fundamental tenet of minimizing bias in variance estimates is to sample across those particular units to which conclusions are to apply. For instance, suppose one wished to test for the response of patients to a possible medical therapy that acts on pancreatic cells. One could conduct this test on multiple pancreatic cells drawn from a single donor, but the resulting variance estimate would be an estimate of within-subject variance (wrong parameter) from a single subject (potentially idiosyncratic), so that one could not use these results to draw conclusions of general use to therapists. Instead, one would need to draw a sampling of cells from each of several donors, calculate the mean cell response for each donor, and then estimate the among-subject variance from these subjects' means.

CATEGORY II: SAMPLING IMPRECISION

Compared with other medical and scientific disciplines, many E and M research studies tend to use very small sample sizes, with sample sizes <10 being commonplace. Doubtless
practitioners use small samples because observations are difficult and/or expensive to obtain. Some may also argue that physiological characteristics are less variable than, say, clinical or epidemiological characteristics and that, therefore, smaller sample sizes are permissible in most laboratory E and M studies. Although these arguments carry some weight, they cannot justify the extremely small sample sizes that are so common in this literature. Just how extremely small these samples are can be illustrated as follows.

Suppose we randomly draw a sample of size n from a population and calculate the sample arithmetic mean. Then we repeat this process many times. Our estimate of the sample mean will vary to some degree among our different samples. Traditionally, one way to characterize this sampling imprecision in the mean is to estimate the standard deviation (i.e., standard error) of the samples' means. Fortunately, one can estimate the standard error from a single sample as s/√n, where s is the sample standard deviation. Notice how the standard error depends inversely on sample size n, which makes sense intuitively, as one would expect precision of the sample mean to increase as sample size increases. However, this relationship is not linear, in that the standard error is proportional to the inverse of the square root of the sample size. As illustrated in Fig. 1, sample sizes <10 tend to have large relative imprecision (1/√n) and thus yield unquestionably imprecise estimates of the mean.

The degree of sampling imprecision in the E and M literature could see major reductions if samples of size 15–30 were more widely employed or required. For example, Fig. 1 shows that a sample size of 5 has relative imprecision of ~0.45, whereas a sample size of 20 has relative imprecision of ~0.22, which is a reduction in imprecision of >50%.

Imprecision can also result if sampling is from more than one source population but is not stratified by design (13). Mixed-population sampling can arise during recruitment in a hemodynamic study in which, for instance, some subjects present with arterial hypertension and others do not. Because hypertension may affect hemodynamic response, an efficient design would deliberately stratify sampling on hypertension status. If stratified sampling were not designed into the study, some type of poststratification could be performed after data collection; however, this remedial approach risks large disparities in sample size among strata, so that some strata may have small sample sizes, which reduces precision and statistical power. Instead, in advance of implementation of a study, the researcher could identify any factors that may have a significant impact on outcome, form a parsimonious quantity of strata from these factors, and then sample from each stratum. If the researcher wishes to test statistically for differences among these strata, then sample size could be controlled by study administrators to be nearly equal across all strata.

CATEGORY III: MEASUREMENT BIAS

In contrast to sampling errors, E and M research is clearly devoted to minimizing errors of measurement, as evidenced by the large proportion of published METHODS sections that address issues of measurement (e.g., specimen handling and storage, preparation of solutions and cells, measurement of fluxes, and the like). Seemingly, enough detail is typically given to permit readers to reproduce all or almost all of the reported measurement methods.

Nevertheless, measurement bias can creep into E and M research data in subtle ways, often through some alteration that takes place over the course of the study. Changes in reagent manufacturers and turnover in device operators are just a couple of possible examples. All such alterations should be identified and tested to see if they introduce bias into measured outcomes. For instance, a switch to a new source of reagent should never be sequential; rather, if a switch is identified in advance, a series of test assays should be run simultaneously on both reagents, and the results compared. Key to testing an alteration is to recognize that one's intent is to prove that the two conditions (e.g., reagents, operators) yield equivalent, not different, results. As such, instead of standard two-sample testing (e.g., with Student's t-test), bioequivalency testing should be performed in consultation with a statistician. In these discussions, the researcher should come prepared to provide the statistician with quantitative upper and lower bounds that define the range of biological equivalence (e.g., ±1% difference between means of old and new reagents).
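The bioequivalency testing recommended above is commonly carried out as "two one-sided tests" (TOST), a specific procedure that the text leaves to the consulting statistician. The sketch below is a minimal large-sample illustration, not the article's own method: it assumes approximately normal assay readings, uses a normal approximation rather than an exact t-based test, and the ±1-unit equivalence margin (about 1% of a mean near 100) is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_equivalence(old, new, margin):
    """Two one-sided tests (TOST) for equivalence of two sample means,
    using a large-sample normal approximation.

    Equivalence at level alpha is declared when BOTH one-sided tests
    reject, i.e. when the returned P value (the larger of the two
    one-sided P values) falls below alpha:
      H01: mean(new) - mean(old) <= -margin
      H02: mean(new) - mean(old) >= +margin
    """
    diff = mean(new) - mean(old)
    # Unpooled standard error of the difference in sample means
    se = math.sqrt(stdev(old) ** 2 / len(old) + stdev(new) ** 2 / len(new))
    z = NormalDist()
    p_lower = 1.0 - z.cdf((diff + margin) / se)  # evidence against H01
    p_upper = z.cdf((diff - margin) / se)        # evidence against H02
    return max(p_lower, p_upper)

# Hypothetical paired assay runs on the old and new reagent sources
old = [99.8, 100.1, 100.0, 99.9, 100.2, 100.0, 99.95, 100.05]
new = [v + 0.1 for v in old]
p = tost_equivalence(old, new, margin=1.0)
print(f"TOST P = {p:.4g}; declare equivalence at alpha=0.05: {p < 0.05}")
```

Note the reversal of burden relative to Student's t-test: here a small P value supports sameness within the declared margin, whereas a nonsignificant t-test merely fails to detect a difference.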
committed a Type II error. The complementary probability to Type II error is statistical power. That is, if we denote Type II error by β, power is given by 1 − β. Power is the probability of rejecting the null hypothesis when the alternative is true. Type II error can grow from any source of imprecision that arises in the process that leads up to hypothesis testing. Thus Type II error increases with 1) smaller sample sizes (common in the E and M literature), 2) larger technical error (Category IV), or 3) less efficient estimators (Category VI).

Beyond these sources, which have already been discussed, the E and M literature contains other examples of lost opportunities to enhance statistical power. An example is the use of unequal sample sizes. In the E and M literature, one can find statements such as "5–25 specimens were measured per group" or "each group consisted of a minimum of 4 subjects." Broadly speaking, when comparing groups' means for a fixed total sample size, power is greatest when sample sizes are equal for those groups (e.g., 15).

As indicated in the previous section, multiple testing is common in the E and M literature, in part because studies typically take measurements on more than one parameter for each subject. As a result, most characterizations of subjects are multivariate, which makes sense scientifically given that most metabolic and endocrine processes involve multiple parameters. Despite this, instances of multivariate hypothesis testing, in which one conducts tests on multiple parameters simultaneously (e.g., Hotelling's T² test), are rare to nonexistent in the E and M literature. Instead, several univariate (single-parameter) tests are conducted. The shortcoming of this wholly univariate approach is that univariate hypothesis testing of multivariate processes can sometimes result in a failure to detect patterns of scientific interest. Rencher (10) provides an introduction to multivariate hypothesis testing that is reasonably accessible to nonstatisticians.

CATEGORY IX: REPORTING BIAS

A form of reporting bias that may be undetectable to even the most astute reader is the failure to report. Sometimes referred to as the "file drawer problem," results go unpublished because an author or editor chooses not to publish "statistically insignificant" findings. This generates bias in the literature. Although it is true that underpowered studies are poor candidates for funding on ethical (4) and statistical grounds, one may not wish to strictly equate statistical insignificance with biological insignificance. For instance, an otherwise fully powered study may be worthy even if an underpowered portion lacks a statistically significant result, because that result may nevertheless provide an empirical suggestion, admittedly weak, for hypothesis generation. Such a hypothesis could be more rigorously tested in subsequent, fully powered studies. This is not to say, however, that the importance of findings from underpowered research should ever be overstated. On the flip side are studies designed with ample power that generate negative findings. These too can serve an important role in the literature by directing subsequent research away from apparently fruitless avenues.

Statistically insignificant results can also give rise to another form of reporting bias. Suppose that, in testing a null hypothesis of no difference between two populations against an alternative claiming that their means differ, we obtain an attained significance level of P = 0.13. Authors are advised to avoid stating in conclusive terms that no difference exists between the populations, which is an overstatement (and thus a directional error). As an example, one may come across statements such as "incubation with the inhibitor failed to alter the rate of metabolic transport." A more accurate conclusion would be that "no change in rate was detected statistically." As mentioned above, if one is seeking statistical support for the assertion that two groups do not differ, then bioequivalency testing should be employed (see Category III).

CATEGORY X: REPORTING IMPRECISION

Like the file drawer problem discussed above, reporting imprecision arises when useful information is withheld from the reader, but here any resulting misunderstanding is not directional. An example from the E and M literature is when sample means are reported "±" a second value, but it is not indicated whether the second value is a sample standard deviation (estimate of the population standard deviation), a standard error (estimate of the standard deviation of the sample means), or some other measure of dispersion. Identification of which specific dispersion parameter has been estimated is necessary if other researchers are to use these estimates to plan sample sizes in future research. Another example found in this literature is when a result is presented along with an attained significance level (P value), but no clear information is provided on what statistical method was employed to obtain that result. This leaves the reader with no means of assessing the value of the result of a hypothesis test.

RESULTS AND DISCUSSION

This essay has illustrated ten categories of statistical errors. It is intended to highlight potential statistical pitfalls for those who are designing, implementing, and reviewing E and M research. Some accessible references to the statistical literature have been given throughout to provide an entry point for those who are interested in learning more about the topics discussed.

A review of the recent E and M literature finds that the most common potential cause of statistical error is small sample size (n ≤ 10). Very small sample sizes 1) result in parameter estimates that are unnecessarily imprecise, 2) enhance the potential for failed randomizations, 3) yield hypothesis testing that is underpowered, or 4) yield hypothesis testing that is biased because assumptions underlying the applied statistical methods could not be examined adequately. Missing data exacerbate this problem by further reducing and unbalancing sample size and, when "informative," missing data can introduce bias. E and M research could also see major improvements through more widespread use of stratified sampling designs, careful selection and implementation of experimental controls, use of adaptive and stratified randomization procedures, routine reporting of technical error, application of bioequivalency testing to studies designed to demonstrate equivalency, identification of statistically efficient means of data reduction through consultation with a statistician, more restrained use of one-tailed tests, greater controls on Type I and Type II errors for hypothesis tests across multiple parameters, fuller descriptions of statistical methods and methods of examining methodological assumptions, and more frequent reporting on amply powered but statistically nonsignificant results.
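The imprecision penalty of the very small samples flagged above can be restated with a one-line computation of the relative imprecision 1/√n of a sample mean, reproducing the n = 5 versus n = 20 comparison given in Category II:

```python
import math

def relative_imprecision(n):
    """Relative imprecision of a sample mean: the standard error s/sqrt(n)
    expressed in units of the sample standard deviation s, i.e. 1/sqrt(n)."""
    return 1.0 / math.sqrt(n)

for n in (5, 10, 20, 30):
    print(f"n = {n:2d}: relative imprecision = {relative_imprecision(n):.3f}")

# Moving from n = 5 to n = 20 exactly halves the imprecision of the mean,
# since sqrt(5/20) = 1/2
reduction = 1.0 - relative_imprecision(20) / relative_imprecision(5)
print(f"reduction: {100 * reduction:.0f}%")  # prints "reduction: 50%"
```

The printed values (0.447 at n = 5, 0.224 at n = 20) match the approximate figures of ~0.45 and ~0.22 quoted from Fig. 1.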
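One recommendation above, greater control of Type I error across hypothesis tests on multiple parameters, can be implemented with a step-up multiplicity adjustment such as Hochberg's sharper Bonferroni procedure (6). The sketch below is an illustrative implementation, not a method prescribed by the essay, and the four P values are hypothetical:

```python
def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up procedure for m hypothesis tests.

    Find the largest k (counted from the smallest P value) such that
    p_(k) <= alpha / (m - k + 1), and reject the k hypotheses with the
    smallest P values.  Controls the family-wise Type I error rate under
    independence or positive dependence of the tests.  Returns a list of
    reject/retain booleans aligned with the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending P
    reject_upto = 0
    for rank in range(m, 0, -1):  # step up from the largest P value
        i = order[rank - 1]
        if p_values[i] <= alpha / (m - rank + 1):
            reject_upto = rank
            break
    decisions = [False] * m
    for rank in range(reject_upto):
        decisions[order[rank]] = True
    return decisions

# Hypothetical raw P values from tests on four parameters per subject
p_vals = [0.012, 0.30, 0.001, 0.045]
print(hochberg_reject(p_vals))  # prints [True, False, True, False]
```

Note that the third-smallest P value (0.045) would be "significant" in isolation at alpha = 0.05 but is retained once the multiplicity of four tests is accounted for.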
Despite these limitations, the E and M literature appears to contain fewer statistical errors, in kind and quantity, than other medical literatures examined by the author. In part, this may be for lack of opportunity, as the range of statistical methods employed in the E and M literature is comparatively narrow. This is a lost opportunity, as processes studied in E and M are typically high dimensional and time varying (e.g., electrolyte concentration vs. time since initiation of dialysis), which makes this discipline ripe for greater application of multivariate (5, 10) and more flexible longitudinal (3) statistical designs and analyses. For example, generalized estimating equations (3) offer a highly flexible and robust method of analyzing repeated-measures data, especially when no data are missing; dimensionality-reduction techniques, such as principal-components analysis and cluster analysis (5), are useful for forming a few strata from several baseline characteristics; and these are just a few possibilities. The capacity for expanding the utility and sophistication of multivariate statistics in application to E and M research is tremendous.

REFERENCES

1. Carmer SG and Walker WM. Baby Bear's dilemma—a statistical tale. Agron J 74: 122–124, 1982.
2. Daniel WW. Applied Nonparametric Statistics (2nd ed.). Boston: PWS-Kent Publishing, 1990.
3. Diggle PJ, Liang K, and Zeger SL. Analysis of Longitudinal Data. Oxford: Clarendon, 1996.
4. Halpern SD, Karlawish JHT, and Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA 288: 358–362, 2002.
5. Hastie T, Tibshirani R, and Friedman J. The Elements of Statistical Learning. New York: Springer-Verlag, 2001.
6. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75: 800–802, 1988.
7. Little RJA and Rubin DB. Statistical Analysis with Missing Data. New York: Wiley, 1987.
8. Milliken GA and Johnson DE. Analysis of Messy Data, Volume I: Designed Experiments. New York: Chapman and Hall, 1992.
9. Neter J, Wasserman W, and Kutner MH. Applied Linear Statistical Models (2nd ed.). Homewood, IL: Irwin, 1985, p. 574.
10. Rencher AC. Methods of Multivariate Analysis. New York: Wiley, 1995.
11. Rice WR. Analyzing tables of statistical tests. Int J Org Evolution 43: 223–225, 1989.
12. Sokal RR and Rohlf FJ. Biometry (2nd ed.). New York: Freeman, 1981, p. 59.
13. Thompson SK. Sampling. New York: Wiley, 1992, p. 101–112.
14. Weir CJ and Lees KR. Comparison of stratification and adaptive methods for treatment allocation in an acute stroke clinical trial. Stat Med 22: 705–726, 2003.
15. Woodward M. Formulae for sample size, power and minimum detectable relative risk in medical studies. Statistician 41: 185–196, 1992.