General Editor
Neil J. Salkind
University of Kansas
Associate Editors
Bruce B. Frey
University of Kansas
Donald M. Dougherty
University of Texas Health Science Center at San Antonio
Managing Editors
Kristin Teasdale
Nathalie Hill-Kapturczak
All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the
publisher.
For information:
SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: order@sagepub.com
HA29.E525 2010
001.403—dc22 2010001779
Volume 1
List of Entries vii
Reader’s Guide xiii
About the Editors xix
Contributors xxi
Introduction xxix
Entries
A 1
B 57
C 111
D 321
E 399
F 471
G 519
Volume 2
List of Entries vii
Entries
H 561
I 589
J 655
K 663
L 681
M 745
N 869
O 949
P 985
Volume 3
List of Entries vii
Entries
Q 1149
R 1183
S 1295
T 1489
U 1583
V 1589
W 1611
Y 1645
Z 1653
Index 1675
List of Entries
Reader’s Guide

The Reader’s Guide is provided to assist readers in locating entries on related topics. It classifies entries into 28 general topical categories.
Contributors
Introduction

The Encyclopedia of Research Design is a collection of entries written by scholars in the field of research design, the discipline of how to plan and conduct empirical research, including the use of both quantitative and qualitative methods. A simple review of the Reader’s Guide shows how broad the field is, including such topics as descriptive statistics, a review of important mathematical concepts, a description and discussion of the importance of such professional organizations as the American Educational Research Association and the American Statistical Association, the role of ethics in research, important inferential procedures, and much more. Two topics are especially interesting and set this collection of volumes apart from similar works: (1) a review of important research articles that have been seminal in the field and have helped determine the direction of several ideas and (2) a review of popular tools (such as software) used to analyze results. This collection of more than 500 entries includes coverage of these topics and many more.

Process

The first step in the creation of the Encyclopedia of Research Design was the identification of people with the credentials and talent to perform certain tasks. The associate editors were selected on the basis of their experience and knowledge in the field of research design, and the managing editors were selected for their experience in helping manage large projects.

Once the editor selected the associate editors and managing editors, the next step was for the group to work collectively to identify and select a thorough and complete listing of the important topics in the area of research design. This was not easy because there are hundreds, if not thousands, of topics that could be selected. We tried to select those that are the most commonly used and that readers would find most useful and important to have defined and discussed. At the same time, we had to balance this selection with the knowledge that there is never enough room to include everything. Terms were included because of a general consensus that they were essential for such a work as this.

Once the initial list of possible entries was defined in draft form, it was revised to produce the set of categories and entries that you see in the Reader’s Guide at the beginning of Volume 1. We ultimately wanted topics that were sufficiently technical to enlighten the naïve but educated reader, and at the same time we wanted to avoid those topics from which only a small percentage of potential readers would benefit.

As with many other disciplines, there is a great deal of overlap in terminology within research design, as well as across related disciplines. For example, the two relatively simple entries titled Descriptive Statistics and Mean have much in common and necessarily cover some of the same content (using different words because they were written by different authors), but each entry also presents a different approach to understanding the general topic of central tendency. More advanced topics such as Analysis of Variance and Repeated Measures Design also have a significant number of conceptual ideas in common. It is impossible to avoid overlap because all disciplines contain terms and ideas that are similar, which is what gives a discipline its internal order—similar ideas and such belong together. Second, offering different language and explanations (but by no means identical words) provides a more comprehensive and varied view of important ideas. That is the strength in the diversity of the list of contributors in the
Encyclopedia of Research Design and why it is the perfect instrument for new learners, as well as experienced researchers, to learn about new topics or just brush up on new developments.

As we worked with the ongoing and revised drafts of entries, we recruited authors to write the various entries. Part of the process of asking scholars to participate included asking for their feedback as to what should be included in the entry and what related topics should be included. The contributors were given the draft entry list and were encouraged to suggest other ideas and directions to pursue. Many of their ideas and suggestions were useful, and often new entries were added to the list. Almost until the end of the entire process of writing entries, the entry list continued to be revised.

Once the list was finalized, we assigned each one a specific length of 1,000, 2,000, or 3,000 words. This decision was based on the importance of the topic and how many words we thought would be necessary to represent it adequately. For example, the entry titled Abstract was deemed to be relatively limited, whereas we encouraged the author of Reliability, an absolutely central topic to research design, to write at least 3,000 words. As with every other step in the development of the Encyclopedia of Research Design, we always allowed and encouraged authors to provide feedback about the entries they were writing and nearly always agreed to their requests.

The final step was to identify authors for each of the 513 entries. We used a variety of mechanisms, including asking advisory board members to identify scholars who were experts in a particular area; consulting professional journals, books, conference presentations, and other sources to identify authors familiar with a particular topic; and drawing on the personal contacts that the editorial board members have cultivated over many years of working in this field. If potential authors felt they could not participate, we asked them to suggest someone who might be interested in writing the entry.

Once authors were confirmed, they were given explicit directions and deadlines for completing and submitting their entry. As the entries were submitted, the editorial board of the encyclopedia read them and, if necessary, requested both format and substantive changes. Once a revised entry was resubmitted, it was once again reviewed and, when acceptable, passed on to production. Notably, most entries were acceptable on initial submission.

How to Use the Encyclopedia of Research Design

The Encyclopedia of Research Design is a collection of entries intended for the naïve, but educated, consumer. It is a reference tool for users who may be interested in learning more about a particular research technique (such as "control group" or "reliability"). Users can search the Encyclopedia for specific information or browse the Reader’s Guide to find topics of interest. For readers who want to pursue a topic further, each entry ends with both a list of related entries in the Encyclopedia and a set of further readings in the literature, often including online sources.

Acknowledgments

As editor, I have had the pleasure of working as the lead on several Sage encyclopedias. Because of the complex nature of the topics included in the Encyclopedia of Research Design and the associated difficulty writing about them, this was a particularly challenging project. Many of the topics are very complex and needed extra effort on the part of the editors to identify how they might be improved. Research design is a big and complex world, and it took a special effort to parse entries down to what is contained in these pages, so a great deal of thanks goes to Dr. Bruce Frey from the University of Kansas and Dr. Donald M. Dougherty from the University of Texas Health Science Center at San Antonio for their diligence, flexibility, talent, and passion for seeing this three-volume set attain a very high standard.

Our editors at Sage, Jim Brace-Thompson, senior acquisitions editor, and Rolf Janke, vice president and publisher, SAGE Reference, do what the best editors do: provide guidance and support and leave us alone to do what we do best while they keep an eye on the entire process to be sure we do not go astray.

Kristin Teasdale and Nathalie Hill-Kapturczak acted as managing editors and with great dedication and professional skill managed to find authors, see to it that documents were submitted on time,
and track progress through the use of Sage’s electronic tools. It is not an overstatement that this project would not have gotten done on time or run as smoothly without their assistance.

The real behind-the-scenes heroes and heroines of this entire project are the editorial and production people at Sage who made sure that all the is were dotted and the (Student) ts crossed. Among them is Carole Mauer, senior developmental editor, who has been the most gentle of supportive and constructive colleagues, always had the answers to countless questions, and guided us in the right directions. With Carole’s grace and optimism, we were ready to do what was best for the project, even when the additional work made considerable demands. Other people we would like to sincerely thank are Michele Thompson, Leticia M. Gutierrez, Laura Notton, Kate Schroeder, Bonnie Freeman, Liann Lech, and Sheree Van Vreede, all of whom played a major role in seeing this set of volumes come to fruition. It is no exaggeration that what you see here would not have been possible without their hard work.

Of course this encyclopedia would not exist without the unselfish contributions of the many authors. They understood the task at hand was to introduce educated readers such as you to this very broad field of research design. Without exception, they performed this task admirably. While reviewing submissions, we editors would often find superb explications of difficult topics, and we became ever more pleased to be a part of this important project.

And as always, we want to dedicate this encyclopedia to our loved ones—partners, spouses, and children who are always there for us and help us see the forest through the trees, the bigger picture that makes good things great.

Neil J. Salkind, Editor
University of Kansas

Bruce B. Frey, Associate Editor
University of Kansas

Donald M. Dougherty, Associate Editor
University of Texas Health Science Center at San Antonio
A
ABSTRACT

An abstract is a summary of a research or a review article and includes critical information, including a complete reference to the work, its purpose, methods used, conclusions reached, and implications. For example, here is one such abstract from the Journal of Black Psychology authored by Timo Wandert from the University of Mainz, published in 2009 and titled "Black German Identities: Validating the Multidimensional Inventory of Black Identity." All the above-mentioned elements are included in this abstract: the purpose, a brief review of important ideas to put the purpose into a context, the methods, the results, and the implications of the results.

This study examines the reliability and validity of a German version of the Multidimensional Inventory of Black Identity (MIBI) in a sample of 170 Black Germans. The internal consistencies of all subscales are at least moderate. The factorial structure of the MIBI, as assessed by principal component analysis, corresponds to a high degree to the supposed underlying dimensional structure. Construct validity was examined by analyzing (a) the intercorrelations of the MIBI subscales and (b) the correlations of the subscales with external variables. Predictive validity was assessed by analyzing the correlations of three MIBI subscales with the level of intra-racial contact. All but one prediction concerning the correlations of the subscales could be confirmed, suggesting high validity. No statistically significant negative association was observed between the Black nationalist and assimilationist ideology subscales. This result is discussed as a consequence of the specific social context Black Germans live in and is not considered to lower the MIBI’s validity. Observed differences in mean scores to earlier studies of African American racial identity are also discussed.

Abstracts serve several purposes. First, they provide a quick summary of the complete publication that is easily accessible in the print form of the article or through electronic means. Second, they become the target for search tools and often provide an initial screening when a researcher is doing a literature review. It is for this reason that article titles and abstracts contain key words that one would look for when searching for such information. Third, they become the content of reviews or collections of abstracts such as PsycINFO, published by the American Psychological Association (APA). Finally, abstracts sometimes are used as stand-ins for the actual papers when there are time or space limitations, such as at professional meetings. In this instance, abstracts are usually presented as posters in presentation sessions.

Most scholarly publications have very clear guidelines as to how abstracts are to be created, prepared, and used. For example, the APA, in the
Publication Manual of the American Psychological Association, provides information regarding the elements of a good abstract and suggestions for creating one. While guidelines for abstracts of scholarly publications (such as print and electronic journals) tend to differ in the specifics, the following four guidelines apply generally:

1. The abstract should be short. For example, APA limits abstracts to 250 words, and MEDLINE limits them to no more than 400 words. The abstract should be submitted as a separate page.

2. The abstract should appear as one unindented paragraph.

3. The abstract should begin with an introduction and then move to a very brief summary of the method, results, and discussion.

4. After the abstract, five related keywords should be listed. These keywords help make electronic searches efficient and successful.

With the advent of electronic means of creating and sharing abstracts, visual and graphical abstracts have become popular, especially in disciplines in which they contribute to greater understanding by the reader.

Neil J. Salkind

See also American Psychological Association Style; Ethics in the Research Process; Literature Review

Further Readings

American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Fletcher, R. H. (1988). Writing an abstract. Journal of General Internal Medicine, 3(6), 607–609.
Luhn, H. P. (1999). The automatic creation of literature abstracts. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization (pp. 15–21). Cambridge: MIT Press.

ACCURACY IN PARAMETER ESTIMATION

Accuracy in parameter estimation (AIPE) is an approach to sample size planning concerned with obtaining narrow confidence intervals. The standard AIPE approach yields the necessary sample size so that the expected width of a confidence interval will be sufficiently narrow. Because confidence interval width is a random variable based on data, the actual confidence interval width will almost certainly differ from (i.e., be larger or smaller than) the expected confidence interval width. A modified AIPE approach allows sample size to be planned so that there will be some desired degree of assurance that the observed confidence interval will be sufficiently narrow. The standard AIPE approach addresses questions such as what size sample is necessary so that the expected width of the 95% confidence interval will be no larger than ω, where ω is the desired confidence interval width. However, the modified AIPE approach addresses questions such as what size sample is necessary so that there is γ100% assurance that the 95% confidence interval width will be no larger than ω, where γ is the desired value of the assurance parameter.

Confidence interval width is a way to operationalize the accuracy of the parameter estimate, holding everything else constant. Provided appropriate assumptions are met, a confidence interval consists of a set of plausible parameter values obtained from applying the confidence interval procedure to data, where the procedure yields intervals such that (1 − α)100% of them will correctly bracket the population parameter of interest, where 1 − α is the desired confidence interval coverage. Holding everything else constant, as the width of the confidence interval decreases, the range of plausible parameter values is narrowed, and thus more values can be excluded as implausible values for the parameter. In general, whenever a parameter value is of interest, not only should the point estimate itself be reported, but so too should the corresponding confidence interval for the parameter, as it is known that a point estimate almost certainly differs from the population value and does not give an indication of the degree of uncertainty with which the parameter has been estimated. Wide confidence intervals, which illustrate the uncertainty with which the parameter has been estimated, are generally undesirable. Because the direction, magnitude, and accuracy of an effect can be simultaneously evaluated with confidence intervals, it has been argued that planning a research study in an effort to obtain narrow confidence
intervals is an ideal way to improve research findings and increase the cumulative knowledge of a discipline.

Operationalizing accuracy as the observed confidence interval width is not new. In fact, writing in the 1930s, Jerzy Neyman used the confidence interval width as a measure of accuracy in his seminal work on the theory of confidence intervals, writing that the accuracy of estimation corresponding to a fixed value of 1 − α may be measured by the length of the confidence interval. Statistically, accuracy is defined as the square root of the mean square error, which is a function of precision and bias. When the bias is zero, accuracy and precision are equivalent concepts. The AIPE approach is so named because its goal is to improve the overall accuracy of estimates, and not just the precision or bias alone. Precision can often be improved at the expense of bias, which may or may not improve the accuracy. Thus, so as not to obtain estimates that are sufficiently precise but possibly more biased, the AIPE approach sets its goal of obtaining sufficiently accurate parameter estimates as operationalized by the width of the corresponding (1 − α)100% confidence interval.

Basing important decisions on the results of research studies is often the goal of the study. However, when an effect has a corresponding confidence interval that is wide, decisions based on such effect sizes need to be made with caution. It is entirely possible for a point estimate to be impressive according to some standard, but for the confidence limits to illustrate that the estimate is not very accurate. For example, a commonly used set of guidelines for the standardized mean difference in the behavioral, educational, and social sciences is that population standardized effect sizes of 0.2, 0.5, and 0.8 are regarded as small, medium, and large effects, respectively, following conventions established by Jacob Cohen beginning in the 1960s. Suppose that the population standardized mean difference is thought to be medium (i.e., 0.50), based on an existing theory and a review of the relevant literature. Further suppose that a researcher planned the sample size so that there would be a statistical power of .80 when the Type I error rate is set to .05, which yields a necessary sample size of 64 participants per group (128 total). In such a situation, supposing that the observed standardized mean difference was in fact exactly 0.50, the 95% confidence interval has a lower and upper limit of .147 and .851, respectively. Thus, the lower confidence limit is smaller than "small" and the upper confidence limit is larger than "large." Although there was enough statistical power (recall that sample size was planned so that power = .80, and indeed, the null hypothesis of no group mean difference was rejected, p = .005), in this case sample size was not sufficient from an accuracy perspective, as illustrated by the wide confidence interval.

Historically, confidence intervals were not often reported in applied research in the behavioral, educational, and social sciences, as well as in many other domains. Cohen once suggested researchers failed to report confidence intervals because their widths were "embarrassingly large." In an effort to plan sample size so as not to obtain confidence intervals that are embarrassingly large, and in fact to plan sample size so that confidence intervals are sufficiently narrow, the AIPE approach should be considered. The argument for planning sample size from an AIPE perspective is based on the desire to report point estimates and confidence intervals instead of or in addition to the results of null hypothesis significance tests. This paradigmatic shift has led to AIPE approaches to sample size planning becoming more useful than was previously the case, given the emphasis now placed on confidence intervals instead of a narrow focus on the results of null hypothesis significance tests.

Whereas the power analytic approach to sample size planning has as its goal the rejection of a false null hypothesis with some specified probability, the AIPE approach is not concerned with whether some specified null value can be rejected (i.e., is the null value outside the confidence interval limits?), making it fundamentally different from the power analytic approach. Not surprisingly, the AIPE and power analytic approaches can suggest very different values for sample size, depending on the particular goals (e.g., desired width or desired power) specified. The AIPE approach to sample size planning is able to simultaneously consider the direction of an effect (which is what the null hypothesis significance test provides), its magnitude (best and worst case scenarios based on the values of the confidence limits), and the accuracy with which the population parameter was estimated (via the width of the confidence interval).
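The wide interval in the standardized mean difference example above can be reproduced, approximately, in a few lines of Python. This is an illustrative sketch rather than the entry's own method: it uses the common large-sample (normal-approximation) variance of the standardized mean difference, whereas the exact limits of .147 and .851 quoted above come from inverting the noncentral t distribution; the function name d_confidence_interval is ours.

```python
# Approximate 95% CI for a standardized mean difference (Cohen's d),
# using the common large-sample variance approximation
#   Var(d) ~= (n1 + n2)/(n1*n2) + d**2 / (2*(n1 + n2)).
# Sketch of the entry's example: d = 0.50 with 64 participants per group.
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, confidence=0.95):
    """Normal-approximation CI for an observed standardized mean difference."""
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # 1.96 for a 95% interval
    half_width = z * var_d**0.5
    return d - half_width, d + half_width

lower, upper = d_confidence_interval(0.50, 64, 64)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # roughly [0.148, 0.852]
print(f"width: {upper - lower:.3f}")
```

Even though 64 participants per group yields power of .80 to detect d = 0.50, the interval runs from below "small" (0.2) to above "large" (0.8), which is precisely the accuracy problem described above; the standard AIPE approach would instead solve for the sample size that makes this width no larger than a chosen ω.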
The term accuracy in parameter estimation (and the acronym AIPE) was first used by Ken Kelley and Scott E. Maxwell in 2003 with an argument given for its widespread use in lieu of or in addition to the power analytic approach. However, the general idea of AIPE has appeared in the literature sporadically since at least the 1960s. James Algina, as well as Stephen Olejnik and Michael R. Jiroutek, contributed to similar approaches. The goal of the approach suggested by Algina is to have an estimate sufficiently close to its corresponding population value, and the goal suggested by Olejnik and Jiroutek is to simultaneously have a sufficient degree of power and confidence interval narrowness. Currently, the most extensive program for planning sample size from the AIPE perspective is the MBESS package for R.

Ken Kelley

See also Confidence Intervals; Effect Size, Measures of; Power Analysis; Sample Size Planning

Further Readings

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32.

ACTION RESEARCH

Action research differs from conventional research methods in three fundamental ways. First, its primary goal is social change. Second, members of the study sample accept responsibility for helping resolve issues that are the focus of the inquiry. Third, relationships between researcher and study participants are more complex and less hierarchical. Most often, action research is viewed as a process of linking theory and practice in which scholar-practitioners explore a social situation by posing a question, collecting data, and testing a hypothesis through several cycles of action. The most common purpose of action research is to guide practitioners as they seek to uncover answers to complex problems in disciplines such as education, health sciences, sociology, or anthropology. Action research is typically underpinned by ideals of social justice and an ethical commitment to improve the quality of life in particular social settings. Accordingly, the goals of action research are as unique to each study as participants’ contexts; both determine the type of data-gathering methods that will be used. Because action research can embrace natural and social science methods of scholarship, its use is not limited to either positivist or heuristic approaches. It is, as John Dewey pointed out, an attitude of inquiry rather than a single research methodology.

This entry presents a brief history of action research, describes several critical elements of action research, and offers cases for and against the use of action research.

Historical Development

Although not officially credited with authoring the term action research, Dewey proposed five phases of inquiry that parallel several of the most commonly used action research processes, including curiosity, intellectualization, hypothesizing, reasoning, and testing hypotheses through action. This recursive process in scientific investigation is essential to most contemporary action research models. The work of Kurt Lewin is often considered seminal in establishing the credibility of action research. In anthropology, William Foote Whyte conducted early inquiry using an action research process similar to Lewin’s. In health sciences, Reginald Revans renamed the process action learning while observing a process of social action among nurses and coal miners in the United Kingdom. In the area of emancipatory education, Paulo Freire is acknowledged as one of the first to undertake action research characterized by participant engagement in sociopolitical activities.

The hub of the action research movement shifted from North America to the United Kingdom in the late 1960s. Lawrence Stenhouse was instrumental in revitalizing its use among health care practitioners. John Elliott championed a form of educational action research in which the researcher-as-participant
takes increased responsibility for individual and collective changes in teaching practice and school improvement. Subsequently, the 1980s were witness to a surge of action research activity centered in Australia. Wilfred Carr and Stephen Kemmis authored Becoming Critical, and Kemmis and Robin McTaggart’s The Action Research Planner informed much educational inquiry. Carl Glickman is often credited with a renewed North American interest in action research in the early 1990s. He advocated action research as a way to examine and implement principles of democratic governance; this interest coincided with an increasing North American appetite for postmodern methodologies such as personal inquiry and biographical narrative.

Characteristics

Reflection

Focused reflection is a key element of most action research models. One activity essential to reflection is referred to as metacognition, or thinking about thinking. Researchers ruminate on the research process even as they are performing the very tasks that have generated the problem and, during their work, derive solutions from an examination of data. Another aspect of reflection is circumspection, or learning-in-practice. Action research practitioners typically proceed through various types of reflection, including those that focus on technical proficiencies, theoretical assumptions, or moral or ethical issues. These stages are also described as learning for practice, learning in practice, and learning from practice. Learning for practice involves the inquiry-based activities of readiness, awareness, and training engaged in collaboratively by the researcher and participants. Learning in practice includes planning and implementing intervention strategies and gathering and making sense of relevant evidence. Learning from practice includes culminating activities and planning future research. Reflection is integral to the habits of thinking inherent in scientific explorations that trigger explicit action for change.

Iterancy

Cycles of planning, acting, observing, and reflecting recur during an action research study. Iterancy, as a unique and critical characteristic, can be attributed to Lewin’s early conceptualization of action research as involving hypothesizing, planning, fact-finding (reconnaissance), execution, and analysis (see Figure 1). These iterations comprise internal and external repetition referred to as learning loops, during which participants engage in successive cycles of collecting and making sense of data until agreement is reached on appropriate action. The result is some form of human activity or tangible document that is immediately applicable in participants’ daily lives and instrumental in informing subsequent cycles of inquiry.

Collaboration

Action research methods have evolved to include collaborative and negotiatory activities among various participants in the inquiry. Divisions between the roles of researchers and participants are frequently permeable; researchers are often defined as both full participants and external experts who engage in ongoing consultation with participants. Criteria for collaboration include evident structures for sharing power and voice; opportunities to construct common language and understanding among partners; an explicit code of ethics and principles; agreement regarding shared ownership of data; provisions for sustainable community involvement and action; and consideration of generative methods to assess the process’s effectiveness.

The collaborative partnerships characteristic of action research serve several purposes. The first is to integrate into the research several tenets of evidence-based responsibility rather than documentation-based accountability. Research undertaken for purposes of accountability and institutional justification often enforces an external locus of control. Conversely, responsibility-based research is characterized by job-embedded, sustained opportunities for participants’ involvement in change; an emphasis on the demonstration of professional learning; and frequent, authentic recognition of practitioner growth.
Figure 1. Lewin’s model of action research: a general idea leads to reconnaissance of goals and means and a general plan; each numbered action step is followed by a decision about the next step and reconnaissance of results, and reconnaissance of results might indicate a change in the general plan.
with participants. In a complete participant role, the identity of the researcher is neither concealed nor disguised. The researchers' and participants' goals are synonymous; the importance of participants' voice heightens the necessity that issues of anonymity and confidentiality are the subject of ongoing negotiation. The participant observer role encourages the action researcher to negotiate levels of accessibility and membership in the participant group, a process that can limit interpretation of events and perceptions. However, results derived from this type of involvement may be granted a greater degree of authenticity if participants are provided the opportunity to review and revise perceptions through a member check of observations and anecdotal data. A third possible role in action research is the observer participant, in which the researcher does not attempt to experience the activities and events under observation but negotiates permission to make thorough and detailed notes in a fairly detached manner. A fourth role, less common to action research, is that of the complete observer, in which the researcher adopts passive involvement in activities or events, and a deliberate—often physical—barrier is placed between the researcher and the participant in order to minimize contamination. These categories only hint at the complexity of roles in action research.

The learning by the participants and by the researcher is rarely mutually exclusive; moreover, in practice, action researchers are most often full participants.

Intertwined purpose and the permeability of roles between the researcher and the participant are frequently elements of action research studies with agendas of emancipation and social justice. Although this process is typically one in which the external researcher is expected and required to provide some degree of expertise or advice, participants—sometimes referred to as internal researchers—are encouraged to make sense of, and apply, a wide variety of professional learning that can be translated into ethical action. Studies such as these contribute to understanding the human condition, incorporate lived experience, give public voice to experience, and expand perspectives of participant and researcher alike.

A Case for and Against Action Research

Ontological and epistemological divisions between qualitative and quantitative approaches to research abound, particularly in debates about the credibility of action research studies. On one hand, quantitative research is criticized for drawing conclusions that are often pragmatically irrelevant; employing
methods that are overly mechanistic, impersonal, and socially insensitive; compartmentalizing, and thereby minimizing, through hypothetico-deductive schemes, the complex, multidimensional nature of human experiences; encouraging research as an isolationist and detached activity void of, and impervious to, interdependence and collaboration; and forwarding claims of objectivity that are simply not fulfilled.

On the other hand, qualitative aspects of action research are seen as quintessentially unreliable forms of inquiry because the number of uncontrolled contextual variables offers little certainty of causation. Interpretive methodologies such as narration and autobiography can yield data that are unverifiable and potentially deceptive. Certain forms of researcher involvement have been noted for their potential to unduly influence data, while some critiques contend that Hawthorne or halo effects—rather than authentic social reality—are responsible for the findings of naturalist studies.

Increased participation in action research in the latter part of the 20th century paralleled a growing demand for more pragmatic research in all fields of social science. For some humanities practitioners, traditional research was becoming irrelevant, and their social concerns and challenges were not being adequately addressed in the findings of positivist studies. They found in action research a method that allowed them to move further into other research paradigms or to commit to research that was clearly bimethodological. Increased opportunities in social policy development meant that practitioners could play a more important role in conducting the type of research that would lead to clearer understanding of social science phenomena. Further sociopolitical impetus for increased use of action research derived from the politicizing effects of the accountability movement and from an increasing solidarity in humanities professions in response to growing public scrutiny.

The emergence of action research illustrates a shift in focus from the dominance of statistical tests of hypotheses within positivist paradigms toward empirical observations, case studies, and critical interpretive accounts. Research protocols of this type are supported by several contentions, including the following:

• The complexity of social interactions makes other research approaches problematic.
• Theories derived from positivist educational research have been generally inadequate in explaining social interactions and cultural phenomena.
• Increased public examination of public institutions such as schools, hospitals, and corporate organizations requires insights of a type that other forms of research have not provided.
• Action research can provide a bridge across the perceived gap in understanding between practitioners and theorists.

Reliability and Validity

The term bias is a historically unfriendly pejorative frequently directed at action research. As much as possible, the absence of bias constitutes conditions in which reliability and validity can increase. Most vulnerable to charges of bias are action research inquiries with a low saturation point (i.e., a small N), limited interrater reliability, and unclear data triangulation. Positivist studies make attempts to control external variables that may bias data; interpretivist studies contend that it is erroneous to assume that it is possible to do any research—particularly human science research—that is uncontaminated by personal and political sympathies and that bias can occur in the laboratory as well as in the classroom. While value-free inquiry may not exist in any research, the critical issue may not be one of credibility but, rather, one of recognizing divergent ways of answering questions associated with purpose and intent. Action research can meet determinants of reliability and validity if primary contextual variables remain consistent and if researchers are as disciplined as possible in gathering, analyzing, and interpreting the evidence of their study; in using triangulation strategies; and in the purposeful use of participation validation. Ultimately, action researchers must reflect rigorously and consistently on the places and ways that values insert themselves into studies and on how researcher tensions and contradictions can be consistently and systematically examined.

Generalizability

Is any claim of replication possible in studies involving human researchers and participants?
Perhaps even more relevant to the premises and intentions that underlie action research is the question, Is this desirable in contributing to our understanding of the social world? Most action researchers are less concerned with the traditional goal of generalizability than with capturing the richness of unique human experience and meaning.

Capturing this richness is often accomplished by reframing determinants of generalization and avoiding randomly selected examples of human experience as the basis for conclusions or extrapolations. Each instance of social interaction, if thickly described, represents a slice of the social world in the classroom, the corporate office, the medical clinic, or the community center. A certain level of generalizability of action research results may be possible in the following circumstances:

• Participants in the research recognize and confirm the accuracy of their contributions.
• Triangulation of data collection has been thoroughly attended to.
• Interrater techniques are employed prior to drawing research conclusions.
• Observation is as persistent, consistent, and longitudinal as possible.
• Dependability, as measured by an auditor, substitutes for the notion of reliability.
• Confirmability replaces the criterion of objectivity.

Ethical Considerations

One profound moral issue that action researchers, like other scientists, cannot evade is the use they make of knowledge that has been generated during inquiry. For this fundamental ethical reason, the premises of any study—but particularly those of action research—must be transparent. Moreover, they must attend to a wider range of questions regarding intent and purpose than simply those of validity and reliability. These questions might include considerations such as the following:

• Why was this topic chosen?
• How and by whom was the research funded?
• To what extent does the topic dictate or align with methodology?
• Are issues of access and ethics clear?
• From what foundations are the definitions of science and truth derived?
• How are issues of representation, validity, bias, and reliability discussed?
• What is the role of the research? In what ways does this align with the purpose of the study?
• In what ways will this study contribute to knowledge and understanding?

A defensible understanding of what constitutes knowledge and of the accuracy with which it is portrayed must be able to withstand reasonable scrutiny from different perspectives. Given the complexities of human nature, complete understanding is unlikely to result from the use of a single research methodology. Ethical action researchers will make public the stance and lenses they choose for studying a particular event. With transparent intent, it is possible to honor the unique, but not inseparable, domains inhabited by social and natural, thereby accommodating appreciation for the value of multiple perspectives of the human experience.

Making Judgment on Action Research

Action research is a relatively new addition to the repertoire of scientific methodologies, but its application and impact are expanding. Increasingly sophisticated models of action research continue to evolve as researchers strive to more effectively capture and describe the complexity and diversity of social phenomena.

Perhaps as important as categorizing action research into methodological compartments is the necessity for the researcher to bring to the study full self-awareness and disclosure of the personal and political voices that will come to bear on results and action. The action researcher must reflect on and make transparent, prior to the study, the paradoxes and problematics that will guide the inquiry and, ultimately, must do everything that is fair and reasonable to ensure that action research meets requirements of rigorous scientific study. Once research purpose and researcher intent are explicit, several alternative criteria can be used to ensure that action research is sound research. These criteria include the following types, as noted by David Scott and Robin Usher:

Aparadigmatic criteria, which judge natural and social sciences by the same strategies of data
collection and which apply the same determinants of reliability and validity

Diparadigmatic criteria, which judge social phenomena research in a manner that is dichotomous to natural science events and which apply determinants of reliability and validity that are exclusive to social science

Multiparadigmatic criteria, which judge research of the social world through a wide variety of strategies, each of which employs unique postmodern determinants of social science research

Uniparadigmatic criteria, which judge the natural and social world in ways that are redefined and reconceptualized to align more appropriately with a growing quantity and complexity of knowledge

In the final analysis, action research is favored by its proponents because it

• honors the knowledge and skills of all participants
• allows participants to be the authors of their own incremental progress
• encourages participants to learn strategies of problem solving
• promotes a culture of collaboration
• enables change to occur in context
• enables change to occur in a timely manner
• is less hierarchical and emphasizes collaboration
• accounts for rather than controls phenomena

Action research is more than reflective practice. It is a complex process that may include either qualitative or quantitative methodologies, one that has researcher and participant learning at its center. Although, in practice, action research may not often result in high levels of critical analysis, it succeeds most frequently in providing participants with intellectual experiences that are illuminative rather than prescriptive and empowering rather than coercive.

Pamela Adams

See also Evidence-Based Decision Making; External Validity; Generalizability Theory; Mixed Methods Design; Naturalistic Inquiry

Further Readings

Berg, B. (2001). Qualitative research methods for the social sciences. Toronto, Ontario, Canada: Allyn and Bacon.
Carr, W., & Kemmis, S. (1986). Becoming critical: Education, knowledge and action research. Philadelphia: Falmer.
Dewey, J. (1910). How we think. Boston: D. C. Heath.
Freire, P. (1968). Pedagogy of the oppressed. New York: Herder & Herder.
Habermas, J. (1971). Knowledge and human interests. Boston: Beacon.
Holly, M., Arhar, J., & Kasten, W. (2005). Action research for teachers: Traveling the yellow brick road. Upper Saddle River, NJ: Pearson/Merrill/Prentice Hall.
Kemmis, S., & McTaggart, R. (1988). The action research planner. Geelong, Victoria, Australia: Deakin University.
Lewin, K. (1946). Action research and minority problems. Journal of Social Issues, 2, 34–46.
Revans, R. (1982). The origins and growth of action learning. Bromley, UK: Chartwell-Bratt.
Sagor, R. (1992). How to conduct collaborative action research. Alexandria, VA: Association for Supervision and Curriculum Development.
Schön, D. (1983). The reflective practitioner. New York: Basic Books.

ADAPTIVE DESIGNS IN CLINICAL TRIALS

Some designs for clinical trial research, such as drug effectiveness research, allow for modification and make use of an adaptive design. Designs such as adaptive group-sequential design, n-adjustable design, adaptive seamless phase II–III design, drop-the-loser design, adaptive randomization design, adaptive dose-escalation design, adaptive treatment-switching design, and adaptive-hypothesis design are adaptive designs.

In conducting clinical trials, investigators first formulate the research question (objectives) and then plan an adequate and well-controlled study that meets the objectives of interest. Usually, the objective is to assess or compare the effect of one or more drugs on some response. Important steps involved in the process are study design, method of analysis, selection of subjects, assignment of subjects to drugs, assessment of response, and assessment of effect in terms of hypothesis testing. All the above steps are outlined in the study protocol, and the study should follow the protocol to provide a fair and unbiased assessment of the
treatment effect. However, it is not uncommon to adjust or modify the trial, methods, or both, either at the planning stage or during the study, to provide flexibility in randomization, inclusion, or exclusion; to allow addition or exclusion of doses; to extend treatment duration; or to increase or decrease the sample size. These adjustments are mostly done for one or more of the following reasons: to increase the probability of success of the trial; to comply with budget, resource, or time constraints; or to reduce concern for safety. However, these modifications must not undermine the validity and integrity of the study. This entry defines various adaptive designs and discusses the use of adaptive designs for modifying sample size.

Adaptive Design Variations

Adaptive design of a clinical trial is a design that allows adaptation of some aspects of the trial after its initiation without undermining the trial's validity and integrity. There are variations of adaptive designs, as described in the beginning of this entry. Here is a short description of each variation:

Adaptive Group-Sequential Design. Adaptive group-sequential design allows premature termination of a clinical trial on the grounds of safety, efficacy, or futility, based on interim results.

n-Adjustable Design. Adaptive n-adjustable design allows reestimation or adjustment of sample size, based on the observed data at interim.

Adaptive Seamless Phase II–III Design. Such a design addresses, within a single trial, objectives that are normally achieved through separate Phase IIb and Phase III trials.

Adaptive Drop-the-Loser Design. Adaptive drop-the-loser design allows dropping of low-performing treatment group(s).

Adaptive Randomization Design. Adaptive randomization design allows modification of randomization schedules.

Adaptive Dose-Escalation Design. An adaptive dose-escalation design is used to identify the maximum tolerated dose (MTD) of a medication; the MTD is usually considered the optimal dose for later-phase clinical trials.

Adaptive Treatment-Switching Design. An adaptive treatment-switching design allows investigators to switch a patient's treatment from an initial assignment to an alternative treatment because of a lack of efficacy or a safety concern.

Adaptive-Hypothesis Design. Adaptive-hypothesis design allows change in research hypotheses based on interim analysis results.

Sample Size

There has been considerable research on adaptive designs in which interim data at first stage are used to reestimate overall sample size. Determination of sample size for a traditional randomized clinical trial design requires specification of a clinically meaningful treatment difference, to be detected with some desired power. Such determinations can become complicated because of the need for specifying nuisance parameters such as the error variance, and the choice for a clinically meaningful treatment difference may not be straightforward. However, without adjustment of the sample size and proper modification of the Type I error, the result may be an overpowered study, which wastes resources, or an underpowered study, with little chance of success.

A traditional clinical trial fixes the sample size in advance and performs the analysis after all subjects have been enrolled and evaluated. The advantages of an adaptive design over classical designs are that adaptive designs allow design assumptions (e.g., variance, treatment effect) to be modified on the basis of accumulating data and allow sample size to be modified to avoid an under- or overpowered study. However, researchers have shown that an adaptive design based on revised estimates of treatment effect is nearly always less efficient than a group sequential approach. Dramatic bias can occur when power computation is being performed because of significance of interim results. Yet medical researchers tend to prefer adaptive designs, mostly because (a) clinically meaningful effect size can change when results from other trials may suggest that smaller effects than originally postulated are meaningful; (b) it is easier to request a small budget initially, with an option to ask for supplemental funding after seeing the interim data; and (c) investigators may need to see some data before finalizing the design.

Abdus S. Wahed
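The fixed-design calculation that the Sample Size section above takes as its baseline can be sketched numerically. This is not code from the entry; it is a minimal normal-approximation sketch for a two-arm comparison of means, and the function name and defaults are chosen here for illustration.

```python
from math import ceil
from statistics import NormalDist

def per_group_n(delta: float, sigma: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Fixed per-group sample size for a two-arm comparison of means.

    Normal approximation: n = 2 * (sigma * (z_a + z_b) / delta)^2, where
    delta is the clinically meaningful treatment difference and sigma is
    the (nuisance) error standard deviation that must be specified up front.
    """
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # two-sided Type I error
    z_b = z.inv_cdf(power)          # desired power
    return ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)

# A treatment difference of half a standard deviation, detected with
# 80% power at two-sided alpha = .05, requires 63 subjects per group.
print(per_group_n(delta=0.5, sigma=1.0))
```

Because n grows with (sigma/delta)², misjudging either quantity at the planning stage is exactly what leaves a fixed design over- or underpowered; interim reestimation, as the entry notes, revises these assumptions on accumulating data.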
Alternative Hypotheses
researchers. As further work showed, mosquitoes, which live in swampy areas, were the primary transmitters of the disease, making the swamplands alternative incorrect.

The Importance of Experimental Control

One of the most significant challenges posed by an inference of the scientific alternative hypothesis is the infinite number of plausible explanations for the rejection of the null. There is no formal statistical procedure for arriving at the correct scientific alternative hypothesis. Researchers must rely on experimental control to help narrow the number of plausible explanations that could account for the rejection of the null hypothesis. In theory, if every conceivable extraneous variable were controlled for, then inferring the scientific alternative hypothesis would not be such a difficult task. However, since there is no way to control for every possible confounding variable (at least not in most social sciences, and even many physical sciences), the goal of good researchers must be to control for as many extraneous factors as possible. The quality and extent of experimental control is proportional to the likelihood of inferring correct scientific alternative hypotheses. Alternative hypotheses that are inferred without the prerequisite of such things as control groups built into the design of the study or experiment are at best plausible explanations as to why the null was rejected, and at worst, fashionable hypotheses that the researcher seeks to endorse without the appropriate scientific license to do so.

Concluding Comments

Hypothesis testing is an integral part of every social science researcher's job. The statistical and conceptual alternatives are two distinct forms of the alternative hypothesis. Researchers are most often interested in the conceptual alternative hypothesis. The conceptual alternative hypothesis plays an important role; without it, no conclusions could be drawn from research (other than rejecting a null). Despite its importance, hypothesis testing in the social sciences (especially the softer social sciences) has been dominated by the desire to reject null hypotheses, whereas less attention has been focused on establishing that the correct conceptual alternative has been inferred. Surely, anyone can reject a null, but few can identify and infer a correct alternative.

Daniel J. Denis, Annesa Flentje Santa, and Chelsea Burfeind

See also Hypothesis; Null Hypothesis

Further Readings

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cowles, M. (2000). Statistics in psychology: An historical perspective. Philadelphia: Lawrence Erlbaum.
Denis, D. J. (2001). Inferring the alternative hypothesis: Risky business. Theory & Science, 2, 1. Retrieved December 2, 2009, from http://theoryandscience.icaap.org/content/vol002.001/03denis.html
Hays, W. L. (1994). Statistics (5th ed.). New York: Harcourt Brace.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference (Part 1). Biometrika, 20A, 175–240.

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION

The American Educational Research Association (AERA) is an international professional organization based in Washington, D.C., and dedicated to promoting research in the field of education. Through conferences, publications, and awards, AERA encourages the scientific pursuit and dissemination of knowledge in the educational arena. Its membership is diverse, drawn from within the education professions, as well as from the broader social science field.

Mission

The mission of AERA is to influence the field of education in three major ways: (1) increasing knowledge about education, (2) promoting educational research, and (3) encouraging the use of educational research results to make education better and thereby improve the common good.
The Journal of Educational and Behavioral Statistics focuses on new statistical methods for use in educational research, as well as critiques of current practices. It is published jointly with the American Statistical Association. The Review of Educational Research publishes reviews of previously published articles by interested parties from varied backgrounds. The Review of Research in Education is an annual publication that solicits critical essays on a variety of topics facing the field of education. All AERA's journals are published by Sage.

Annual Meetings

AERA's annual meetings are an opportunity to bring AERA's diverse membership together to discuss and debate the latest in educational practices and research. Approximately 16,000 attendees gather annually to listen, discuss, and learn. For the 2008 meeting, 12,024 presentation proposals were submitted, and more than 2,000 were presented. In addition to presentations, many business meetings, invited sessions, awards, and demonstrations are held. Several graduate student-oriented sessions are also held. Many sessions focusing on educational research related to the geographical location of the annual meeting are also presented. Another valuable educational opportunity is the many professional development and training courses offered during the conference. These tend to be refresher courses in statistics and research design or evaluation or workshops on new assessment tools or classroom-based activities. In addition to the scheduled sessions, exhibitors of software, books, and testing materials present their wares at the exhibit hall, and members seeking new jobs can meet prospective employers in the career center. Tours of local attractions are also available. Each year's meeting is organized around a different theme. In 2008, the annual meeting theme was Research on Schools, Neighborhoods, and Communities: Toward Civic Responsibility. The meeting takes place at the same time and place as the annual meeting of the National Council on Measurement in Education.

Other Services and Offerings

Graduate Student Council

Graduate students are supported through several programs within AERA, but the program that provides or sponsors the most offerings for graduate students is the graduate student council. This group, composed of 28 graduate students and division and staff representatives, meets at every annual meeting to plan offerings for the graduate students. Its mission is to support graduate student members to become professional researchers or practitioners through education and advocacy. The graduate student council sponsors many sessions at the annual meeting, as well as hosting a graduate student resource center at the event. It also publishes a newsletter three times per year and hosts a Listserv where graduate students can exchange information.

Awards

AERA offers an extensive awards program, and award recipients are announced at the president's address during the annual meeting. AERA's divisions and SIGs also offer awards, which are presented during each group's business meeting. AERA's awards cover educational researchers at all stages of their career, from the Early Career Award to the Distinguished Contributions to Research in Education Award. Special awards are also given in other areas, including social justice issues, public service, and outstanding books.

Fellowships and Grants

Several fellowships are offered through AERA, with special fellowships focusing on minority researchers, researchers interested in measurement (through a program with the Educational Testing Service), and researchers interested in large-scale studies through a partnership with the American Institutes for Research. AERA also offers several small grants for various specialties, awarded up to three times per year.

Carol A. Carman

See also American Statistical Association; National Council on Measurement in Education

Further Readings

Hultquist, N. J. (1976). A brief history of AERA's
Mershon, S., & Schlossman, S. (2008). Education, science, and the politics of knowledge: The American Educational Research Association, 1915–1940. American Journal of Education, 114(3), 307–340.

Websites

American Educational Research Association: http://www.aera.net

AMERICAN PSYCHOLOGICAL ASSOCIATION STYLE

American Psychological Association (APA) style is a system of guidelines for writing and formatting manuscripts. APA style may be used for a number of types of manuscripts, such as theses, dissertations, reports of empirical studies, literature reviews, meta-analyses, theoretical articles, methodological articles, and case studies. APA style is described extensively in the Publication Manual of the American Psychological Association (APA Publication Manual). The APA Publication Manual includes recommendations on writing style, grammar, and nonbiased language, as well as guidelines for manuscript formatting, such as arrangement of tables and section headings. The first APA Publication Manual was published in 1952; the most recent edition was published in 2009. APA style is the most accepted writing and formatting style for journals and scholarly books in psychology. The use of a single style that has been approved by the leading organization in the field aids readers, researchers, and students in organizing and understanding the information presented.

Writing Style

The APA style of writing emphasizes clear and direct prose. Ideas should be presented in an orderly and logical manner, and writing should be as concise as possible. Usual guidelines for clear writing, such as the presence of a topic sentence in each paragraph, should be followed. Previous research should be described in either the past tense (e.g., "Surles and Arthur found") or the present perfect tense (e.g., "researchers have argued"). Past tense should also be used to describe results of an empirical study conducted by the author (e.g., "self-esteem increased over time"). Present tense (e.g., "these results indicate") should be used in discussing and interpreting results and drawing conclusions.

Nonbiased Language

APA style guidelines recommend that authors avoid language that is biased against particular groups. APA provides specific guidelines for describing age, gender, race or ethnicity, sexual orientation, and disability status. Preferred terms change over time and may also be debated within groups; authors should consult a current style manual if they are unsure of the terms that are currently preferred or considered offensive. Authors may also ask study participants which term they prefer for themselves. General guidelines for avoiding biased language include being specific, using labels as adjectives instead of nouns (e.g., "older people" rather than "the elderly"), and avoiding labels that imply a standard of judgment (e.g., "non-White," "stroke victim").

Formatting

The APA Publication Manual also provides a number of guidelines for formatting manuscripts. These include guidelines for use of numbers, abbreviations, quotations, and headings.

Tables and Figures

Tables and figures may allow numerical information to be presented more clearly and concisely than would be possible in text. Tables and figures may also allow for greater ease in comparing numerical data (for example, the mean depression scores of experimental and control groups). Figures and tables should present information clearly and supplement, rather than restate, information provided in the text of the manuscript. Numerical data reported in a table should not be repeated in the text.
that did not support the stated hypotheses. The results section will typically include inferential statistics, such as chi-squares, F tests, or t tests. For these statistics, the value of the test statistic, degrees of freedom, p value, and size and direction of effect should be reported, for instance, F(1, 39) = 4.86, p = .04, η² = .12.

The results section may include figures (such as graphs or models) and tables. Figures and tables will typically appear at the end of a manuscript. If the manuscript is being submitted for publication, notes may be included in the text to indicate where figures or tables should be placed (e.g., "Insert Table 1 here.").

Discussion

The discussion section is where the findings and analyses presented in the results section are summarized and interpreted. The author should discuss the extent to which the results support the stated hypotheses. Conclusions should be drawn but should remain within the boundaries of the data obtained. Ways in which the findings of the current study relate to the theoretical perspectives presented in the introduction should also be addressed. This section should briefly acknowledge the limitations of the current study and address possible alternative explanations for the research findings. The discussion section may also address potential applications of the work or suggest future research.

Referring to Others' Work

It is an author's job to avoid plagiarism by noting when reference is made to another's work or ideas. This obligation applies even when the author is making general statements about existing knowledge (e.g., "Self-efficacy impacts many aspects of students' lives, including achievement motivation and task persistence (Bandura, 1997)."). Citations allow a reader to be aware of the original source of ideas or data and direct the reader toward sources of additional information on a topic.

When preparing a manuscript, an author may be called on to evaluate sources and make decisions about the quality of research or veracity of claims. In general, the most authoritative sources are articles published in peer-reviewed journals (e.g., American Psychologist, Child Development, Journal of Personality and Social Psychology). Peer review is the process of evaluation of scientific work by other researchers with relevant areas of expertise. The methodology and conclusions of an article published in a peer-reviewed journal have been examined and evaluated by several experts in the field.

In-Text Citations

Throughout the manuscript text, credit should be given to authors whose work is referenced. In-text citations allow the reader to be aware of the source of an idea or finding and locate the work in the reference list at the end of the manuscript. APA style uses an author-date citation method (e.g., Bandura, 1997). For works with one or two authors, all authors are included in each citation. For works with three to five authors, the first in-text citation lists all authors; subsequent citations list only the first author by name (e.g., Hughes, Bigler, & Levy, 2007, in first citation; Hughes et al., 2007, in subsequent citations). Works with six or more authors are always cited in the truncated et al. format.

Reference Lists

References should be listed alphabetically by the last name of the first author. Citations in the reference list should include names of authors, article or chapter title, and journal or book title. References to articles in journals or other periodicals should include the article's digital object identifier if one is assigned. If a document was accessed online, the reference should include a URL (Web address) where the material can be accessed. The URL listed should be as specific as possible; for example, it should link to the article rather than to the publication's homepage. The APA Publication Manual includes guidelines for citing many different types of sources. Examples of some of the most common types of references appear below.

Book: Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.

Chapter in edited book: Powlishta, K. K. (2004). Gender as a social category: Intergroup processes
and gender-role development. In M. Bennet & F. Sani (Eds.), The development of the social self (pp. 103–133). New York: Psychology Press.

Journal article: Hughes, J. M., Bigler, R. S., & Levy, S. R. (2007). Consequences of learning about historical racism among European American and African American children. Child Development, 78, 1689–1705. doi: 10.1111/j.1467-8624.2007.01096.x

Article in periodical (magazine or newspaper): Gladwell, M. (2006, February 6). Troublemakers: What pit bulls can teach us about profiling. The New Yorker, 81(46), 33–41.

Research report: Census Bureau. (2006). Voting and registration in the election of November 2004. Retrieved from U.S. Census Bureau website: http://www.census.gov/population/www/socdemo/voting.html

Meagan M. Patterson

See also Abstract; Bias; Demographics; Discussion Section; Dissertation; Methods Section; Results Section

Further Readings

American Psychological Association. (2009). Publication Manual of the American Psychological Association (6th ed.). Washington, DC: Author.
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851.
Carver, R. P. (1984). Writing a publishable research report in education, psychology, and related disciplines. Springfield, IL: Charles C Thomas.
Cuba, L. J. (2002). A short guide to writing about social science (4th ed.). New York: Longman.
Dunn, D. S. (2007). A short guide to writing about psychology (2nd ed.). New York: Longman.
Sabin, W. A. (2004). The Gregg reference manual (10th ed.). New York: McGraw-Hill.

AMERICAN STATISTICAL ASSOCIATION

The American Statistical Association (ASA) is a society for scientists, statisticians, and statistics consumers representing a wide range of science and education fields. Since its inception in November 1839, the ASA has aimed to provide both statistical science professionals and the public with a standard of excellence for statistics-related projects. According to ASA publications, the society's mission is "to promote excellence in the application of statistical science across the wealth of human endeavor." Specifically, the ASA mission includes a dedication to excellence with regard to statistics in practice, research, and education; a desire to work toward bettering statistical education and the profession of statistics as a whole; a concern for recognizing and addressing the needs of ASA members; education about the proper uses of statistics; and the promotion of human welfare through the use of statistics.

Regarded as the second-oldest continuously operating professional association in the United States, the ASA has a rich history. In fact, within 2 years of its founding, the society already had a U.S. president—Martin Van Buren—among its members. Also on the list of the ASA's historical members are Florence Nightingale, Alexander Graham Bell, and Andrew Carnegie. The original founders, who united at the American Education Society in Boston to form the society, include U.S. Congressman Richard Fletcher; teacher and fundraiser William Cogswell; physician and medicine reformist John Dix Fisher; statistician, publisher, and distinguished public health author Lemuel Shattuck; and lawyer, clergyman, and poet Oliver Peabody. The founders named the new organization the American Statistical Society, a name that lasted only until the first official meeting in February 1840.

In its beginning years, the ASA developed a working relationship with the U.S. Census Bureau, offering recommendations and often lending its members as heads of the census. S. N. D. North, the 1910 president of the ASA, was also the first director of the permanent census office. The society, its membership, and its diversity in statistical activities grew rapidly after World War I as the employment of statistics in business and government gained popularity. At that time, large cities and universities began forming local chapters. By its 100th year in existence, the ASA had more members than it had ever had, and those involved with the society commemorated the
centennial with celebrations in Boston and Philadelphia. However, by the time World War II was well under way, many of the benefits the ASA experienced from the post–World War I surge were reversed. For 2 years—1942 and 1943—the society was unable to hold annual meetings. Then, after World War II, as after World War I, the ASA saw a great expansion in both its membership and its applications to burgeoning science endeavors.

Today, ASA has expanded beyond the United States and can count 18,000 individuals as members. Its members, who represent 78 geographic locations, also have diverse interests in statistics. These interests range from finding better ways to teach statistics to problem solving for homelessness and from AIDS research to space exploration, among a wide array of applications. The society comprises 24 sections, including the following: Bayesian Statistical Science, Biometrics, Biopharmaceutical Statistics, Business and Economic Statistics, Government Statistics, Health Policy Statistics, Nonparametric Statistics, Physical and Engineering Sciences, Quality and Productivity, Risk Analysis, a section for Statistical Programmers and Analysts, Statistical Learning and Data Mining, Social Statistics, Statistical Computing, Statistical Consulting, Statistical Education, Statistical Graphics, Statistics and the Environment, Statistics in Defense and National Security, Statistics in Epidemiology, Statistics in Marketing, Statistics in Sports, Survey Research Methods, and Teaching of Statistics in the Health Sciences. Detailed descriptions of each section, lists of current officers within each section, and links to each section are available on the ASA Web site.

In addition to holding meetings coordinated by more than 60 committees of the society, the ASA sponsors scholarships, fellowships, workshops, and educational programs. Its leaders and members also advocate for statistics research funding and offer a host of career services and outreach projects.

Publications from the ASA include scholarly journals, statistical magazines, books, research guides, brochures, and conference proceeding publications. Among the journals available are American Statistician; Journal of Agricultural, Biological, and Environmental Statistics; Journal of the American Statistical Association; Journal of Business and Economic Statistics; Journal of Computational and Graphical Statistics; Journal of Educational and Behavioral Statistics; Journal of Statistical Software; Journal of Statistics Education; Statistical Analysis and Data Mining; Statistics in Biopharmaceutical Research; Statistics Surveys; and Technometrics.

The official Web site of the ASA offers a more comprehensive look at the mission, history, publications, activities, and future directions of the society. Additionally, browsers can find information about upcoming meetings and events, descriptions of outreach and initiatives, the ASA bylaws and constitution, a copy of the Ethical Guidelines for Statistical Practice prepared by the Committee on Professional Ethics, and an organizational list of board members and leaders.

Kristin Rasmussen Teasdale

Further Readings

Koren, J. (1970). The history of statistics: Their development and progress in many countries. New York: B. Franklin. (Original work published 1918)
Mason, R. L. (1999). ASA: The first 160 years. Retrieved October 10, 2009, from http://www.amstat.org/about/first160years.cfm
Wilcox, W. F. (1940). Lemuel Shattuck, statistician, founder of the American Statistical Association. Journal of the American Statistical Association, 35, 224–235.

Websites

American Statistical Association: http://www.amstat.org

ANALYSIS OF COVARIANCE (ANCOVA)

Behavioral sciences rely heavily on experiments and quasi experiments for evaluating the effects of, for example, new therapies, instructional methods, or stimulus properties. An experiment includes at least two different treatments (conditions), and human participants are randomly assigned one treatment. If assignment is not based on randomization, the design is called a quasi experiment. The dependent variable or outcome of an experiment
or a quasi experiment, denoted by Y here, is usually quantitative, such as the total score on a clinical questionnaire or the mean response time on a perceptual task. Treatments are evaluated by comparing them with respect to the mean of the outcome Y using either analysis of variance (ANOVA) or analysis of covariance (ANCOVA). Multiple linear regression may also be used, and categorical outcomes require other methods, such as logistic regression. This entry explains the purposes of, and assumptions behind, ANCOVA for the classical two-group between-subjects design. ANCOVA for within-subject and split-plot designs is discussed briefly at the end.

Researchers often want to control or adjust statistically for some independent variable that is not experimentally controlled, such as gender, age, or a pretest value of Y. A categorical variable such as gender can be included in ANOVA as an additional factor, turning a one-way ANOVA into a two-way ANOVA. A quantitative variable such as age or a pretest recording can be included as a covariate, turning ANOVA into ANCOVA. ANCOVA is the bridge from ANOVA to multiple regression. There are two reasons for including a covariate in the analysis if it is predictive of the outcome Y. In randomized experiments, it reduces unexplained (within-group) outcome variance, thereby increasing the power of the treatment effect test and reducing the width of its confidence interval. In quasi experiments, it adjusts for a group difference with respect to that covariate, thereby adjusting the between-group difference on Y for confounding.

Model

The ANCOVA model for comparing two groups at posttest Y, using a covariate X, is as follows:

Yij = μ + αj + β(Xij − X̄) + eij,  (1)

where Yij is the outcome for person i in group j (e.g., j = 1 for control, j = 2 for treated), Xij is the covariate value for person i in group j, μ is the grand mean of Y, αj is the effect of treatment j, β is the slope of the regression line for predicting Y from X within groups, X̄ is the overall sample mean of covariate X, and eij is a normally distributed residual or error term with a mean of zero and a variance σe², which is the same in both groups. By definition, α1 + α2 = 0, and so α2 − α1 = 2α2 is the expected posttest group difference adjusted for the covariate X. This is even better seen by rewriting Equation 1 as

Yij − β(Xij − X̄) = μ + αj + eij,  (2)

showing that ANCOVA is ANOVA of Y adjusted for X. Due to the centering of X, that is, the subtraction of X̄, the adjustment is on average zero in the total sample. So the centering affects individual outcome values and group means, but not the total or grand mean μ of Y.

ANCOVA can also be written as a multiple regression model:

Yij = β0 + β1Gij + β2Xij + eij,  (3)

where Gij is a binary indicator of treatment group (Gi1 = 0 for controls, Gi2 = 1 for treated), and β2 is the slope β in Equation 1. Comparing Equation 1 with Equation 3 shows that β1 = 2α2 and that β0 = (μ − α2 − βX̄). Centering in Equation 3 both G and X (i.e., coding G as −1 and +1, and subtracting X̄ from X) will give β0 = μ and β1 = α2. Application of ANCOVA requires estimation of β in Equation 1. Its least squares solution is σXY/σX², the within-group covariance between pre- and posttest, divided by the within-group pretest variance, which in turn are both estimated from the sample.

Assumptions

As Equations 1 and 3 show, ANCOVA assumes that the covariate has a linear effect on the outcome and that this effect is homogeneous, that is, the same in both groups. So there is no treatment by covariate interaction. Both the linearity and the homogeneity assumption can be tested and relaxed by adding to Equation 3 as predictors X × X and G × X, respectively, but this entry concentrates on the classical model, Equation 1 or Equation 3. The assumption of homogeneity of residual variance σe² between groups can also be relaxed.

Another assumption is that X is not affected by the treatment. Otherwise, X must be treated as a mediator instead of as a covariate, with
consequences for the interpretation of analysis with versus without adjustment for X. If X is measured before treatment assignment, this assumption is warranted.

A more complicated ANCOVA assumption is that X is measured without error, where error refers to intra-individual variation across replications. This assumption will be valid for a covariate such as age but not for a questionnaire or test score, in particular not for a pretest of the outcome at hand. Measurement error in X leads to attenuation, a decrease of its correlation with Y and of its slope β in Equation 1. This leads to a loss of power in randomized studies and to bias in nonrandomized studies.

A last ANCOVA assumption that is often mentioned, but not visible in Equation 1, is that there is no group difference on X. This seems to contradict one of the two purposes of ANCOVA, that is, adjustment for a group difference on the covariate. The answer is simple, however. The assumption is not required for covariates that are measured without measurement error, such as age. But if there is measurement error in X, then the resulting underestimation of its slope β in Equation 1 leads to biased treatment effect estimation in case of a group difference on X. An exception is the case of treatment assignment based on the observed covariate value. In that case, ANCOVA is unbiased in spite of measurement error in X, whether groups differ on X or not, and any attempt at correction for attenuation will then introduce bias. The assumption of no group difference on X is addressed in more detail in a special section on the use of a pretest of the outcome Y as covariate.

Purposes

The purpose of a covariate in ANOVA depends on the design. To understand this, note that ANCOVA gives the following adjusted estimator of the group difference:

Δ̂ = (Ȳ2 − Ȳ1) − β(X̄2 − X̄1).  (4)

In a randomized experiment, the group difference on the covariate, X̄1 − X̄2, is zero, and so the adjusted difference Δ̂ is equal to the unadjusted difference (Ȳ2 − Ȳ1), apart from sampling error. In terms of ANOVA, the mean square (MS) for treatment is the same with or without adjustment, again apart from sampling error. Things are different for the MS(error), which is the denominator of the F test in ANOVA. ANCOVA estimates β such that the MS(error) is minimized, thereby maximizing the power of the F test. Since the standard error (SE) of Δ̂ is proportional to the square root of the MS(error), this SE is minimized, leading to more precise effect estimation by covariate adjustment.

In a nonrandomized study with groups differing on the covariate X, the covariate-adjusted group effect Δ̂ systematically differs from the unadjusted effect (Ȳ2 − Ȳ1). It is unbiased if the ANCOVA assumptions are satisfied and treatment assignment is random conditional on the covariate, that is, random within each subgroup of persons who are homogeneous on the covariate. Although the MS(error) is again minimized by covariate adjustment, this does not imply that the SE of Δ̂ is reduced. This SE is a function not only of the MS(error), but also of the treatment–covariate correlation. In a randomized experiment, this correlation is zero apart from sampling error, and so the SE depends only on the MS(error) and sample size. In nonrandomized studies, the SE increases with the treatment–covariate correlation and can be larger with than without adjustment. But in nonrandomized studies, the primary aim of covariate adjustment is correction for bias, not a gain of power.

The two purposes of ANCOVA are illustrated in Figures 1 and 2, showing the within-group regressions of outcome Y on covariate X, with the ellipses summarizing the scatter of individual persons around their group line. Each group has its own regression line with the same slope β (reflecting absence of interaction) but different intercepts. In Figure 1, of a nonrandomized study, the groups differ on the covariate. Moving the markers for both group means along their regression line to a common covariate value X̄ gives the adjusted group difference Δ̂ on outcome Y, reflected by the vertical distance between the two lines, which is also the difference between both intercepts. In Figure 2, of a randomized study, the two groups have the same mean covariate value, and so the unadjusted and adjusted group differences on Y are the same. However, in both figures the adjustment has yet another effect, illustrated in Figure 2. The MS(error) of ANOVA without adjustment is the entire
[Figures 1 and 2 not reproduced. Both plot outcome Y against covariate X with one regression line per group. Figure 1 (nonrandomized study): the group means Ȳ1 and Ȳ2 lie at different covariate means X̄1 and X̄2; the covariate difference and the adjusted outcome difference Δ̂ at the common value X̄ are marked. Figure 2 (randomized study, X̄1 = X̄2 = X̄): the MS(error) in ANCOVA is smaller than the MS(error) in ANOVA.]
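The adjusted estimator in Equation 4 and the regression form in Equation 3 can be sketched numerically. The following is a hedged illustration, not part of the entry: simulated data (all values and the seed are assumptions) for a nonrandomized two-group study whose groups differ on the covariate, showing that the pooled within-group slope and the regression coefficient on the group indicator reproduce the covariate-adjusted difference.

```python
import numpy as np

# Illustrative data (assumed, not from the entry): two groups of n = 200,
# true treatment effect 3, and a 10-point gap in covariate means.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(50.0, 10.0, n)                    # covariate, control group
x2 = rng.normal(60.0, 10.0, n)                    # covariate, treated group
y1 = 0.5 * x1 + rng.normal(0.0, 5.0, n)           # outcome, control
y2 = 0.5 * x2 + 3.0 + rng.normal(0.0, 5.0, n)     # outcome, treated

# Pooled within-group slope: beta = s_XY / s_X^2 (within-group terms only).
sxy = ((x1 - x1.mean()) * (y1 - y1.mean())).sum() + \
      ((x2 - x2.mean()) * (y2 - y2.mean())).sum()
sxx = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
beta = sxy / sxx

unadjusted = y2.mean() - y1.mean()                      # inflated by the covariate gap
adjusted = unadjusted - beta * (x2.mean() - x1.mean())  # Equation 4

# Equation 3 as a regression: Y = b0 + b1*G + b2*X.
# b1 equals the adjusted difference and b2 equals the pooled within-group slope.
g = np.concatenate([np.zeros(n), np.ones(n)])
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
design = np.column_stack([np.ones(2 * n), g, x])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]

print(round(unadjusted, 2), round(adjusted, 2), round(b1, 2))
```

With these simulated values the unadjusted difference is far from the true effect of 3, while the adjusted difference and the regression coefficient b1 agree exactly and land near 3, which is the identity the entry describes.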
during and after treatment, also allow covariates. The analysis is the same as for the within-subject design extended with gender and age. But interest now is in the Treatment (between-subject) × Time (within-subject) interaction and, if there is no such interaction, in the main effect of treatment averaged across the repeated measures, rather than in the main effect of the within-subject factor time. A pretest recording can again be included as covariate or as repeated measure, depending on the treatment assignment procedure. Note, however, that as the number of repeated measures increases, the F test of the Treatment × Time interaction may have low power. More powerful are the Treatment × Linear (or Quadratic) Time effect test and discriminant analysis.

Within-subject and repeated measures designs can have not only between-subject covariates such as age but also within-subject or time-dependent covariates. Examples are a baseline recording within each treatment of a crossover trial, and repeated measures of a mediator. The statistical analysis of such covariates is beyond the scope of this entry, requiring advanced methods such as mixed (multilevel) regression or structural equations modeling, although the case of only two repeated measures allows a simpler analysis by using as covariates the within-subject average and difference of the original covariate.

Practical Recommendations for the Analysis of Studies With Covariates

Based on the preceding text, the following recommendations can be given: In randomized studies, covariates should be included to gain power, notably a pretest of the outcome. Researchers are advised to center covariates and check linearity and absence of treatment–covariate interaction as well as normality and homogeneity of variance of the residuals. In nonrandomized studies of preexisting groups, researchers should adjust for covariates that are related to the outcome to reduce bias. With two pretests or two control groups, researchers should check the validity of ANCOVA and ANOVA of change by treating the second pretest as posttest or the second control group as experimental group. No group effect should then be found. In the real posttest analysis, researchers are advised to use the average of both pretests as covariate since this average suffers less from attenuation by measurement error. In nonrandomized studies with only one pretest and one control group, researchers should apply ANCOVA and ANOVA of change and pray that they lead to the same conclusion, differing in details only.

Additionally, if there is substantial dropout related to treatment or covariates, then all data should be included in the analysis to prevent bias, using mixed (multilevel) regression instead of traditional ANOVA to prevent listwise deletion of dropouts. Further, if pretest data are used as an inclusion criterion in a nonrandomized study, then the pretest data of all excluded persons should be included in the effect analysis by mixed regression to reduce bias.

Gerard J. P. Van Breukelen

See also Analysis of Variance (ANOVA); Covariate; Experimental Design; Gain Scores, Analysis of; Pretest–Posttest Design; Quasi-Experimental Design; Regression Artifacts; Split-Plot Factorial Design

Further Readings

Campbell, D. T., & Kenny, D. A. (1999). A primer on regression artifacts. New York: Guilford Press.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Frison, L., & Pocock, S. (1997). Linearly divergent treatment effects in clinical trials with repeated measures: Efficient analysis using summary statistics. Statistics in Medicine, 16, 2855–2872.
Judd, C. M., Kenny, D. A., & McClelland, G. H. (2001). Estimating and testing mediation and moderation in within-subject designs. Psychological Methods, 6, 115–134.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for questions pertaining to a randomized pretest, posttest, follow-up design. Journal of Clinical Child & Adolescent Psychology, 32(3), 467–486.
Reichardt, C. S. (1979). The statistical analysis of data from nonequivalent group designs. In T. D. Cook & D. T. Campbell (Eds.), Quasi-experimentation: Design
and analysis issues for field settings (pp. 147–205). Boston: Houghton-Mifflin.
Rosenbaum, P. R. (1995). Observational studies. New York: Springer.
Senn, S. J. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25, 4334–4344.
Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: Simple strategies for analysis using summary measures. Statistics in Medicine, 19, 861–877.
Van Breukelen, G. J. P. (2006). ANCOVA versus change from baseline: More power in randomized studies, more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59, 920–925.
Winkens, B., Van Breukelen, G. J. P., Schouten, H. J. A., & Berger, M. P. F. (2007). Randomized clinical trials with a pre- and a post-treatment measurement: Repeated measures versus ANCOVA models. Contemporary Clinical Trials, 28, 713–719.

Table 1  Comparison of Two Treatments Based On Systolic Blood Pressure Change

Treatment
Placebo    Drug A
1.3        4.0
1.5        5.7
0.5        3.5
0.8        0.4
1.1        1.3
3.4        0.8
0.8        10.7
3.6        0.3
0.3        0.5
2.2        3.3
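Two treatments with one quantitative outcome, as in Table 1, are commonly compared with an independent-samples t test (the two-group special case of one-way ANOVA). A minimal Python sketch using the Table 1 values as printed, with the pooled-variance formula written out:

```python
import math

# Pooled two-sample t test on the Table 1 values as printed.
placebo = [1.3, 1.5, 0.5, 0.8, 1.1, 3.4, 0.8, 3.6, 0.3, 2.2]
drug_a = [4.0, 5.7, 3.5, 0.4, 1.3, 0.8, 10.7, 0.3, 0.5, 3.3]

def mean(xs):
    return sum(xs) / len(xs)

def ss(xs):
    # Sum of squared deviations from the group mean.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(placebo), len(drug_a)
sp2 = (ss(placebo) + ss(drug_a)) / (n1 + n2 - 2)  # pooled variance
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))           # SE of the mean difference
t = (mean(drug_a) - mean(placebo)) / se           # compare with t(n1 + n2 - 2)
print(f"t({n1 + n2 - 2}) = {t:.2f}")
```

The resulting t statistic would be compared against the t distribution with n1 + n2 − 2 = 18 degrees of freedom at the chosen significance level.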
variance ascribable to other groups" (p. 216). Henry Scheffé defined ANOVA as "a statistical technique for analyzing measurements depending on several kinds of effects operating simultaneously, to decide which kinds of effects are important and to estimate the effects. The measurements or observations may be in an experimental science like genetics or a nonexperimental one like astronomy" (p. 3). At first, this methodology focused more on comparing the means while treating variability as a nuisance. Nonetheless, since its introduction, ANOVA has become the most widely used statistical methodology for testing the significance of treatment effects.

Based on the number of categorical variables, ANOVA can be distinguished into one-way ANOVA and two-way ANOVA. In addition, ANOVA models can be separated into a fixed-effects model, a random-effects model, and a mixed model based on how the factors are chosen during data collection. Each of them is described separately.

A1: All samples are simple random samples drawn from each of k populations representing k categories of a factor.
A2: Observations are independent of one another.
A3: The dependent variable is normally distributed in each population.
A4: The variance of the dependent variable is the same in each population.

Suppose, for the jth group, the data consist of the nj measurements Yj1, Yj2, . . . , Yjnj, j = 1, 2, . . . , k. Then the total variation in the data can be expressed as the corrected sum of squares (SS) as follows: TSS = Σj Σi (Yji − ȳ)², with the sums running over i = 1, . . . , nj and j = 1, . . . , k, where ȳ is the mean of the overall sample. On the other hand, variation due to the factor is given by

SST = Σj nj(ȳj − ȳ)²,  (2)

with the sum over j = 1, . . . , k, where ȳj is the mean of the jth group.
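The partition of the total sum of squares into a between-group component (SST) and a within-group error component (SSE = TSS − SST) can be sketched numerically. A minimal Python example with illustrative data (the group values below are assumptions, chosen only to show the decomposition and the resulting F ratio):

```python
# One-way ANOVA sums of squares on illustrative data for k = 3 groups.
groups = [
    [4.0, 5.1, 6.2, 5.5],   # group 1
    [6.8, 7.4, 8.1, 7.0],   # group 2
    [5.9, 6.5, 7.2, 6.1],   # group 3
]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Total, between-group (treatment), and within-group (error) sums of squares.
tss = sum((y - grand_mean) ** 2 for g in groups for y in g)
sst = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)

# F statistic: mean square treatment over mean square error,
# with k - 1 and n - k degrees of freedom.
mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse

assert abs(tss - (sst + sse)) < 1e-9   # TSS = SST + SSE
print(f"F({k - 1}, {n - k}) = {f_stat:.2f}")
```

The assertion checks the identity TSS = SST + SSE that underlies the ANOVA table: every squared deviation from the grand mean splits into a between-group and a within-group part.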
Table 3  General ANOVA Table for One-Way ANOVA (k populations)

Source     d.f.     SS     MS                    F
Between    k − 1    SST    MST = SST/(k − 1)     MST/MSE
Within     n − k    SSE    MSE = SSE/(n − k)
Total      n − 1    TSS

Note: n = sample size; k = number of groups; SST = sum of squares treatment (factor); MST = mean square treatment (factor); SSE = sum of squares error; TSS = total sum of squares.

For a given level of significance α, the null hypothesis H0 would be rejected, and one could conclude that the k population means are not all equal, if

F ≥ F(k−1, n−k, 1−α),  (3)

where F(k−1, n−k, 1−α) is the 100(1 − α)% point of the F distribution with k − 1 and n − k df.

Two-Way ANOVA

Two-way ANOVA is used to assess the effects of two factors and their interaction on a single response variable. There are three cases to be considered: the fixed-effects case, in which both factors are fixed; the random-effects case, in which both factors are random; and the mixed-effects case, in which one factor is fixed and the other factor is random. Two-way ANOVA is applied to answer the question of whether Factor A has a significant effect on the response adjusted for Factor B, whether Factor B has a significant effect on the response adjusted for Factor A, or whether there is an interaction effect between Factor A and Factor B.

All null hypotheses can be written as

1. H01: There is no Factor A effect.
2. H02: There is no Factor B effect.
3. H03: There is no interaction effect between Factor A and Factor B.

The ANOVA table for two-way ANOVA is shown in Table 4. In the fixed case, for a given α, the null hypothesis H01 would be rejected, and one could conclude that there is a significant effect of Factor A, if

F(Factor A) ≥ F(r−1, rc(n−1), 1−α),  (4)

where F(r−1, rc(n−1), 1−α) is the 100(1 − α)% point of the F distribution with r − 1 and rc(n − 1) df.

The null hypothesis H02 would be rejected, and one could conclude that there is a significant effect of Factor B, if

F(Factor B) ≥ F(c−1, rc(n−1), 1−α),  (5)

where F(c−1, rc(n−1), 1−α) is the 100(1 − α)% point of the F distribution with c − 1 and rc(n − 1) df.

The null hypothesis H03 would be rejected, and one could conclude that there is a significant effect of interaction between Factor A and Factor B if

Further Readings

multivariable methods (3rd ed.). Pacific Grove, CA: Duxbury Press.
Lindman, H. R. (1992). Analysis of variance in experimental design. New York: Springer-Verlag.
Scheffé, H. (1999). Analysis of variance. New York: Wiley-Interscience.
Animal Research 31

The experimental method requires at minimum two groups: the experimental group and the control group. Subjects (nonhuman animals) or participants (human animals) in the experimental group receive the treatment, and subjects or participants in the control group do not. All other variables are held constant or eliminated. When conducted correctly and carefully, the experimental method can determine cause-and-effect relationships. It is the only method that can.

Research Designs With One Factor

Completely Randomized Design

The completely randomized design is characterized by one independent variable in which subjects receive only one level of treatment. Subjects or participants are randomly drawn from a larger population, and then they are randomly assigned to one level of treatment. All other variables are held constant, counterbalanced, or eliminated. Typically, the restriction of equal numbers of subjects in each group is required. Independent variables in which subjects experience only one level are called between-subjects variables, and their use is widespread in the animal literature. Testable hypotheses include the following: What dosage of drug has the greatest effect on reducing seizures in rats? Which of five commercial diets for shrimp leads to the fastest growth? Does experience influence egg-laying sites in apple snails? Which of four methods of behavioral enrichment decreases abnormal behavior in captive chimpanzees the most?

The completely randomized design is chosen when carryover effects are of concern. Carryover effects are one form of sequence effects and result when the effect of one treatment level carries over into the next condition. For example, behavioral neuroscientists often lesion or ablate brain tissue to assess its role in behavioral systems including reproduction, sleep, emotion, learning, and memory. In these studies, carryover effects are almost guaranteed. Requiring all subjects to proceed through the control group first and then the experimental group is not an option. In cases in which subjects experience treatment levels in the same order, performance changes could result through practice or boredom or fatigue on the second or third or fourth time the animals experience the task. These so-called order effects comprise the second form of sequence effects and provide a confound wherein the experimenter does not know whether treatment effects or order effects caused the change in the dependent or response variable. Counterbalancing the order in which subjects receive the treatments can eliminate order effects, but in lesion studies, this is not possible. It is interesting that counterbalancing will not eliminate carryover effects. However, such effects are often, but not always, eliminated when the experimenter increases the time between conditions.

Carryover effects are not limited to a single experiment. Animal cognition experts studying sea lions, dolphins, chimpanzees, pigeons, or even gray parrots often use their subjects in multiple experiments, stretching over years. While the practice is not ideal, the cost of acquiring and maintaining the animal over its life span dictates it. Such effects can be reduced if subjects are used in experiments that differ greatly or if long periods of time have elapsed between studies. In some instances, researchers take advantage of carryover effects. Animals that are trained over long periods to perform complex tasks will often be used in an extended series of related experiments that build on this training.

Data from the completely randomized design can be statistically analyzed with parametric or nonparametric statistical tests. If the assumptions of a parametric test are met, and there are only two levels of treatment, data are analyzed with an independent t test. For three or more groups, data are analyzed using the analysis of variance (ANOVA) with one between-subjects factor. Sources of variance include treatment variance, which includes both treatment and error variance, and error variance by itself. The F test is (Treatment + Error) Variance divided by Error Variance. F scores greater than 1 indicate the presence of treatment variability. Because a significant F score tells the experimenter only that there is at least one significant difference, post hoc tests are required to determine where the differences lie. Depending on the experimenter's need, several post hoc tests, including a priori and a posteriori, are available. If the assumptions of a parametric test are of concern, the appropriate nonparametric test is the Mann–Whitney U test for two-group designs and the Kruskal–Wallis test if three or more groups are used. Mann–Whitney U tests provide post hoc analyses.
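To make the variance partition concrete, here is a small pure-Python sketch of the one-way between-subjects ANOVA F score described above. The shrimp-diet numbers are invented for illustration (the entry's hypothesis mentions five diets; only three appear here to keep the sketch short):

```python
from statistics import mean

# Hypothetical growth data (g) for three of the commercial shrimp diets.
diets = {
    "diet_A": [1.10, 1.25, 1.30, 1.18, 1.22],
    "diet_B": [1.55, 1.60, 1.48, 1.62, 1.51],
    "diet_C": [1.12, 1.08, 1.21, 1.15, 1.19],
}

groups = list(diets.values())
k = len(groups)                       # number of treatment levels
n = sum(len(g) for g in groups)       # total subjects
grand = mean(x for g in groups for x in g)

# Between-groups variability reflects treatment + error variance;
# within-groups variability reflects error variance alone.
ss_between = 0.0
ss_within = 0.0
for g in groups:
    m = mean(g)
    ss_between += len(g) * (m - grand) ** 2
    ss_within += sum((x - m) ** 2 for x in g)

ms_between = ss_between / (k - 1)     # df = k - 1
ms_within = ss_within / (n - k)       # df = n - k
f_score = ms_between / ms_within      # well above 1 for these data

print(round(f_score, 1))
```

An F score this far above 1 signals treatment variability, but, as the entry notes, it says only that at least one pair of diets differs; post hoc comparisons would be needed to locate the difference.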
be attributable to carryover effects, order effects, or treatment effects. If the subject does return to baseline, the effect is due to the treatment and not due to sequence effects. Note also that this experiment is carried out in Las Vegas every day, except that Conditions 1 and 3 are never employed.

Data from the repeated measures design are analyzed with a dependent t test if two groups are used and with the one-way repeated measures ANOVA or randomized-block design if more than two treatment levels are used. Sources of variance include treatment variance, subject variance, and residual variance, and as mentioned, treatment variance is divided by residual variance to obtain the F score. Post hoc tests determine where the differences lie. The two-group nonparametric alternative is Wilcoxon's signed ranks test. With more than two groups, the Friedman test is chosen and Wilcoxon's signed ranks tests serve as post hoc tests.

Research Designs With Two or More Factors

Completely Randomized Factorial Design

Determining the causes of nonhuman animal behavior often requires manipulation of two or more independent variables simultaneously. Such studies are possible using factorial research designs. In the completely randomized factorial design, subjects experience only one level of each independent variable. For example, parks and wildlife managers might ask whether male or female grizzly bears are more aggressive in spring, when leaving their dens after a long period of inactivity, or in fall, when they are preparing to enter their dens. In this design, bears would be randomly chosen from a larger population and randomly assigned to one of four combinations (female fall, female spring, male fall, or male spring). The completely randomized design is used when sequence effects are of concern or if it is unlikely that the same subjects will be available for all conditions. Examples of testable hypotheses include the following: Do housing conditions of rhesus monkeys bred for medical research affect the efficacy of antianxiety drugs? Do different dosages of a newly developed pain medication affect females and males equally? Does spatial memory ability in three different species of mountain jays (corvids) change during the year? Does learning ability differ between predators and prey in marine and freshwater environments?

Besides the obvious advantage of reducing expenses by testing two or more variables at the same time, the completely randomized design can determine whether the independent variables interact to produce a third variable. Such interactions can lead to discoveries that would have been missed with a single-factor design. Consider the following example: An animal behaviorist, working for a pet food company, is asked to determine whether a new diet is good for all ages and all breeds of canines. With limited resources of time, housing, and finances, the behaviorist decides to test puppies and adults from a small breed of dogs and puppies and adults from a large breed. Twelve healthy animals for each condition (small-breed puppy, small-breed adult, large-breed puppy, and large-breed adult) are obtained from pet suppliers. All dogs are acclimated to their new surroundings for a month and then maintained on the diet for 6 months. At the end of 6 months, an index of body mass (BMI) is calculated for each dog and is subtracted from the ideal BMI determined for these breeds and ages by a group of veterinarians from the American Veterinary Medical Association. Scores of zero indicate that the diet is ideal. Negative scores reflect dogs that are gaining too much weight and becoming obese. Positive scores reflect dogs that are becoming underweight.

The results of the analysis reveal no significant effect of Age, no significant effect of Breed, but a very strong interaction. The CEO is ecstatic, believing that the absence of main effects means that the BMI scores do not differ from zero and that the new dog food is great for all dogs. However, a graph of the interaction reveals otherwise. Puppies from the small breed have very low BMI scores, indicating severely low weight. Ironically, puppies from the large breed are extremely obese, as are small-breed adult dogs. However, large-breed adults are dangerously underweight. In essence, the interaction reveals that the new diet affects small and large breeds differentially depending on age, with the outcome that the new diet would be lethal for both breeds at both age levels, but for different reasons. The lesson here is that when main effects are computed, the mean
scores from all levels of the other treatment. The experimenter must examine the interaction effect carefully to determine whether there were main effects that were disguised by the interaction or whether there were simply no main effects.

In the end, the pet food company folded, but valuable design and statistical lessons were learned. First, in factorial designs there is always the potential for the independent variables to interact, producing a third treatment effect. Second, a significant interaction means that the independent variables affect each other differentially and that the main effects observed are confounded by the interaction. Consequently, the focus of the analysis must be on the interaction and not on the main effects. Finally, the interaction can be an expected outcome, a neutral event, or a complete surprise. For example, as discussed in the split-plot design, to demonstrate that learning has occurred, a significant interaction is required. On the surprising side, on occasion, when neuroscientists have administered two drugs at the same time, or lesioned two brain sites at the same time, the variables interacted to produce effects that relieved symptoms better than either drug alone, or the combination of lesions produced an effect never before observed.

Statistically, data from the completely randomized factorial design are analyzed with a two-way ANOVA with both factors between-subjects variables. Sources of variance include treatment variability for each independent variable and each unique combination of independent variables and error variance. Error variance is estimated by adding the within-group variability for each AB cell. A single error term is used to test all treatment effects and the interaction. Post hoc tests determine where the differences lie in the main effects, and simple main effects tests are used to clarify the source of a significant interaction. Nonparametric statistical analyses for factorial designs are not routinely available in statistical packages. One is advised to check the primary literature if a nonparametric between-subjects factorial test is required.

The Split-Plot Design

The split-plot or mixed design is used extensively to study animal learning and behavior. In its simplest form, the design has one between-subjects factor and one within-subjects factor. Subjects are randomly drawn from a larger population and assigned randomly to one level of the between-subjects variable. Subjects experience all levels of the within-subjects variable. The order in which subjects receive the treatment levels of the within-subjects variable is counterbalanced unless the experimenter wants to examine change over time or repeated exposures to the treatment. The split-plot design is particularly useful for studying the effects of treatment over time or exposure to treatment. Often, order of treatment cannot be counterbalanced in the within-subject factor. In these cases, order effects are expected. Examples of testable hypotheses include the following: Can crabs learn to associate the presence of polluted sand with illness? Can cephalopods solve a multiple T-maze faster than salmonids? Does tolerance to pain medication differ between males and females? Does maturation of the hippocampus in the brains of rats affect onset of the paradoxical effects of reward?

In experiments in which learning is the focus, order effects are confounded with Trials, the within-subjects treatment, and an interaction is expected. For example, in a typical classical conditioning experiment, subjects are randomly assigned to the paired group or the unpaired group. Subjects in the paired group receive paired presentations of a stimulus (light, tone, etc.) and a second stimulus that causes a response (meat powder, light foot shock). Subjects in the unpaired group receive presentations of both stimuli, but they are never paired. Group (paired/unpaired) serves as the between-subjects factor, and Trials (1–60) serves as the within-subjects factor. Evidence of learning is obtained when the number of correct responses increases as a function of trials for the paired group, but not for the unpaired group. Thus, the interaction, and not the main effects, is critical in experiments on learning.

Statistical analysis of the split-plot or mixed design is accomplished with the mixed design ANOVA, with one or more between factors and one or more within factors. With one between and one within factor, there are five sources of variance. These include treatment variance for both main effects and the interaction and two sources of error variance. The between factor is tested with error variability attributable to that factor. The within factor and the interaction are tested with residual error variance. Significant main effects are
further analyzed by post hoc tests, and the sources of the interaction are determined with simple main effects tests. Nonparametric tests are available in the primary literature.

Randomized Block Factorial Design

In this design, subjects experience all levels of two or more independent variables. Such experiments can be difficult and time-consuming to conduct, to analyze, and to interpret. Sequence effects are much more likely and difficult to control. Carryover effects can result from the interaction as well as the main effects of treatment and consequently can be more difficult to detect or eliminate. Order effects can be controlled through counterbalancing, but the number of possible orders quickly escalates, as exemplified by the equation m(n)(n − 1), where m equals the number of treatment levels in one independent variable and n equals the number of treatment levels in the other.

Because subjects experience all levels of all treatments, subject variability can be subtracted from the error term. This advantage, coupled with the need for fewer subjects and the ability to test for interactions, makes this design of value in learning experiments with exotic animals. The design also finds application in the behavioral neurosciences when the possibility of interactions between drugs presented simultaneously or sequentially needs to be assessed. The design can also assess the effects of maturation, imprinting, learning, or practice on important behavioral systems, including foraging, migration, navigation, habitat selection, choosing a mate, parental care, and so on. The following represent hypotheses that are testable with this design: Do nest site selection and nest building change depending on the success of last year's nest? Do successive operations to relieve herniated discs lead to more damage when coupled with physical therapy? Are food preferences in rhesus monkeys related to nutritional value, taste, or social learning from peers? Does predation pressure influence a prey's choice of diet more in times when food is scarce, or when food is abundant?

Data analysis is accomplished with a repeated measures factorial ANOVA. Seven sources of variance are computed: subject variance, treatment variance for both main effects and the interaction, and three error terms. Separate error terms are used to test each treatment effect and the interaction. Post hoc tests and simple main effects tests determine where the differences lie in the main effects and the interaction, respectively. Nonparametric analyses for this design can be found in the primary literature.

Choosing a Research Design

Choosing a research design that is cost effective, feasible, and persuasive and that minimizes the probability of making Type I errors (experimenter rejects the null hypothesis when in fact it is true) and Type II errors (experimenter accepts the null hypothesis when in fact it is false) requires information about the availability, accessibility, maintenance, care, and cost of subjects. It also requires knowledge of the subjects themselves, including their stable traits, their developmental and evolutionary histories, their adaptability to laboratory life, their tolerance of treatment, and their ability to be trained. Knowledge of potential carryover effects associated with the treatment and whether these effects are short lived or long lasting is also important. Finally, the researcher needs to know how treatment variance and error variance are partitioned to take advantage of the traits of some animals or to increase the feasibility of conducting the experiment. In addition, the researcher needs to know how interaction effects can provide evidence of change over time or lead to new discoveries.

But there is more. Beginning-level researchers often make two critical mistakes when establishing a program of research. First, in their haste and enthusiasm, they rush out and collect data and then come back to the office not knowing how to statistically analyze their data. When these same people take their data to a statistician, they soon learn a critical lesson: Never conduct an experiment and then attempt to fit the data to a particular design. Choose the research design first, and then collect the data according to the rules of the design. Second, beginning-level researchers tend to think that the more complex the design, the more compelling the research. Complexity does not correlate positively with impact. Investigators should opt for the simplest design that can answer the question. The easier it is to interpret a results
section, the more likely it is that reviewers will understand and accept the findings and the conclusions. Simple designs are easier to conduct, analyze, interpret, and communicate to peers.

Jesse E. Purdy

See also Analysis of Variance (ANOVA); Confounding; Factorial Design; Nonparametric Statistics for the Behavioral Sciences; Parametric Statistics; Post Hoc Comparisons; Single-Subject Design

Further Readings

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Glover, T. J., & Mitchell, K. J. (2002). An introduction to biostatistics. New York: McGraw-Hill.
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Monterey, CA: Thomson/Wadsworth.
Kirk, R. E. (1995). Experimental design: Procedures for behavioral sciences (3rd ed.). Monterey, CA: Wadsworth.
Lehman, A., O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). JMP for basic univariate and multivariate statistics: A step-by-step guide. Cary, NC: SAS Institute.
Wasserman, L. (2005). All of nonparametric statistics. New York: Springer.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

APPLIED RESEARCH

Applied research is inquiry using the application of scientific methodology with the purpose of generating empirical observations to solve critical problems in society. It is widely used in varying contexts, ranging from applied behavior analysis to city planning and public policy and to program evaluation. Applied research can be executed through a diverse range of research strategies that can be solely quantitative, solely qualitative, or a mixed method research design that combines quantitative and qualitative data slices in the same project. What all the multiple facets in applied research projects share is one basic commonality—the practice of conducting research in "nonpure" research conditions because data are needed to help solve a real-life problem.

The most common way applied research is understood is by comparing it to basic research. Basic research—"pure" science—is grounded in the scientific method and focuses on the production of new knowledge and is not expected to have an immediate practical application. Although the distinctions between the two contexts are arguably somewhat artificial, researchers commonly identify four differences between applied research and basic research. Applied research differs from basic research in terms of purpose, context, validity, and methods (design).

Research Purpose

The purpose of applied research is to increase what is known about a problem with the goal of creating a better solution. This is in contrast to basic research, in which the primary purpose is to expand on what is known—knowledge—with little significant connections to contemporary problems. A simple contrast that shows how research purpose differentiates these two lines of investigation can be seen in applied behavior analysis and psychological research. Applied behavior analysis is a branch of psychology that generates empirical observations that focus at the level of the individual with the goal of developing effective interventions to solve specific problems. Psychology, on the other hand, conducts research to test theories or explain changing trends in certain populations.

The irrelevance of basic research to immediate problems may at times be overstated. In one form or another, observations generated in basic research eventually influence what we know about contemporary problems. Going back to the previous comparison, applied behavior investigators commonly integrate findings generated by cognitive psychologists—how people organize and analyze information—in explaining specific types of behaviors and identifying relevant courses of interventions to modify them. The question is, how much time needs to pass (5 months, 5 years, 50 years) in the practical application of research results in order for the research to be deemed basic research? In general, applied research observations are intended to be implemented in the first few years, whereas basic researchers make no attempt to identify when their observations will be realized in everyday life.
procedure when random data from the situation of interest are generated and a systematic search (e.g., a sequence) of different sample sizes is used until the minimum sample size is found at which the specified goal is satisfied.

As another example of when an application of a Monte Carlo simulation study would be useful, Linda Muthén and Bengt Muthén have discussed a general approach to planning appropriate sample size in a confirmatory factor analysis and structural equation modeling context by using an a priori Monte Carlo simulation study. In addition to models in which all the assumptions are satisfied, Muthén and Muthén suggested sample size planning using a priori Monte Carlo simulation methods when data are missing and when data are not normal—two conditions most sample size planning methods do not address.

Even when analytic methods do exist for designing studies, sensitivity analyses can be implemented within an a priori Monte Carlo simulation framework. Sensitivity analyses in an a priori Monte Carlo simulation study allow the effect of misspecified parameters, misspecified models, and/or the validity of the assumptions on which the method is based to be evaluated. The generality of the a priori Monte Carlo simulation studies is its biggest advantage. As Maxwell, Kelley, and Joseph Rausch have stated, "Sample size can be planned for any research goal, on any statistical technique, in any situation with an a priori Monte Carlo simulation study" (2008, p. 553).

Ken Kelley

See also Accuracy in Parameter Estimation; Monte Carlo Simulation; Power Analysis; Sample Size Planning

Further Readings

Kelley, K. (2007). Sample size planning for the coefficient of variation: Accuracy in parameter estimation via narrow confidence intervals. Behavior Research Methods, 39(4), 755–766.
Kelley, K., & Maxwell, S. E. (2008). Power and accuracy for omnibus and targeted effects: Issues of sample size planning with applications to multiple regression. In P. Alasuutari, L. Bickman, & J. Brannen (Eds.), The SAGE handbook of social research methods (pp. 166–192). Thousand Oaks, CA: Sage.
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563.
Muthén, L., & Muthén, B. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9(4), 599–620.

APTITUDES AND INSTRUCTIONAL METHODS

Research on the interaction between student characteristics and instructional methods is important because it is commonly assumed that different students learn in different ways. That assumption is best studied by investigating the interaction between student characteristics and different instructional methods. The study of that interaction received its greatest impetus with the publication of Lee Cronbach and Richard Snow's Aptitudes and Instructional Methods in 1977, which summarized research on the interaction between aptitudes and instructional treatments, subsequently abbreviated as ATI research. Cronbach and Snow indicated that the term aptitude, rather than referring exclusively to cognitive constructs, as had previously been the case, was intended to refer to any student characteristic. Cronbach stimulated research in this area in earlier publications suggesting that ATI research was an ideal meeting point between the usually distinct research traditions of correlational and experimental psychology. Before the 1977 publication of Aptitudes and Instructional Methods, ATI research was spurred by Cronbach and Snow's technical report summarizing the results of such studies, which was expanded in 1977 with the publication of the volume.

Background

When asked about the effectiveness of different treatments, educational researchers often respond that "it depends" on the type of student exposed to the treatment, implying that the treatment interacted with some student characteristic. Two types of interactions are important in ATI research: ordinal and disordinal, as shown in Figure 1. In ordinal interactions (top two lines in Figure 1), one treatment yields superior outcomes at all levels of the student characteristic, though the difference between the
Figure 1   Ordinal and Disordinal Interactions (outcome, vertical axis from −5 to 15, plotted against the student characteristic)

outcomes is greater at one part of the distribution than elsewhere. In disordinal interactions (the bottom two lines in Figure 1), one treatment is superior at one point of the student distribution while the other treatment is superior for students falling at another point. The slope difference in ordinal interactions indicates that ultimately they are also likely to be disordinal, that is, the lines will cross at a further point of the student characteristic distribution than observed in the present sample.

interest.

Cronbach and Snow pointed out that in ANOVA designs, the student characteristic examined was usually available as a continuous score that had at least ordinal characteristics, and the research groups were developed by splitting the student characteristic distribution at some point to create groups (high and low; high, medium, and low; etc.). Such division into groups ignored student differences within each group and reduced the available variance by an estimated 34%. Cronbach and Snow recommended that research employ multiple linear regression analysis in which the treatments would be represented by so-called dummy variables and the student characteristic could be analyzed as a continuous score. It should also be noted, however, that when the research sample is at extreme ends of the distribution (e.g., one standard deviation above or below the mean), the use of ANOVA maximizes the possibility of finding differences between the groups.
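Cronbach and Snow's recommendation can be sketched in a few lines of standard code: keep the aptitude as a continuous score, dummy-code the treatment, and estimate the treatment-by-aptitude interaction by ordinary least squares. Everything below (sample size, coefficients, noise level) is invented for illustration and is not drawn from the encyclopedia entry:

```python
import random

random.seed(7)

# Simulated ATI data with a disordinal interaction: treatment 1 helps
# low-aptitude students, treatment 0 helps high-aptitude students.
data = []
for _ in range(200):
    x = random.gauss(0, 1)        # aptitude, kept as a continuous score
    t = random.randint(0, 1)      # dummy-coded treatment
    y = 10 + 2 * x + 1 * t - 4 * t * x + random.gauss(0, 1)
    data.append((x, t, y))

# Design matrix for y = b0 + b1*x + b2*t + b3*(x*t): the treatment enters
# as a dummy variable and the aptitude is analyzed as a continuous score.
X = [[1.0, x, t, x * t] for x, t, _ in data]
y = [row[2] for row in data]

def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Ordinary least squares via the normal equations (X'X)b = X'y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]
b0, b1, b2, b3 = solve(xtx, xty)

print(round(b3, 1))   # interaction coefficient, close to the simulated -4
```

The fitted slopes (b1 for treatment 0, b1 + b3 for treatment 1) have opposite signs, so the regression recovers the disordinal interaction directly. Splitting x at its median and running a 2 × 2 ANOVA on the same data would estimate the same interaction with less precision, which is the information loss the entry quantifies at roughly one third of the available variance.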
employing various strategies and tactics. It turned out that there were no ubiquitous collections of individual characteristics that would always result in success in a situation. Moreover, as systems of intervention in education, work training in industry, and clinical fields developed, it became apparent that different interventions, although they might be focused on the same target (e.g., teaching children to read, training bank tellers to operate their stations, helping a client overcome depression, or preparing soldiers for combat), clearly worked differently for different people. It was then suggested that the presence of differential outcomes of the same intervention could be explained by aptitude-treatment interaction (ATI, sometimes also abbreviated as AxT), a concept that was introduced by Lee Cronbach in the second part of the 20th century.

ATI methodology was developed to account both for the individual characteristics of the intervenee and the variations in the interventions while assessing the extent to which alternative forms of interventions might have differential outcomes as a function of the individual characteristics of the person to whom the intervention is being delivered. In other words, investigations of ATI have been designed to determine whether particular treatments can be selected or modified to optimally serve individuals possessing particular characteristics (i.e., ability, personality, motivation). Today, ATI is discussed in three different ways: as a concept, as a method for assessing interactions among person and situation variables, and as a framework for theories of aptitude and treatment.

ATI as a Concept

ATI as a concept refers to both an outcome and a predictor of that outcome. Understanding these facets of ATI requires decomposing the holistic concept into its three components—treatment, aptitude, and the interaction between them. The term treatment is used to capture any type of manipulation aimed at changing something. Thus, with regard to ATI, treatment can refer to a specific educational intervention (e.g., the teaching of equivalent fractions) or conceptual pedagogical framework (e.g., Waldorf pedagogy), a particular training (e.g., job-related activity, such as mastering a new piece of equipment at a workplace) or self-teaching (e.g., mastering a new skill such as typing), a clinical manipulation (e.g., a session of massage) or long-term therapy (e.g., psychoanalysis), or inspiring a soldier to fight a particular battle (e.g., issuing an order) or preparing troops to use new strategies of war (e.g., fighting insurgency). Aptitude is used to signify any systematic measurable dimension of individual differences (or a combination of such) that is related to a particular treatment outcome. In other words, aptitude does not necessarily mean a level of general cognitive ability or intelligence; it can capture specific personality traits or transient psychological states. The most frequently studied aptitudes of ATI are in the categories of cognition, conation, and affection, but aptitudes are not limited to these three categories. Finally, interaction demarcates the degree to which the results of two or more interventions will differ for people who differ in one or more aptitudes. Of note is that interaction here is defined statistically and that both intervention and aptitude can be captured by qualitative or quantitative variables (observed, measured, self-reported, or derived). Also of note is that, being a statistical concept, ATI behaves just as any statistical interaction does. Most important, it can be detected only when studies are adequately powered. Moreover, it acknowledges and requires the presence of main effects of the aptitude (it has to be a characteristic that matters for a particular outcome, e.g., general cognitive ability rather than shoe size for predicting a response to educational intervention) and the intervention (it has to be an effective treatment that is directly related to an outcome, e.g., teaching a concept rather than just giving students candy). This statistical aspect of ATI is important for differentiating it from what is referred to by the ATI developers and proponents as transaction. Transaction signifies the way in which ATI is constructed, the environment and the process in which ATI emerges; in other words, ATI is always a statistical result of a transaction through which a person possessing certain aptitudes experiences a certain treatment. ATI as an outcome identifies combinations of treatments and aptitudes that generate a significant change or a larger change compared with other combinations. ATI as a predictor points to which treatment or treatments are more likely to generate
significant or larger change for a particular individual or individuals.

ATI as a Method

ATI as a method permits the use of multiple experimental designs. The very premise of ATI is its capacity to combine correlational approaches (i.e., studies of individual differences) and experimental approaches (i.e., studies of interventional manipulations). Multiple paradigms have been developed to study ATI; many of them have been and continue to be applied in other, non-ATI, areas of interventional research. In classical accounts of ATI, the following designs are typically mentioned.

In a simple standard randomized between-persons design, the outcome is investigated for persons who score at different levels of a particular aptitude when multiple, distinct interventions are compared. Having registered these differential outcomes, intervention selection is then carried out based on a particular level of aptitude to optimize the outcome. Within this design, often, when ATI is registered, it is helpful to carry out additional studies (e.g., case studies) to investigate the reason for the manifestation of ATI. The treatment revision design assumes the continuous adjustment of an intervention (or the creation of multiple parallel versions of it) in response to how persons with different levels of aptitude react to each improvement in the intervention (or alternative versions of the intervention). The point here is to optimize the intervention by creating multiple versions or multiple stages of it so that the outcome is optimized at all levels of aptitude. This design has between- and within-person versions, depending on the purposes of the intervention that is being revised (e.g., ensuring that all children can learn equivalent fractions regardless of their level of aptitude or ensuring the success of the therapy regardless of the variability in depressive states of a client across multiple therapy sessions). In the aptitude growth design, the target of intervention is the level of aptitude. The idea here is that as the level of aptitude changes, different types of interventions might be used to optimize the outcome. This type of design is often used in combination with growth-curve analyses. It can be applied as either a between-persons or a within-person design. Finally, a type of design that has been gaining much popularity lately is the regression discontinuity design. In this design, the presence of ATI is registered when the same intervention is administered before and after a particular event (e.g., a change in aptitude in response to linguistic immersion while living in a country while continuing to study the language of that country).

ATI as a Theoretical Framework

ATI as a theoretical framework underscores the flexible and dynamic, rather than fixed and deterministic, nature of the coexistence (or coaction) of individual characteristics (i.e., aptitudes) and situations (i.e., interventions). As a theory, ATI captures the very nature of variation in learning—not everyone learns equally well from the same method of instruction, and not every method of teaching works for everyone; in training—people acquire skills in a variety of ways; in therapy—not everyone responds well to a particular therapeutic approach; and in organizational activities—not everyone prefers the same style of leadership. In this sense, as a theoretical framework, ATI appeals to professionals in multiple domains as it justifies the presence of variation in outcomes in classrooms, work environments, therapeutic settings, and battlefields. While applicable to all types and levels of aptitudes and all kinds of interventions, ATI is particularly aligned with more extreme levels of aptitude, both low and high, and more specialized interventions. The theory of ATI acknowledges the presence of heterogeneity in both aptitudes and interventions, and its premise is to find the best possible combinations of the two to maximize the homogeneity of the outcome. A particular appeal of the theory is its transactional nature and its potential to explain and justify both success and failure in obtaining the desired outcome. As a theoretical framework, ATI does not require the interaction either to be registered empirically or to be statistically significant. It calls for a theoretical examination of the aptitude and interventional parameters whose interaction would best explain the dynamics of learning, skill acquisition and demonstration, therapy, and leadership. The beneficiaries of this kind of examination are of two kinds. First, there are the researchers themselves. Initially thinking through experiments and field
studies before trying to confirm the existence of ATI empirically was, apparently, not a common feature of ATI studies during the height of their popularity. Perhaps a more careful consideration of the "what, how, and why" of measurement in ATI research would have prevented the observation that many ATI findings resulted from somewhat haphazard fishing expeditions, and the resulting views on ATI research would have been different. The second group of beneficiaries of ATI studies is practitioners and policy makers. That there is no intervention that works for all, and that one has to anticipate both successes and failures and consider who will and who will not benefit from a particular intervention, are important realizations to make while adopting a particular educational program, training package, therapeutic approach, or organizational strategy, rather than in the aftermath. However, the warning against embracing panaceas, made by Richard Snow, in interventional research and practice is still just a warning, not a common presupposition.

Criticism

Interest in ATI emerged in the 1950s, peaked in the 1970s and 1980s, but then dissipated. This expansion and contraction were driven by an initial surge in enthusiasm, followed by a wave of skepticism about the validity of ATI. Specifically, a large-scope search for ATI, whose presence was interpreted as being marked by differentiated regression slopes predicting outcomes from aptitudes for different interventions, or by the significance of the interaction terms in analysis of variance models, was enthusiastically carried out by a number of researchers. The accumulated data, however, were mixed and often contradictory—there were traces of ATI, but its presence and magnitude were not consistently identifiable or replicable. Many reasons have been mentioned in discussions of why ATI is so elusive: underpowered studies, weak theoretical conceptualizations of ATI, simplistic research designs, imperfections in statistical analyses, and the magnitude and even the nonexistence of ATI, among others. As a result of this discussion, the initial prediction of the originator of ATI's concept, Lee Cronbach, that interventions designed for the average individual would ultimately be replaced by multiple parallel interventions to fit groups of individuals, was revised. The "new view" of ATI, put forward by Cronbach in 1975, acknowledged that, although in existence, ATI is much more complex and fluid than initially predicted, and that ATI's dynamism and fluidity prevent professionals from cataloging specific types of ATI and generalizing guidelines for prescribing different interventions to people, given their aptitudes. Although the usefulness of ATI as a theory has been recognized, its features as a concept and as a method have been criticized along the lines of (a) our necessarily incomplete knowledge of all possible aptitudes and their levels, (b) the shortage of good psychometric instruments that can validly and reliably quantify aptitudes, (c) the biases inherent in many procedures related to aptitude assessment and intervention delivery, and (d) the lack of understanding and possible registering of important "other" nonstatistical interactions (e.g., between student and teacher, client and therapist, environment and intervention). And yet ATI has never been completely driven from the field, and there have been steady references to the importance of ATI's framework and the need for better-designed empirical studies of ATI.

Gene × Environment Interaction

ATI has a number of neighboring concepts that also work within the general realm of qualifying and quantifying individual differences in situations of acquiring new knowledge or new skills. Among these concepts are learning styles, learning strategies, learning attitudes, and many interactive effects (e.g., aptitude-outcome interaction). Quite often, the concept of ATI is discussed side by side with these neighboring concepts. Of particular interest is the link between the concept of ATI and the concept of Gene × Environment interaction (G × E). The concept of G × E first appeared in nonhuman research but gained tremendous popularity in the psychological literature within the same decade. Of note is that the tradition of its use in this literature is very similar to that of the usage of ATI; specifically, G × E also can be viewed as a concept, a method, and a theoretical framework. But the congruence between the two concepts is incomplete, of course; the concept of G × E adopts a very narrow definition of aptitude, in which individual differences are reduced to
genetic variation, and a very broad definition of treatment, in which interventions can be equated with life events. Yet an appraisal of the parallels between the concepts of ATI and G × E is useful because it captures the field's desire to engage interaction effects for explanatory purposes whenever the explicatory power of main effects is disappointing. And it is interesting that the accumulation of the literature on G × E results in a set of concerns similar to those that interrupted the gold rush of ATI studies in the 1970s.

Yet methodological concerns aside, the concept of ATI rings a bell for all of us who have ever tried to learn anything in a group of people: what works for some of us will not work for the others as long as we differ on even one characteristic that is relevant to the outcome of interest. Whether it was wit or something else by which Homer attempted to differentiate Odysseus and Thersites, the poet did at least successfully make an observation that has been central to many fields of social studies and that has inspired the appearance of the concept, methodology, and theoretical framework of ATI, as well as the many other concepts that capture the essence of what it means to be an individual in any given situation: that individual differences in response to a common intervention exist. Millennia later, it is an observation that still claims our attention.

Elena L. Grigorenko

See also Effect Size, Measures of; Field Study; Growth Curve; Interaction; Intervention; Power; Within-Subjects Design

Further Readings

Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Dance, K. A., & Neufeld, R. W. J. (1988). Aptitude-treatment interaction research in the clinical setting: A review of attempts to dispel the "patient uniformity" myth. Psychological Bulletin, 104, 192–213.
Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199–208.
Grigorenko, E. L. (2005). The inherent complexities of gene-environment interactions. Journal of Gerontology, 60B, 53–64.
Snow, R. E. (1984). Placing children in special education: Some comments. Educational Researcher, 13, 12–14.
Snow, R. E. (1991). Aptitude-treatment interaction as a framework for research on individual differences in psychotherapy. Journal of Consulting & Clinical Psychology, 59, 205–216.
Spearman, C. (1904). "General intelligence" objectively determined and measured. American Journal of Psychology, 15, 201–293.
Violato, C. (1988). Interactionism in psychology and education: A new paradigm or source of confusion? Journal of Educational Thought, 22, 4–20.

ASSENT

The term assent refers to the verbal or written agreement to engage in a research study. Assent is generally applicable to children between the ages of 8 and 18 years, although assent may apply to other vulnerable populations also.

Vulnerable populations are those composed of individuals who are unable to give consent due to diminished autonomy. Diminished autonomy occurs when an individual is incapacitated, has restricted freedom, or is a minor. Understanding the relevance of assent is important because without obtaining the assent of a participant, the researcher has restricted the freedom and autonomy of the participant and in turn has violated the basic ethical principle of respect for persons. Assent with regard to vulnerable populations is discussed here, along with the process of obtaining assent and the role of institutional review boards in the assent process.

Vulnerable Populations

Respect for persons requires that participants agree to engage in research voluntarily and have adequate information to make an informed decision. Most laws recognize that a person 18 years of age or older is able to give his or her informed consent to participate in the research study. However, in
some cases individuals lack the capacity to provide informed consent. An individual may lack the capacity to give his or her consent for a variety of reasons; examples include a prisoner who is ordered to undergo an experimental treatment designed to decrease recidivism, a participant with mental retardation, or an older adult with dementia whose caretakers believe an experimental psychotherapy group may decrease his or her symptoms. Each of the participants in these examples is not capable of giving permission to participate in the research because he or she either is coerced into engaging in the research or lacks the ability to understand the basic information necessary to fully consent to the study.

State laws prohibit minors and incapacitated individuals from giving consent. In these cases, permission must be obtained from parents and court-appointed guardians, respectively. However, beyond consent, many ethicists, professional organizations, and ethical codes require that assent be obtained. With children, state laws define when a young person is legally competent to make informed decisions. Some argue that the ability to give assent develops between 8 and 14 years of age, when the person becomes able to comprehend the requirements of the research. In general, however, it is thought that by the age of 10, children should be able to provide assent to participate. It is argued that obtaining assent increases the autonomy of the individual. By obtaining assent, individuals are afforded as much control as possible over their decision to engage in the research given the circumstances, regardless of their mental capacity.

Obtaining Assent

Assent is not a singular event. It is thought that assent is a continual process. Thus, researchers are encouraged to obtain permission to continue with the research during each new phase of research (e.g., moving from one type of task to the next). If an individual assents to participate in the study but during the study requests to discontinue, it is recommended that the research be discontinued. Although obtaining assent is strongly recommended, failure to obtain assent does not necessarily preclude the participant from engaging in the research. For example, if the parent of a 4-year-old child gives permission for the child to attend a social skills group for socially anxious children, but the child does not assent to treatment, the child may be enrolled in the group without his or her assent. However, it is recommended that assent be obtained whenever possible. Further, if a child does not give assent initially, attempts to obtain assent should continue throughout the research. Guidelines also suggest that assent may be overlooked in cases in which the possible benefits of the research outweigh the costs. For example, if one wanted to study the effects of a life-saving drug for children and the child refused the medication, the benefit of saving the child's life outweighs the cost of not obtaining assent. Assent may also be overlooked in cases in which assent of the participants is not feasible, as would be the case for a researcher interested in studying children who died as a result of not wearing a seatbelt.

Obtaining assent is an active process whereby the participant and the researcher discuss the requirements of the research. In this case, the participant is active in the decision making. Passive consent, a concept closely associated with assent and consent, is the lack of protest, objection, or opting out of the research study and is considered permission to continue with the research.

Institutional Review Boards

Institutional review boards frequently make requirements as to the way assent is to be obtained and documented. Assent may be obtained either orally or in writing and should always be documented. In obtaining assent, the researcher provides the same information as is provided to an individual from whom consent is requested. The language level and details may be altered in order to meet the understanding of the assenting participant. Specifically, the participant should be informed of the purpose of the study; the time necessary to complete the study; and the risks, benefits, and alternatives to the study or treatment. Participants should also have access to the researcher's contact information. Finally, limits of confidentiality should be addressed. This is particularly important for individuals in the prison system and for children.

Tracy J. Cohn
See also Debriefing; Ethics in the Research Process; Informed Consent; Interviewing

Further Readings

Belmont report: Ethical principles and guidelines for the protection of human subjects of research. (1979). Washington, DC: U.S. Government Printing Office.
Grisso, T. (1992). Minors' assent to behavioral research without parental consent. In B. Stanley & J. E. Sieber (Eds.), Social research on children and adolescents: Ethical issues (pp. 109–127). Newbury Park, CA: Sage.
Miller, V. A., & Nelson, R. M. (2006). A developmental approach to child assent for nontherapeutic research. Journal of Pediatrics, 150(4), 25–30.
Ross, L. F. (2003). Do healthy children deserve greater protection in medical research? Journal of Pediatrics, 142(2), 108–112.

ASSOCIATION, MEASURES OF

Measuring association between variables is very relevant for investigating causality, which is, in turn, the sine qua non of scientific research. However, an association between two variables does not necessarily imply a causal relationship, and the research design of a study aimed at investigating an association needs to be carefully considered in order for the study to obtain valid information. Knowledge of measures of association and the related ideas of correlation, regression, and causality are cornerstone concepts in research design. This entry is directed at researchers disposed to approach these concepts in a conceptual way.

Measuring Association

In scientific research, association is generally defined as the statistical dependence between two or more variables. Two variables are associated if some of the variability of one variable can be accounted for by the other, that is, if a change in the quantity of one variable conditions a change in the other variable.

Before investigating and measuring association, it is first appropriate to identify the types of variables that are being compared (e.g., nominal, ordinal, discrete, continuous). The type of variable will determine the appropriate statistical technique or test that is needed to establish the existence of an association. If the statistical test shows a conclusive association that is unlikely to occur by random chance, different types of regression models can be used to quantify how change in exposure to a variable relates to change in the outcome variable of interest.

Examining Association Between Continuous Variables With Correlation Analyses

Correlation is a measure of association between two variables that expresses the degree to which the two variables are rectilinearly related. If the data do not follow a straight line (e.g., they follow a curve), common correlation analyses are not appropriate. In correlation, unlike regression analysis, there are no dependent and independent variables.

When both variables are measured as discrete or continuous variables, it is common for researchers to examine the data for a correlation between these variables by using the Pearson product-moment correlation coefficient (r). This coefficient has a value between −1 and +1 and indicates the strength of the association between the two variables. A perfect correlation of ±1 occurs only when all pairs of values (or points) fall exactly on a straight line.

A positive correlation indicates in a broad way that increasing values of one variable correspond to increasing values in the other variable. A negative correlation indicates that increasing values in one variable correspond to decreasing values in the other variable. A correlation value close to 0 means no association between the variables. The r provides information about the strength of the correlation (i.e., the nearness of the points to a straight line). Figure 1 gives some examples of correlations, correlation coefficients, and related regression lines.

A condition for estimating correlations is that both variables must be obtained by random sampling from the same population. For example, one can study the correlation between height and weight in a sample of children but not the correlation between height and three different types of diet that have been decided by the investigator.
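The behavior of r described above can be illustrated with a short computation (a pure-Python sketch; the helper name pearson_r and the toy data are ours, not part of the entry):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

heights = [1.0, 2.0, 3.0, 4.0]
print(round(pearson_r(heights, [2.0, 4.0, 6.0, 8.0]), 3))  # points exactly on a rising line: 1.0
print(round(pearson_r(heights, [8.0, 6.0, 4.0, 2.0]), 3))  # points exactly on a falling line: -1.0
print(round(pearson_r(heights, [2.0, 9.0, 4.0, 8.0]), 3))  # scattered points: somewhere between -1 and +1
```

With real data, the same coefficient is typically obtained from a statistical package (e.g., scipy.stats.pearsonr), which also reports a p value for the test of no association.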
When planning a research design, it is always preferable to perform a prospective study because it identifies the exposure before any individual has developed the outcome. If one observes an association in a cross-sectional design, one can never be sure of the direction of the association. For example, low income is associated with impaired health in cross-sectional studies, but it is not known whether bad health leads to low income or the opposite. As noted by Austin Bradford Hill, the existence of a temporal relationship is the main criterion for distinguishing causality from association. Other relevant criteria pointed out by this author are consistency, strength, specificity, dose-response relationship, biological plausibility, and coherence.

Bias and Random Error in Association Studies

When planning a study design for investigating causal associations, one needs to consider the possible existence of random error, selection bias, information bias, and confounding, as well as the presence of interactions or effect modification and of mediator variables.

Bias is often defined as the lack of internal validity of the association between exposure and outcome variable of interest. This is in contrast to external validity, which concerns generalizability of the association to other populations. Bias can also be defined as nonrandom or systematic difference between an estimate and the true value of the population.

Random Error

When designing a study, one always needs to include a sufficient number of individuals in the analyses to achieve appropriate statistical power and ensure that conclusive estimates of association can be obtained. Suitable statistical power is especially relevant when it comes to establishing the absence of association between two variables. Moreover, when a study involves a large number of individuals, more information is available. More information lowers the random error, which in turn increases the precision of the estimates.

Selection Bias

Selection bias can occur if the sample differs from the rest of the population and if the observed association is modified by a third variable. The study sample may be different from the rest of the population (e.g., only men or only healthy people), but this situation does not necessarily imply that the results obtained are biased and cannot be applied to the general population. Many randomized clinical trials are performed on a restricted sample of individuals, but the results are actually generalizable to the whole population. However, if there is an interaction between variables, the effect modification that this interaction produces must be considered. For example, the association between exposure to asbestos and lung cancer is much more intense among smokers than among nonsmokers. Therefore, a study on a population of nonsmokers would not be generalizable to the general population. Failure to consider interactions may even render associations spurious in a sample that includes the whole population. For example, a drug may increase the risk of death in one group of patients but decrease this risk in other groups of patients. However, an overall measure would show no association since the antagonistic directions of the underlying associations compensate each other.

Information Bias

Information bias simply arises because information collected on the variables is erroneous. All variables must be measured correctly; otherwise, one can arrive at imprecise or even spurious associations.

Confounding

An association between two variables can be confounded by a third variable. Imagine, for example, that one observes an association between the existence of yellow nails and mortality. The causality of this association could be plausible. Since nail tissue stores body substances, the yellow coloration might indicate poisoning or metabolic disease that causes an increased mortality. However, further investigation would indicate that individuals with yellow nails were actually heavy smokers. The habit of holding the cigarette between the fingers discolored their nails, but the cause of death was smoking. That is, smoking was associated with both yellow nails and mortality and originated a confounded association (Figure 2).
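The yellow-nails example can be mimicked with a small simulation (a hypothetical sketch; the population size and all probabilities are invented for illustration). Smoking raises the chances of both yellow nails and death, while nails themselves have no effect on death; the crude nails-death contrast is then large, but it nearly vanishes once the comparison is made within each smoking stratum:

```python
import random

random.seed(1)

def death_rate(rows, nails_value):
    """Proportion of deaths among subjects with the given yellow-nails status."""
    deaths = [death for nails, death in rows if nails == nails_value]
    return sum(deaths) / len(deaths)

# Smoking causes both yellow nails and death; nails have no causal effect on death.
population = []
for _ in range(100_000):
    smoker = random.random() < 0.3
    nails = random.random() < (0.8 if smoker else 0.05)
    death = random.random() < (0.3 if smoker else 0.05)
    population.append((smoker, nails, death))

# Crude comparison: yellow nails appear strongly "associated" with death.
crude = [(nails, death) for smoker, nails, death in population]
print(death_rate(crude, True) - death_rate(crude, False))  # clearly positive (about 0.2 with these rates)

# Stratifying by the confounder makes the association all but disappear.
smokers = [(nails, death) for smoker, nails, death in population if smoker]
nonsmokers = [(nails, death) for smoker, nails, death in population if not smoker]
print(death_rate(smokers, True) - death_rate(smokers, False))        # close to 0
print(death_rate(nonsmokers, True) - death_rate(nonsmokers, False))  # close to 0
```

The same logic underlies stratified analyses and covariate adjustment in regression models: the confounder, not the nails, drives the crude association.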
Figure 2 Deceptive Correlation Between Yellow Nails and Mortality
Note: Because smoking is associated with both yellow nails and mortality, it originated a confounded association between yellow nails and mortality.

Figure 3 Smoking Acts as a Mediator Between Income and Early Death
Note: Heavy smoking mediated the effect of low income on mortality.

Mediation

In some cases an observed association is mediated by an intermediate variable. For example, individuals with low income present a higher risk of early death than do individuals with high income. Simultaneously, there are many more heavy smokers among people with low income. In this case, heavy smoking mediates the effect of low income on mortality.

Distinguishing which variables are confounders and which are mediators cannot be done by statistical techniques only. It requires previous knowledge, and in some cases variables can be both confounders and mediators.

Directed Acyclic Graphs

Determining which variables are confounders, intermediates, or independently associated variables can be difficult when many variables are involved. Directed acyclic graphs use a set of simple rules to create a visual representation of direct and indirect associations of covariates and exposure variables with the outcome. These graphs can help researchers understand possible causal relationships.

Juan Merlo and Kristian Lynch

Further Readings

Altman, D. G. (1991). Practical statistics for medical research. New York: Chapman & Hall/CRC.
Fisher, L. D., & van Belle, G. (1993). Biostatistics: A methodology for the health sciences. New York: Wiley.
Hernán, M. A., Hernández-Díaz, S., Werler, M. M., & Mitchell, A. A. (2002). Causal knowledge as a prerequisite for confounding evaluation: An application to birth defects epidemiology. American Journal of Epidemiology, 155(2), 176–184.
Hernán, M. A., & Robins, J. M. (2006). Instruments for causal inference: An epidemiologist's dream? Epidemiology, 17(4), 360–372.
Hill, A. B. (1965). Environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58, 295–300.
Jaakkola, J. J. (2003). Case-crossover design in air pollution epidemiology. European Respiratory Journal, 40, 81s–85s.
Last, J. M. (Ed.). (2000). A dictionary of epidemiology (4th ed.). New York: Oxford University Press.
Liebetrau, A. M. (1983). Measures of association. Newbury Park, CA: Sage.
Oakes, J. M., & Kaufman, J. S. (Eds.). (2006). Methods in social epidemiology. New York: Wiley.
Rothman, K. J. (Ed.). (1988). Causal inference. Chestnut Hill, MA: Epidemiology Resources.
Susser, M. (1991). What is a cause and how do we know one? A grammar for pragmatic epidemiology. American Journal of Epidemiology, 133, 635–648.

AUTOCORRELATION

Autocorrelation describes sample or population observations or elements that are related to each other across time, space, or other dimensions. Correlated observations are common but problematic, largely because they violate a basic statistical assumption about many samples: independence across elements. Conventional tests of statistical significance assume simple random sampling, in which not only each element has an equal chance of selection but also each combination of elements has an equal chance of selection; autocorrelation
violates this assumption. This entry describes common sources of autocorrelation, the problems it can cause, and selected diagnostics and solutions.

Sources

What is the best predictor of a student's 11th-grade academic performance? His or her 10th-grade grade point average. What is the best predictor of this year's crude divorce rate? Usually last year's divorce rate. The old slogan "Birds of a feather flock together" describes a college classroom in which students are about the same age, at the same academic stage, and often in the same disciplinary major. That slogan also describes many residential city blocks, where adult inhabitants have comparable incomes and perhaps even similar marital and parental status. When examining the spread of a disease, such as the H1N1 influenza, researchers often use epidemiological maps showing concentric circles around the initial outbreak locations.

All these are examples of correlated observations, that is, autocorrelation, in which two individuals from a classroom or neighborhood cluster, cases from a time series of measures, or cases in proximity to a contagious event resemble each other more than two cases drawn from the total population of elements by means of a simple random sample. Correlated observations occur for several reasons:

• Repeated, comparable measures are taken on the same individuals over time, such as many pretest and posttest experimental measures or panel surveys, which reinterview the same individual. Because people remember their prior responses or behaviors, because many behaviors are habitual, and because many traits or talents stay relatively constant over time, these repeated measures become correlated for the same person.

• Time-series measures also apply to larger units, such as birth, divorce, or labor force participation … the same unit at an earlier time, frequently one period removed (often called t − 1).

• Spatial correlation occurs in cluster samples (e.g., classrooms or neighborhoods): Physically adjacent elements have a higher chance of entering the sample than do other elements. These adjacent elements are typically more similar to already sampled cases than are elements from a simple random sample of the same size.

• A variation of spatial correlation occurs with contagion effects, such as crime incidence (burglars ignore city limits in plundering wealthy neighborhoods) or an outbreak of disease.

• Multiple (repeated) measures administered to the same individual at approximately the same time (e.g., a lengthy survey questionnaire with many Likert-type items in agree-disagree format).

Autocorrelation Terms

The terms positive and negative autocorrelation often apply to time-series data. Societal inertia can inflate the correlation of observed measures across time. The social forces creating trends such as falling marriage rates or rising gross domestic product often carry over from one period into the next. When trends continue over time (e.g., a student's grades), positive predictions can be made from one period to the next, hence the term positive autocorrelation.

However, forces at one time can also create compensatory or corrective mechanisms at the next, such as consumers' alternating patterns of "save, then spend" or regulation of production based on estimates of prior inventory. The data points seem to ricochet from one time to the next, so adjacent observations are said to be negatively correlated, creating a cobweb-pattern effect.

The order of the autocorrelation process references the degree of periodicity in correlated obser-
pation rates in countries or achievement grades in vations. When adjacent observations are
a county school system. Observations on the same correlated, the process is first-order autoregression,
variable are repeated on the same unit at some or AR (1). If every other observation, or alternate
periodic interval (e.g., annual rate of felony observations, is correlated, this is an AR (2) pro-
crimes). The units transcend the individual, and cess. If every third observation is correlated, this is
the periodicity of measurement is usually regular. an AR (3) process, and so on. The order of the
A lag describes a measure of the same variable on process is important, first because the most
Autocorrelation 53
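The AR(1) processes described above are easy to simulate. The sketch below is not part of the original entry; it uses only the Python standard library and a hypothetical series length to show how trend-like forces produce positive lag-1 correlation and compensatory "ricochet" forces produce negative lag-1 correlation:

```python
import random

def simulate_ar1(rho, n=500, seed=7):
    """Generate an AR(1) series: x_t = rho * x_(t-1) + v_t,
    with v_t drawn from a standard normal distribution."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def lag1_correlation(series):
    """Correlation between each observation and its t-1 lag."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

positive = simulate_ar1(0.8)    # inertia: trends carry over period to period
negative = simulate_ar1(-0.8)   # "save, then spend": adjacent values ricochet

print(round(lag1_correlation(positive), 2))   # positive, roughly 0.8
print(round(lag1_correlation(negative), 2))   # negative, roughly -0.8
```

Raising the order is a matter of adding further lag terms (e.g., x_t = ρ₂·x_(t−2) + v_t for an AR(2)-style dependence on alternate observations).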
Diagnosing First-Order Autocorrelation

There are several ways to detect first-order autocorrelation in least squares analyses. Pairs of adjacent residuals can be plotted against time (or space) and the resulting scatterplot examined. However, the scatterplot "cloud of points" mentioned in most introductory statistics texts often resembles just that, especially with large samples. The decision is literally based on an "eyeball" analysis.

Second, and more formally, the statistical significance of the number of positive and negative runs or sign changes in the residuals can be tested. Tables of significance tests for the runs test are available in many statistics textbooks. The situation of too many runs means the adjacent residuals have switched signs too often and oscillate, resulting in a diagnosis of negative autocorrelation. The situation of too few runs means long streams of positive or negative trends, thus suggesting positive autocorrelation. The number of runs expected in a random progression of elements depends on the number of observations. Most tables apply to relatively small sample sizes, such as N < 40. Since many time series for social trends are relatively short in duration, depending on the availability of data, this test can be more practical than it initially appears.

One widely used formal diagnostic for first-order autocorrelation is the Durbin-Watson d statistic, which is available in many statistical computer programs. The d statistic is approximately calculated as 2(1 − ρ), where ρ is the intraclass correlation coefficient of e_t and e_(t−1). The e_t can be defined as adjacent residuals (in the following formula, v represents the true random error terms that one really wants to estimate):

e_t = ρ·e_(t−1) + v_t

Thus d is a ratio of the sum of squared differences between adjacent residuals to the sum of squared residuals. The d has an interesting statistical distribution: Values near 2 imply ρ = 0 (no autocorrelation); d is 0 when ρ = 1 (extreme positive autocorrelation) and 4 when ρ = −1 (extreme negative autocorrelation). In addition, d has two zones of indecision (one near 0 and one near 4), in which the null hypothesis ρ = 0 is neither accepted nor rejected. The zones of indecision depend on the number of cases and the number of predictor variables. The d calculation cannot be used with regressions through the origin, with standardized regression equations, or with equations that include lags of the dependent variable as predictors.

Many other computer programs provide iterative estimates of ρ and its standard error, and sometimes the Durbin-Watson d as well. Hierarchical linear models and time-series analysis programs are two examples. The null hypothesis ρ = 0 can be tested through a t distribution with the ratio

ρ / se_ρ.

The t value can be evaluated using the t tables if needed. If ρ is not statistically significant, there is no first-order autocorrelation. If the analyst is willing to specify the positive or negative direction of the autocorrelation in advance, one-tailed tests of statistical significance are available.

Possible Solutions

When interest centers on a time series and the lag of the dependent variable, it is tempting to attempt solving the autocorrelation problem by simply including a lagged dependent variable (e.g., y_(t−1)) as a predictor in OLS regression or as a covariate in analysis of covariance. Unfortunately, this alternative creates a worse problem. Because the observations are correlated, the residual term e is now correlated back with y_(t−1), which is a predictor for the regression or analysis of covariance. Not only does this alternative introduce bias into the previously unbiased B coefficient estimates, but using lags also invalidates the use of diagnostic tests such as the Durbin-Watson d.

The first-differences (Cochrane-Orcutt) solution is one way to correct autocorrelation. This generalized least squares (GLS) solution creates a set of new variables by subtracting from each variable (not just the dependent variable) its own t − 1 lag or adjacent case. Then each newly created variable in the equation is multiplied by the weight (1 − ρ) to make the error terms behave randomly.

An analyst may also wish to check for higher order autoregressive processes. If a GLS solution was created for the AR(1) autocorrelation, some
statistical programs will test for the statistical significance of ρ using the Durbin-Watson d for the reestimated GLS equation. If ρ does not equal 0, higher order autocorrelation may exist. Possible solutions here include logarithmic or polynomial transformations of the variables, which may attenuate ρ. The analyst may also wish to examine econometrics programs that estimate higher order autoregressive equations.

In the Cochrane-Orcutt solution, the first observation is lost; this may be problematic in small samples. The Prais-Winsten approximation has been used to estimate the first observation in case of bivariate correlation or regression (with a loss of one additional degree of freedom).

In most social and behavioral science data, once autocorrelation is corrected, conclusions about the statistical significance of the results become much more conservative. Even when corrections for ρ have been made, some statisticians believe that R²s or η²s to estimate the total explained variance in regression or analysis of variance models are invalid if autocorrelation existed in the original analyses. The explained variance tends to be quite large under these circumstances, reflecting the covariation of trends or behaviors.

Several disciplines have other ways of handling autocorrelation. Some alternate solutions are paired t tests and multivariate analysis of variance for either repeated measures or multiple dependent variables. Econometric analysts diagnose treatments of higher order periodicity, lags for either predictors or dependent variables, and moving averages (often called ARIMA). Specialized computer programs exist, either freestanding or within larger packages, such as the Statistical Package for the Social Sciences (SPSS; an IBM company, formerly called PASW Statistics).

Autocorrelation is an unexpectedly common phenomenon in many social and behavioral science settings (e.g., psychological experiments, the tracking of student development over time, social trends in employment, or cluster samples). Its major possible consequence, leading one to believe that accidental sample fluctuations are statistically significant, is serious. Checking and correcting for autocorrelation should become a more automatic process in the data analyst's tool chest than it currently appears to be.

Susan Carol Losh

See also Cluster Sampling; Hierarchical Linear Modeling; Intraclass Correlation; Multivariate Analysis of Variance (MANOVA); Time-Series Study

Further Readings

Bowerman, B. L. (2004). Forecasting, time series, and regression (4th ed.). Pacific Grove, CA: Duxbury Press.
Gujarati, D. (2009). Basic econometrics (5th ed.). New York: McGraw-Hill.
Luke, D. A. (2004). Multilevel modeling. Thousand Oaks, CA: Sage.
Menard, S. W. (2002). Longitudinal research (2nd ed.). Thousand Oaks, CA: Sage.
Ostrom, C. W., Jr. (1990). Time series analysis: Regression techniques (2nd ed.). Thousand Oaks, CA: Sage.
B
BAR CHART

The term bar chart refers to a category of diagrams in which values are represented by the height or length of bars, lines, or other symbolic representations. Bar charts are typically used to display variables on a nominal or ordinal scale. Bar charts are a very popular form of information graphics often used in research articles, scientific reports, textbooks, and popular media to visually display relationships and trends in data. However, for this display to be effective, the data must be presented accurately, and the reader must be able to analyze the presentation effectively. This entry provides information on the history of the bar chart, the types of bar charts, and the construction of a bar chart.

History

The creation of the first bar chart is attributed to William Playfair, and it appeared in The Commercial and Political Atlas in 1786. Playfair's bar graph was an adaptation of Joseph Priestley's time-line charts, which were popular at the time. Ironically, Playfair attributed his creation of the bar graph to a lack of data. In his Atlas, Playfair presented 34 plates containing line graphs or surface charts graphically representing the imports and exports from different countries over the years. Since he lacked the necessary time-series data for Scotland, he was forced to graph its trade data for a single year as a series of 34 bars, one for each of the imports and exports of Scotland's 17 trading partners. However, his innovation was largely ignored in Britain for a number of years. Playfair himself attributed little value to his invention, apologizing for what he saw as the limitations of the bar chart. It was not until 1801 and the publication of his Statistical Breviary that Playfair recognized the value of his invention. Playfair's invention fared better in Germany and France. In 1811 the German Alexander von Humboldt published adaptations of Playfair's bar graph and pie charts in Essai Politique sur le Royaume de la Nouvelle Espagne. In 1821, Jean Baptiste Joseph Fourier adapted the bar chart to create the first graph of a cumulative frequency distribution, referred to as an ogive. In 1833, A. M. Guerry used the bar chart to plot crime data, creating the first histogram. Finally, in 1859 Playfair's work began to be accepted in Britain when Stanley Jevons published bar charts in his version of an economic atlas modeled on Playfair's earlier work. Jevons in turn influenced Karl Pearson, commonly considered the "father of modern statistics," who promoted the widespread acceptance of the bar chart and other forms of information graphics.

Types

Although the terms bar chart and bar graph are now used interchangeably, the term bar chart was reserved traditionally for corresponding displays that did not have scales, grid lines, or tick marks.
[Figure: "Total Earnings for Various Companies for the Year 2007" (bar chart; vertical axis in millions of USD)]

U.S. dollars, and the widths of the bars are used to represent the percentage of the earnings coming from exports. The information expressed by the bar width can be displayed by means of a scale on

Figure 2 Area Bar Graph and Associated Data (chart titled "Total Earnings and Percentage Earnings From Exports"; bar heights in millions of USD, bar widths showing the percentage of earnings from exports)

Company      Earnings (USD)   Percentage of Earnings From Exports
Company A    4.2              40
Company B    2.1              30
Company C    1.5              75
Company D    5.7              20
Company E    2.9              50

Note: USD = U.S. dollars.

While there is no limit to the number of series that can be plotted on the same graph, it is wise to limit the number of series plotted to no more than four in order to keep the graph from becoming confusing. To reduce the size of the graph and to improve readability, the bars for separate categories can be overlapped, but the overlap should be less than 75% to prevent the graph from being mistaken for a stacked bar graph. A stacked bar graph, also called a divided or composite bar graph, has multiple series stacked end to end instead of side by side. This graph displays the relative contribution of the components of a category; a different color, shade, or pattern differentiates each component, as described in a legend. The end of the bar represents the value of the whole category, and the heights of the various data series represent the relative contribution of the components of the category. If the graph represents the separate components' percentage of the whole value rather than the actual values, this graph is commonly referred to as a 100% stacked bar graph. Lines can be drawn to connect the components of a stacked bar graph to more clearly delineate the relationship between the same components of different categories. A stacked bar graph can also use only one bar to demonstrate the contribution of the components of only one category, condition, or occasion, in which case it functions more like a pie chart.

Two data series can also be plotted together in a paired bar graph, also referred to as a sliding bar or bilateral bar graph. This graph differs from a clustered bar graph because rather than being plotted side by side, the values for one data series are plotted with horizontal bars to the left and the values for the other data series are plotted with horizontal bars to the right. The units of measurement and scale intervals for the two data series need not be the same, allowing for a visual display of correlations and other meaningful relationships between the two data series. A paired bar graph can be a variation of either a simple, clustered, or stacked bar graph. A paired bar graph without spaces between the bars is often called a pyramid graph or a two-way histogram. Another method for comparing two data series is the difference bar graph. In this type of bar graph, the bars represent the difference in the values of two data series. For instance, one could compare the performance of two different classes on a series of tests or compare the different performance of males and females on a series of assessments. The direction of the difference can be noted at the ends of bars or by labeling the bars. When comparing multiple factors at two points in time or under two different conditions, one can use a change bar graph. The bars in this graph are used to represent the change between the two conditions or times. Since the direction of change is usually important with these types of graphs, a coding system is used to indicate the direction of the change.

Figure 3 Clustered Bar Chart and Associated Data (earnings in millions of USD by year)

Company      2006   2007   2008
Company A    0.5    1      1.3
Company B    3      2.1    2.1
Company C    5      6.9    8.2
Company D    3      3.5    1.5
Company E    2      4.5    3.1

Note: USD = U.S. dollars.

Creating an Effective Bar Chart

A well-designed bar chart can effectively communicate a substantial amount of information relatively easily, but a poorly designed graph can create confusion and lead to inaccurate conclusions among readers. Choosing the correct graphing format or technique is the first step in creating an effective graphical presentation of data. Bar charts are best used for making discrete comparisons between several categorical variables because the eye can spot very small differences in relative height. However, a bar chart works best with four to six categories; attempting to display more than six categories on a bar graph can lead to a crowded and confusing graph. Once an appropriate graphing technique has been chosen, it is important to choose the orientation of the bars. A vertical presentation is more intuitive for displaying amount or quantity, and a horizontal presentation makes more sense for displaying distance or time. A horizontal presentation also allows for more space for detailed labeling of the categorical axis.

The choice of an appropriate scale is critical for accurate presentation of data in a bar graph. Simple changes in the starting point or the interval of a scale can make the graph look dramatically different and may possibly misrepresent the relationships within the data. The best method for avoiding this problem is to always begin the quantitative scale at 0 and to use a linear rather than a logarithmic scale. However, in cases in which the values to be represented are extremely large, a start value of 0 effectively hides any differences in the data because by necessity the intervals must be extremely wide. In these cases it is possible to maintain smaller intervals while still starting the scale at 0 by the use of a clearly marked scale break. Alternatively, one can highlight the true relationship between the data by starting the scale at 0 and adding an inset of a small section of the larger graph to demonstrate the true relationship. Finally, it is important to make sure the graph and its axes are clearly labeled so that the reader can understand what data are being presented. Modern technology allows the addition of many superfluous graphical elements to enhance the basic graph design. Although the addition of these elements is a matter of personal choice, it is important to remember that the primary aim of data graphics is to display data accurately and clearly. If the additional elements detract from this clarity of presentation, they should be avoided.
has been chosen, it is important to choose the Teresa P. Clark and Sara E. Bolt
See also Box-and-Whisker Plot; Distribution; Graphical Display of Data; Histogram; Pie Chart

Further Readings

Cleveland, W. S., & McGill, R. (1985). Graphical perception and graphical methods for analyzing scientific data. Science, 229, 828-833.
Harris, R. L. (1999). Information graphics: A comprehensive illustrated reference. New York: Oxford University Press.
Playfair, W. (1786). The commercial and political atlas. London: Corry.
Shah, P., & Hoeffner, J. (2002). Review of graph comprehension research: Implications for instruction. Educational Psychology Review, 14, 47-69.
Spence, I. (2000). The invention and use of statistical charts. Journal de la Société Française de Statistique, 141, 77-81.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Wainer, H. (1996). Depicting error. American Statistician, 50, 101-111.

BARTLETT'S TEST

The assumption of equal variances across treatment groups may cause serious problems if violated in one-way analysis of variance models. A common test for homogeneity of variances is Bartlett's test. This statistical test checks whether the variances from different groups (or samples) are equal.

Suppose that there are r treatment groups and we want to test

H0: σ₁² = σ₂² = ⋯ = σᵣ²

versus

H1: σₘ² ≠ σₖ² for some m ≠ k.

In this context, we assume that we have independently chosen random samples of size nᵢ, i = 1, …, r from each of the r independent populations. Let Xᵢⱼ ~ N(μᵢ, σᵢ²) be independently distributed with a normal distribution having mean μᵢ and variance σᵢ² for each j = 1, …, nᵢ and each i = 1, …, r. Let X̄ᵢ be the sample mean and Sᵢ² the sample variance of the sample taken from the ith group or population. The uniformly most powerful unbiased parametric test of size α for testing for equality of variances among r populations is known as Bartlett's test, and Bartlett's test statistic is given by

ℓ₁ = [(S₁²)^w₁ · (S₂²)^w₂ ⋯ (Sᵣ²)^wᵣ] / [w₁S₁² + w₂S₂² + ⋯ + wᵣSᵣ²],

where wᵢ = (nᵢ − 1)/(N − r) is known as the weight for the ith group and N = n₁ + ⋯ + nᵣ is the sum of the individual sample sizes. In the equireplicate case (i.e., n₁ = ⋯ = nᵣ = n), the weights are equal, and wᵢ = 1/r for each i = 1, …, r. The test statistic is the ratio of the weighted geometric mean of the group sample variances to their weighted arithmetic mean. The values of the test statistic are bounded as 0 ≤ ℓ₁ ≤ 1 by Jensen's inequality. Large values of ℓ₁ (i.e., values near 1) indicate agreement with the null hypothesis, whereas small values indicate disagreement with the null. The terminology ℓ₁ is used to indicate that Bartlett's test is based on M. S. Bartlett's modification of the likelihood ratio test, wherein he replaced the sample sizes nᵢ with their corresponding degrees of freedom, nᵢ − 1. Bartlett did so to make the test unbiased. In the equireplicate case, Bartlett's test and the likelihood ratio test result in the same test statistic and same critical region.

The distribution of ℓ₁ is complex even when the null hypothesis is true. R. E. Glaser showed that the distribution of ℓ₁ could be expressed as a product of independently distributed beta random variables. In doing so he renewed much interest in the exact distribution of Bartlett's test. We reject H0 provided ℓ₁ ≤ b_α(n₁, …, nᵣ), where Pr[ℓ₁ < b_α(n₁, …, nᵣ)] = α when H0 is true. The Bartlett critical value b_α(n₁, …, nᵣ) is indexed by the level of significance and the individual sample sizes. The critical values were first tabled in the equireplicate case, and the critical value was simplified to b_α(n, …, n) = b_α(n). Tabulating critical values with unequal sample sizes becomes counterproductive because of possible combinations of groups, sample sizes, and levels of significance.

Example

Consider an experiment in which lead levels are measured at five different sites. The data in Table 1 come from Paul Berthouex and Linfield Brown:
Table 1 Ten Measurements of Lead Concentration (μg/L) Measured on Waste Water Specimens

       Measurement No.
Lab    1    2    3    4    5    6    7    8    9    10
1      3.4  3.0  3.4  5.0  5.1  5.5  5.4  4.2  3.8  4.2
2      4.5  3.7  3.8  3.9  4.3  3.9  4.1  4.0  3.0  4.5
3      5.3  4.7  3.6  5.0  3.6  4.5  4.6  5.3  3.9  4.1
4      3.2  3.4  3.1  3.0  3.9  2.0  1.9  2.7  3.8  4.2
5      3.3  2.4  2.7  3.2  3.3  2.9  4.4  3.4  4.8  3.0

Source: Berthouex, P. M., & Brown, L. C. (2002). Statistics for environmental engineers (2nd ed., p. 170). Boca Raton, FL: Lewis.
From these data one can compute the sample variances and weights, which are given below:

Lab    Weight   Variance
1      0.2      0.81778
2      0.2      0.19344
3      0.2      0.41156
4      0.2      0.58400
5      0.2      0.54267

By substituting these values into the formula for ℓ₁, we obtain

ℓ₁ = 0.46016 / 0.509889 = 0.90248.

Critical values, b_α(r, n), of Bartlett's test are tabled for cases in which the sample sizes are equal and α = .05. These values are given in D. Dyer and Jerome Keating for various values of r, the number of groups, and n, the common sample size (see Table 2). Works by Glaser, M. T. Chao and Glaser, and S. B. Nandi provide tables of exact critical values of Bartlett's test. The most extensive set is contained in Dyer and Keating. Extensions (for larger numbers of groups) to the table of critical values can be found in Keating, Glaser, and N. S. Ketchum.

Approximation

In the event that the sample sizes are not equal, one can use the Dyer-Keating approximation to the critical values:

b_α(a; n₁, …, n_a) ≈ (n₁/N)·b_α(a, n₁) + ⋯ + (n_a/N)·b_α(a, n_a).

So for the lead levels, we have the following values: b₀.₀₅(5, 10) = 0.8025. At the 5% level of significance, there is not enough evidence to reject the null hypothesis of equal variances.

Because of the complexity of the distribution of ℓ₁, Bartlett's test originally employed an approximation. Bartlett proved that

−ln ℓ₁ / c ~ χ²(r − 1),

where

c = [1 + (1/(3(r − 1)))·(∑ᵢ 1/(nᵢ − 1) − 1/(N − r))] / (N − r),

with the sum running over i = 1, …, r. The approximation works poorly for small sample sizes. This approximation is more accurate as sample sizes increase, and it is recommended that min(nᵢ) ≥ 3 and that most nᵢ > 5.

Assumptions

Bartlett's test statistic is quite sensitive to nonnormality. In fact, W. J. Conover, M. E. Johnson, and M. M. Johnson echo the results of G. E. P. Box that Bartlett's test is very sensitive to samples that exhibit nonnormal kurtosis. They recommend that Bartlett's test be used only when the data conform to normality. Prior to using Bartlett's test, it is recommended that one test for normality using an appropriate test such as the Shapiro-Wilk W test. In the event that the normality assumption is violated, it is recommended that one test equality of variances using Howard Levene's test.

Mark T. Leung and Jerome P. Keating

See also Critical Value; Likelihood Ratio Statistic; Normality Assumption; Parametric Statistics; Variance
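Bartlett's chi-square approximation can also be checked for the lead example. In the sketch below (not part of the original entry), the 5% critical value of χ² with 4 degrees of freedom, 9.488, is hard-coded, and for equal group sizes the sum ∑ 1/(nᵢ − 1) reduces to r/(n − 1):

```python
import math

r, n = 5, 10          # five labs, ten measurements each
N = r * n
l1 = 0.90248          # Bartlett's statistic from the example

# c = [1 + (1/(3(r-1))) * (sum_i 1/(n_i - 1) - 1/(N - r))] / (N - r)
correction = 1 + (1 / (3 * (r - 1))) * (r / (n - 1) - 1 / (N - r))
c = correction / (N - r)

statistic = -math.log(l1) / c
print(round(statistic, 2))  # 4.42

# 4.42 is well below the 5% chi-square cutoff of 9.488 with r - 1 = 4 df,
# so the approximation agrees with the exact test: equal variances are not rejected.
```

With nᵢ = 10 per group, the sample sizes comfortably satisfy the min(nᵢ) ≥ 3 and nᵢ > 5 recommendations, so the approximation is reasonable here.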
of correlation—is the ratio of category variance to the sum of category variance plus variance of the observations within each category. This coefficient is denoted R² and is interpreted as the proportion of variance of the observations explained by the categories or as the proportion of the variance explained by the discriminant model. The performance of the fixed-effect model can also be represented graphically as a tolerance ellipsoid that encompasses a given proportion (say 95%) of the observations. The overlap between the tolerance ellipsoids of two categories is proportional to the number of misclassifications between these two categories.

New observations can also be projected onto the discriminant factor space, and they can be assigned to the closest category. When the actual assignment of these observations is not known, the model can be used to predict category membership. The model is then called a random model (as opposed to the fixed model). An obvious problem, then, is to evaluate the quality of the prediction for new observations. Ideally, the performance of the random-effect model is evaluated by counting the number of correct and incorrect classifications for new observations and computing a confusion matrix on these new observations. However, it is not always practical or even feasible to obtain new observations, and therefore the random-effect performance is, in general, evaluated using computational cross-validation techniques such as the jackknife or the bootstrap. For example, a jackknife approach (also called leave one out) can be used by which each observation is taken out of the set, in turn, and predicted from the model built on the other observations. The predicted observations are then projected in the space of the fixed-effect discriminant scores. This can also be represented graphically as a prediction ellipsoid. A prediction ellipsoid encompasses a given proportion (say 95%) of the new observations. The overlap between the prediction ellipsoids of two categories is proportional to the number of misclassifications of new observations between these two categories.

The stability of the discriminant model can be assessed by a cross-validation model such as the bootstrap. In this procedure, multiple sets of observations are generated by sampling with replacement from the original set of observations, and the category barycenters are computed from each of these sets. These barycenters are then projected onto the discriminant factor scores. The variability of the barycenters can be represented graphically as a confidence ellipsoid that encompasses a given proportion (say 95%) of the barycenters. When the confidence intervals of two categories do not overlap, these two categories are significantly different.

In summary, BADIA is a GPCA performed on the category barycenters. GPCA encompasses various techniques, such as correspondence analysis, biplot, Hellinger distance analysis, discriminant analysis, and canonical variate analysis. For each specific type of GPCA, there is a corresponding version of BADIA. For example, when the GPCA is correspondence analysis, this is best handled with the most well-known version of BADIA: discriminant correspondence analysis. Because BADIA is based on GPCA, it can also analyze data tables obtained by the concatenation of blocks (i.e., subtables). In this case, the importance (often called the contribution) of each block to the overall discrimination can also be evaluated and represented as a graph.

Hervé Abdi and Lynne J. Williams

See also Bootstrapping; Canonical Correlation Analysis; Correspondence Analysis; Discriminant Analysis; Jackknife; Matrix Algebra; Principal Components Analysis

Further Readings

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. London: Academic Press.
Saporta, G., & Niang, N. (2006). Correspondence analysis and classification. In M. Greenacre & J. Blasius (Eds.), Multiple correspondence analysis and related methods (pp. 371-392). Boca Raton, FL: Chapman & Hall/CRC.

BAYES'S THEOREM

Bayes's theorem is a simple mathematical formula used for calculating conditional probabilities. It figures prominently in subjectivist or Bayesian
66 Bayes’s Theorem
approaches to statistics, epistemology, and inductive logic. Subjectivists, who maintain that rational belief is governed by the laws of probability, lean heavily on conditional probabilities in their theories of evidence and their models of empirical learning. Bayes's theorem is central to these paradigms because it simplifies the calculation of conditional probabilities and clarifies significant features of the subjectivist position.

This entry begins with a brief history of Thomas Bayes and the publication of his theorem. Next, the entry focuses on probability and its role in Bayes's theorem. Last, the entry explores modern applications of Bayes's theorem.

History

Thomas Bayes was born in 1702, probably in London, England. Others have suggested the place of his birth to be Hertfordshire. He was the eldest of six children of Joshua and Ann Carpenter Bayes. His father was a nonconformist minister, one of the first seven in England. Information on Bayes's childhood is scarce. Some sources state that he was privately educated, and others state he received a liberal education to prepare for the ministry. After assisting his father for many years, he spent his adult life as a Presbyterian minister at the chapel in Tunbridge Wells. In 1742, Bayes was elected as a fellow by the Royal Society of London. He retired in 1752 and remained in Tunbridge Wells until his death in April of 1761.

Throughout his life he wrote very little, and only two of his works are known to have been published. These two essays are Divine Benevolence, published in 1731, and Introduction to the Doctrine of Fluxions, published in 1736. He was known as a mathematician not for these essays but for two other papers he had written but never published. His studies focused on the areas of probability and statistics. His posthumously published article, now known by the title "An Essay Towards Solving a Problem in the Doctrine of Chances," developed the idea of inverse probability, which later became associated with his name as Bayes's theorem. Inverse probability was so called because it involves inferring backwards from the data to the parameter (i.e., from the effect to the cause). Initially, Bayes's ideas attracted little attention. It was not until after the French mathematician Pierre-Simon Laplace published his paper "Mémoire sur la Probabilité des Causes par les Évènements" in 1774 that Bayes's ideas gained wider attention. Laplace extended the use of inverse probability to a variety of distributions and introduced the notion of "indifference" as a means of specifying prior distributions in the absence of prior knowledge. During the 19th century, inverse probability became the most commonly used method for making statistical inferences. Some of the more famous examples of the use of inverse probability to draw inferences during this period include estimation of the mass of Saturn, the probability of the birth of a boy at different locations, the utility of antiseptics, and the accuracy of judicial decisions.

In the latter half of the 19th century, authorities such as Siméon-Denis Poisson, Bernard Bolzano, Robert Leslie Ellis, Jakob Friedrich Fries, John Stuart Mill, and A. A. Cournot began to make distinctions between probabilities about things and probabilities involving our beliefs about things. Some of these authors attached the terms objective and subjective to the two types of probability. Toward the end of the century, Karl Pearson, in his Grammar of Science, argued for using experience to determine prior distributions, an approach that eventually evolved into what is now known as empirical Bayes. The Bayesian idea of inverse probability was also being challenged toward the end of the 19th century, with the criticism focusing on the use of uniform or "indifference" prior distributions to express a lack of prior knowledge.

The criticism of Bayesian ideas spurred research into statistical methods that did not rely on prior knowledge and the choice of prior distributions. In 1922, Ronald Aylmer Fisher's paper "On the Mathematical Foundations of Theoretical Statistics," which introduced the ideas of likelihood and maximum likelihood estimates, revolutionized modern statistical thinking. Jerzy Neyman and Egon Pearson extended Fisher's work by adding the ideas of hypothesis testing and confidence intervals. Eventually the collective work of Fisher, Neyman, and Pearson became known as frequentist methods. From the 1920s to the 1950s, frequentist methods displaced inverse probability as the primary methods used by researchers to make statistical inferences.
Bayes’s Theorem 67
Interest in using Bayesian methods for statistical inference revived in the 1950s, inspired by Leonard Jimmie Savage's 1954 book The Foundations of Statistics. Savage's work built on the previous work of several earlier authors exploring the idea of subjective probability, in particular the work of Bruno de Finetti. It was during this time that the terms Bayesian and frequentist began to be used to refer to the two statistical inference camps. The number of papers and authors using Bayesian statistics continued to grow in the 1960s. Examples of Bayesian research from this period include an investigation by Frederick Mosteller and David Wallace into the authorship of several of the Federalist papers and the use of Bayesian methods to estimate the parameters of time-series models. The introduction of Markov chain Monte Carlo (MCMC) methods to the Bayesian world in the late 1980s made computations that were impractical or impossible earlier realistic and relatively easy. The result has been a resurgence of interest in the use of Bayesian methods to draw statistical inferences.

Publishing of Bayes's Theorem

Bayes never published his mathematical papers, and therein lies a mystery. Some suggest his theological concerns with modesty might have played a role in his decision. However, after Bayes's death, his family asked Richard Price to examine Bayes's work. Price was responsible for the communication of Bayes's essay on probability and chance to the Royal Society. Although Price was making Bayes's work known, he was occasionally mistaken for the author of the essays and for a time received credit for them. In fact, Price only added introductions and appendixes to the works he had published for Bayes, although he would eventually write a follow-up paper to Bayes's work.

The present form of Bayes's theorem was actually derived not by Bayes but by Laplace. Laplace used the information provided by Bayes to construct the theorem in 1774. Only in later papers did Laplace acknowledge Bayes's work.

Inspiration of Bayes's Theorem

In "An Essay Towards Solving a Problem in the Doctrine of Chances," Bayes posed a problem to be solved: "Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named." Bayes's reasoning began with the idea of conditional probability:

If P(B) > 0, the conditional probability of A given B, denoted by P(A|B), is

P(A|B) = P(A ∩ B)/P(B), also written P(AB)/P(B).

Bayes's main focus then became defining P(B|A) in terms of P(A|B). A key component that Bayes needed was the law of total probability. Sometimes it is not possible to calculate directly the probability of the occurrence of an event A. However, it is possible to find P(A|B) and P(A|Bᶜ) for some event B, where Bᶜ is the complement of B. The weighted average, P(A), of the probability of A given that B has occurred and the probability of A given that B has not occurred can be defined as follows:

Let B be an event with P(B) > 0 and P(Bᶜ) > 0. Then for any event A,

P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).

If there are k events, B₁, …, Bₖ, that form a partition of the sample space, and A is another event in the sample space, then the events B₁ ∩ A, B₂ ∩ A, …, Bₖ ∩ A will form a partition of A. Thus, the law of total probability can be extended as follows:

Let Bⱼ be an event with P(Bⱼ) > 0 for j = 1, …, k. Then for any event A,

P(A) = Σⱼ P(Bⱼ)P(A|Bⱼ).

These basic rules of probability served as the inspiration for Bayes's theorem.

Bayes's Theorem

Bayes's theorem allows for a reduction in uncertainty by considering events that have occurred. The theorem is applicable as long as the probability of the more recent event (given an earlier event) is known. With this theorem, one can find the probability of an earlier event given that a later one has been observed: for the partition B₁, …, Bₖ above and an event A with P(A) > 0,

P(Bⱼ|A) = P(Bⱼ)P(A|Bⱼ) / Σᵢ P(Bᵢ)P(A|Bᵢ), for j = 1, …, k.
68 Bayes’s Theorem
… estimate the posterior distribution of the parameters. This posterior distribution is used to infer the values of the parameters, along with the associated uncertainty. Multiple tests and predictions can be performed simultaneously and flexibly. Quantities of interest that are functions of the parameters are straightforward to estimate, again including the uncertainty. Posterior inferences can be updated as more data are obtained, so study design is more flexible than for frequentist methods.

Bayesian inference is possible in a number of contexts in which frequentist methods are deficient. For instance, Bayesian inference can be performed with small data sets. More broadly, Bayesian statistics is useful when the data set may be large but few data points are associated with a particular treatment. In such situations standard frequentist estimators can be inappropriate because the likelihood may not be well approximated by a normal distribution. The use of Bayesian statistics also allows for the incorporation of prior information and for simultaneous inference using data from multiple studies. Inference is also possible for complex hierarchical models.

Lately, computation for Bayesian models is most often done via MCMC techniques, which obtain dependent samples from the posterior distribution of the parameters. In MCMC, a set of initial parameter values is chosen. These parameter values are then iteratively updated via a specially constructed Markovian transition. In the limit of the number of iterations, the parameter values are distributed according to the posterior distribution. In practice, after approximate convergence of the Markov chain, the time series of sets of parameter values can be stored and then used for inference via empirical averaging (i.e., Monte Carlo). The accuracy of this empirical averaging depends on the effective sample size of the stored parameter values, that is, the number of iterations of the chain after convergence, adjusted for the autocorrelation of the chain. One method of specifying the Markovian transition is via Metropolis–Hastings, which proposes a change in the parameters, often according to a random walk (the assumption that many unpredictable small fluctuations will occur in a chain of events), and then accepts or rejects that move with a probability that is dependent on the current and proposed state.

In order to perform valid inference, the Markov chain must have approximately converged to the posterior distribution before the samples are stored and used for inference. In addition, enough samples must be stored after convergence to have a large effective sample size; if the autocorrelation of the chain is high, then the number of samples needs to be large. Lack of convergence or high autocorrelation of the chain is detected via convergence diagnostics, which include autocorrelation and trace plots, as well as Geweke, Gelman–Rubin, and Heidelberger–Welch diagnostics. Software for MCMC can also be validated by a distinct set of techniques. These techniques compare the posterior samples drawn by the software with samples from the prior and the data model, thereby validating the joint distribution of the data and parameters as estimated by the software.

Brandon K. Vaughn and Daniel L. Murphy

See also Estimation; Hypothesis; Inference: Deductive and Inductive; Parametric Statistics; Probability, Laws of

Further Readings

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Box, G. E., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Dale, A. I. (1991). A history of inverse probability from Thomas Bayes to Karl Pearson. New York: Springer-Verlag.
Daston, L. (1994). How probabilities came to be objective and subjective. Historia Mathematica, 21, 330–344.
Fienberg, S. E. (1992). A brief history of statistics in three and one-half chapters: A review essay. Statistical Science, 7, 208–225.
Fienberg, S. E. (2006). When did Bayesian analysis become "Bayesian"? Bayesian Analysis, 1(1), 1–40.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309–368.
Laplace, P. S. (1774). Mémoire sur la probabilité des causes par les évènements [Memoir on the probability of causes by events].
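The random-walk Metropolis–Hastings update and burn-in procedure described in this entry can be sketched in a few lines; the target below is a standard normal log-density standing in for a posterior, and the proposal scale, iteration count, and burn-in length are illustrative assumptions, not part of the entry:

```python
import math
import random

random.seed(0)

def log_target(x):
    # Log-density of the target (posterior) up to an additive constant;
    # a standard normal is used here purely as a stand-in.
    return -0.5 * x * x

x = 0.0           # initial parameter value
samples = []
for i in range(30000):
    proposal = x + random.gauss(0.0, 1.0)   # random-walk proposal
    # Accept the move with probability min(1, target(proposal)/target(x)).
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    if i >= 5000:                           # discard burn-in before storing
        samples.append(x)

# Inference via empirical (Monte Carlo) averaging over the stored chain.
post_mean = sum(samples) / len(samples)
```

Before trusting `post_mean`, the diagnostics named in the entry (trace and autocorrelation plots; Geweke, Gelman–Rubin, and Heidelberger–Welch) would be applied to `samples`.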
Behavior Analysis Design 71
unique designs, which have been outlined by James Johnston and Hank Pennypacker. Consequently, this method takes an approach to the collection, validity, analysis, and generality of data that is different from approaches that primarily use group designs and inferential statistics to study behavior.

Measurement Considerations

Defining Response Classes

Measurement in single-subject design is objective and restricted to observable phenomena. Measurement considerations can contribute to behavioral variability that can obscure experimental effects, so care must be taken to avoid potential confounding variables. Measurement focuses on targeting a response class, which is any set of responses that result in the same environmental change. Response classes are typically defined by function rather than topography. This means that the form of the responses may vary considerably but produce the same result. For example, a button can be pressed several ways: with one finger, with the palm, with the toe, or with several fingers. The exact method of action is unimportant, but any behavior resulting in button depression is part of a response class. Topographical definitions are likely to result in classes that include some or all of several functional response classes, which can produce unwanted variability. Researchers try to arrange the environment to minimize variability within a clearly defined response class.

There are many ways to quantify the occurrence of a response class member. The characteristics of the behavior captured in its definition must suit the needs of the experiment, be able to address the experimental question, and meet practical limits for observation. In animal studies, a response is typically defined as the closing of a circuit in an experimental chamber by depressing a lever or pushing a key or button. With this type of response, the frequency and duration that a circuit is closed can be recorded. Conditions can be arranged to measure the force used to push the button or lever, the amount of time that occurs between responses, and the latency and accuracy of responding in relation to some experimentally arranged stimulus. These measurements serve as dependent variables. In human studies, the response is typically more broadly defined and may be highly individualized. For example, self-injurious behavior in a child with autism may include many forms that meet a common definition of minimum force that leaves a mark. Just as in basic research, a variety of behavioral measurements can be used as dependent variables. The response class must be sensitive to the influence of the independent variable (IV) without being affected by extraneous variables so that effects can be detected. The response class must be defined in such a way that researchers can clearly observe and record behavior.

Observation and Recording

Once researchers define a response class, the methods of observation and recording are important in order to obtain a complete and accurate record of the subject's behavior. Measurement is direct when the focus of the experiment is the same as the phenomenon being measured. Indirect measurement is typically avoided in behavioral research because it undermines experimental control. Mechanical, electrical, or electronic devices can be used to record responses, or human observers can be selected and trained for data collection. Machine and human observations may be used together throughout an experiment. Behavior is continuous, so observational procedures must be designed to detect and record each response within the targeted response class.

Experimental Design and Demonstration of Experimental Effects

Experimental Arrangements

The most basic single-subject experimental design is the baseline–treatment sequence, the AB design. This procedure cannot account for certain confounds, such as maturation, environmental history, or unknown extraneous variables. Replicating components of the AB design provides additional evidence that the IV is the source of any change in the dependent measure. Replication designs consist of a baseline or control condition (A), followed by one or more experimental or treatment conditions (B), with additional conditions indicated by successive letters. Subjects experience both the control and the experimental
conditions, often in sequence and perhaps more than once. An ABA design replicates the original baseline, while an ABAB design replicates the baseline and the experimental conditions, allowing researchers to infer causal relationships between variables. These designs can be compared with a light switch. The first time one moves the switch from the on position to the off position, one cannot be completely certain that one's behavior was responsible for the change in lighting conditions. One cannot be sure the light bulb did not burn out at that exact moment or the electricity did not shut off coincidentally. Confidence is bolstered when one pushes the switch back to the on position and the lights turn back on. With a replication of moving the switch to off again, one has total confidence that the switch is controlling the light.

Single-subject research determines the effectiveness of the IV by eliminating or holding constant any potential confounding sources of variability. One or more behavioral measures are used as dependent variables so that data comparisons are made from one condition to another. Any change in behavior between the control and the experimental conditions is attributed to the effects of the IV. The outcome provides a detailed interpretation of the effects of an IV on the behavior of the subject.

Replication designs work only in cases in which effects are reversible. Sequence effects can occur when experience in one experimental condition affects a subject's behavior in subsequent conditions. The researcher must be careful to ensure consistent experimental conditions over replications. Multiple-baseline designs with multiple individuals, multiple behaviors, or multiple settings can be used in circumstances in which sequence effects occur, or as a variation on the AB design. Results are compared across control and experimental conditions, and factors such as irreversibility of effects, maturation of the subject, and sequence effect can be examined.

Behavioral Variability

Variability in single-subject design refers both to variations in features of responding within a single response class and to variations in summary measures of that class, which researchers may be examining across sessions or entire phases of the experiment. The causes of variability can often be identified and systematically evaluated. Behavior analysts have demonstrated that frequently changing the environment results in greater degrees of variability. Inversely, holding the environment constant for a time allows behavior to stabilize and minimizes variability. Murray Sidman has offered several suggestions for decreasing variability, including strengthening the variables that directly maintain the behavior of interest, such as increasing deprivation, increasing the intensity of the consequences, making stimuli more detectable, or providing feedback to the subject. If these changes do not immediately affect variability, it could be that behavior requires exposure to the condition for a longer duration. Employing these strategies to control variability increases the likelihood that results can be interpreted and replicated.

Reduction of Confounding Variables

Extraneous, or confounding, variables affect the detection of behavioral change due to the IV. Only by eliminating or minimizing external sources of variability can data be judged as accurately reflecting performance. Subjects should be selected that are similar along extra-experimental dimensions in order to reduce extraneous sources of variability. For example, it is common practice to use animals from the same litter or to select human participants on the basis of age, level of education, or socioeconomic status. Environmental history of an organism can also influence the target behavior; therefore, subject selection methods should attempt to minimize differences between subjects. Some types of confounding variables cannot be removed, and the researcher must design an experiment to minimize their effects.

Steady State Behavior

Single-subject designs rely on the collection of steady state baseline data prior to the administration of the IV. Steady states are obtained by exposing the subject to only one condition consistently until behavior stabilizes over time. Stabilization is determined by graphically examining the variability in behavior. Stability can be defined as a pattern of responding that exhibits relatively little
variation in its measured dimensional quantities over time.

Stability criteria specify the standards for evaluating steady states. Dimensions of behavior such as duration, latency, rate, and intensity can be judged as stable or variable during the course of experimental study, with rate most commonly used to determine behavioral stability. Stability criteria must set limits on two types of variability over time. The first is systematic increases and decreases of behavior, or trend, and the second is unsystematic changes in behavior, or bounce. Only when behavior is stable, without trend or bounce, should the next condition be introduced. Specific stability criteria include time, visual inspection of graphical data, and simple statistics. Time criteria can designate the number of experimental sessions or a discrete period in which behavior stabilizes. The time criterion chosen must encompass even the slowest subject. A time criterion allowing for longer exposure to the condition may needlessly lengthen the experiment if stability occurs rapidly; on the other hand, behavior might still be unstable, necessitating experience and good judgment when a time criterion is used. A comparison of steady state behavior under baseline and different experimental conditions allows researchers to examine the effects of the IV.

Scientific Discovery Through Data Analysis

Single-subject designs use visual comparison of steady state responding between conditions as the primary method of data analysis. Visual analysis usually involves the assessment of several variables evident in graphed data. These variables include upward or downward trend, the amount of variability within and across conditions, and differences in means and stability both within and across conditions. Continuous data are displayed against the smallest unit of time that is likely to show systematic variability. Cumulative graphs provide the greatest level of detail by showing the distribution of individual responses over time and across various stimulus conditions. Data can be summarized with less precision by the use of descriptive statistics such as measures of central tendency (mean and median), variation (interquartile range and standard deviation), and association (correlation and linear regression). These methods obscure individual response variability but can highlight the effects of the experimental conditions on responding, thus promoting steady states. Responding summarized across individual sessions represents some combination of individual responses across a group of sessions, such as mean response rate during baseline conditions. This method should not be the only means of analysis but is useful when one is looking for differences among sets of sessions sharing common characteristics.

Single-subject design uses ongoing behavioral data to establish steady states and make decisions about the experimental conditions. Graphical analysis is completed throughout the experiment, so any problems with the design or measurement can be uncovered immediately and corrected. However, graphical analysis is not without criticism. Some have found that visual inspection can be insensitive to small but potentially important differences in graphed data. When evaluating the significance of data from this perspective, one must take into account the magnitude of the effect, variability in the data, adequacy of the experimental design, value of misses and false alarms, social significance, durability of behavior change, and number and kinds of subjects. The best approach to analysis of behavioral data probably uses some combination of both graphical and statistical methods because each approach has relative advantages and disadvantages.

Judging Significance

Changes in level, trend, variability, and serial dependency must be detected in order for one to evaluate behavioral data. Level refers to the general magnitude of behavior for some specific dimension. For example, 40 responses per minute is a lower level than 100 responses per minute. Trend refers to the increasing or decreasing nature of behavior change. Variability refers to changes in behavior from measurement to measurement. Serial dependency occurs when a measurement obtained during one time period is related to a value obtained earlier.

Several features of graphs are important, such as trend lines, axis units, number of data points, and condition demarcation. Trend lines are lines that fit the data best within a condition. These lines allow for discrimination of level and may
assist in discrimination of behavioral trends. The axis serves as an anchor for data, and data points near the bottom of a graph are easier to interpret than data in the middle of a graph. The number of data points also seems to affect decisions, with fewer points per phase improving accuracy.

Generality

Generality, or how the results of an individual experiment apply in a broader context outside the laboratory, is essential to advancing science. The dimensions of generality include subjects, response classes, settings, species, variables, methods, and processes. Single-subject designs typically involve a small number of subjects that are evaluated numerous times, permitting in-depth analysis of these individuals and the phenomenon in question, while providing systematic replication. Systematic replication enhances generality of findings to other populations or conditions and increases internal validity. The internal validity of an experiment is demonstrated when additional subjects demonstrate similar behavior under similar conditions; although the absolute level of behavior may vary among subjects, the relationship between the IV and the relative effect on behavior has been reliably demonstrated, illustrating generalization.

Further Readings

Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Mazur, J. E. (2007). Learning and behavior (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Poling, A., & Grossett, D. (1986). Basic research designs in applied behavior analysis. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 7–27). New York: Plenum Press.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Boston: Authors Cooperative.
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. New York: D. Appleton-Century.
Skinner, B. F. (1965). Science and human behavior. New York: Free Press.
Watson, J. B. (1913). Psychology as a behaviorist views it. Psychological Review, 20, 158–177.

BEHRENS–FISHER t′ STATISTIC

The Behrens–Fisher t′ statistic can be employed when one seeks to make inferences about the means of two normal populations without assuming the variances are equal. The statistic was offered first by W. U. Behrens in 1929 and reformulated by Ronald A. Fisher in 1939:

t′ = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(s₁²/n₁ + s₂²/n₂),

where x̄ᵢ, sᵢ², and nᵢ are the sample means, variances, and sizes, and where the angle θ is defined by tan θ = (s₁/√n₁)/(s₂/√n₂). The value of t′ is computed
and compared with the percentage points of the Behrens–Fisher distribution. Tables for the Behrens–Fisher distribution are available, and the table entries are prepared on the basis of the four numbers ν₁ = n₁ − 1, ν₂ = n₂ − 1, θ, and the Type I error rate α. For example, Ronald A. Fisher and Frank Yates in 1957 presented significance points of the Behrens–Fisher distribution in two tables, one for ν₁ and ν₂ = 6, 8, 12, 24, ∞; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .05, .01, and the other for ν₁ that is greater than ν₂ = 1, 2, 3, 4, 5, 6, 7; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .10, .05, .02, .01. Seock-Ho Kim and Allan S. Cohen in 1998 presented significance points of the Behrens–Fisher distribution for ν₁ that is greater than ν₂ = 2, 4, 6, 8, 10, 12; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .10, .05, .02, .01, and also offered computer programs for obtaining tail areas and percentage values of the Behrens–Fisher distribution.

Using the Behrens–Fisher distribution, one can construct the 100(1 − α)% interval that contains μ₁ − μ₂ with

x̄₁ − x̄₂ ± t′α/2(ν₁, ν₂, θ) √(s₁²/n₁ + s₂²/n₂),

where the probability that t′ > t′α/2(ν₁, ν₂, θ) is α/2 or, equivalently, Pr[t′ > t′α/2(ν₁, ν₂, θ)] = α/2.

This entry first illustrates the statistic with an example. Then related methods are presented, and the methods are compared.

Example

Driving times from a person's house to work were measured for two different routes with n₁ = 5 and n₂ = 11. The ordered data from the first route are 6.5, 6.8, 7.1, 7.3, 10.2, yielding x̄₁ = 7.580 and s₁² = 2.237, and the data from the second route are 5.8, 5.8, 5.9, 6.0, 6.0, 6.0, 6.3, 6.3, 6.4, 6.5, 6.5, yielding x̄₂ = 6.136 and s₂² = 0.073. It is assumed that the two independent samples were drawn from two normal distributions having means μ₁ and μ₂ and variances σ₁² and σ₂², respectively. A researcher wants to know whether the average driving times differed for the two routes. The test statistic under the null hypothesis of equal population means is t′ = 2.143 with ν₁ = 4, ν₂ = 10, and θ = 83.078°. From the computer program, Pr(t′ > 2.143) = .049, indicating the null hypothesis cannot be rejected at α = .05 when the alternative hypothesis is nondirectional, Hₐ: μ₁ ≠ μ₂, because p = .098. The corresponding 95% interval for the population mean difference is [−0.421, 3.308].

Related Methods

The Student's t test for independent means can be used when the two population variances are assumed to be equal, σ₁² = σ₂² = σ²:

t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(sₚ²/n₁ + sₚ²/n₂),

where the pooled variance that provides the estimate of the common population variance σ² is defined as sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2). It has a t distribution with ν = n₁ + n₂ − 2 degrees of freedom. The example data yield the Student's t = 3.220, ν = 14, the two-tailed p = .006, and the 95% confidence interval of [0.482, 2.405]. The null hypothesis of equal population means is rejected at the nominal α = .05, and the confidence interval does not contain 0.

When the two variances cannot be assumed to be the same, one of the solutions is to use the Behrens–Fisher t′ statistic. There are several alternative solutions. One simple way to solve the two means problem, called the smaller degrees of freedom t test, is to use the same t′ statistic that has a t distribution with different degrees of freedom:

t′ ∼ t[min(ν₁, ν₂)],

where the degrees of freedom is the smaller value of ν₁ or ν₂. Note that this method should be used only if no statistical software is available because it yields a conservative test result and a wider confidence interval. The example data yield t′ = 2.143, ν = 4, the two-tailed p = .099, and the 95% confidence interval of [−0.427, 3.314]. The null hypothesis of equal population means is not rejected at α = .05, and the confidence interval contains 0.

B. L. Welch in 1938 presented an approximate t test. It uses the same t′ statistic that has
Under Welch's approximate t test, the same t′ statistic has a t distribution with the approximate degrees of freedom ν′:

t′ ∼ t(ν′),

where ν′ = 1 / [c²/ν1 + (1 − c)²/ν2], with c = (s1²/n1) / (s1²/n1 + s2²/n2). The approximation is accurate when both sample sizes are 5 or larger. Although there are other solutions, Welch's approximate t test might be the best practical solution to the Behrens–Fisher problem because of its availability in popular statistical software, including SPSS (an IBM company, formerly called PASW® Statistics) and SAS. The example data yield t′ = 2.143, ν′ = 4.118, the two-tailed p = .097, and the 95% confidence interval of [−0.406, 3.293]. The null hypothesis of equal population means is not rejected at α = .05, and the confidence interval contains 0.

In addition to the previous method, the Welch–Aspin t test employs an approximation of the distribution of t′ by the method of moments. The example data yield t′ = 2.143, and the critical value under the Welch–Aspin t test for the two-tailed test is 2.715 at α = .05. The corresponding 95% confidence interval is [−0.386, 3.273]. Again, the null hypothesis of equal population means is not rejected at α = .05, and the confidence interval contains 0.

Comparison of Methods

The Behrens–Fisher t′ statistic and the Behrens–Fisher distribution are based on Fisher's fiducial approach. The approach is to find a fiducial probability distribution that is a probability distribution of a parameter from observed data. Consequently, the interval that involves t′α/2(ν1, ν2, θ) is referred to as the 100(1 − α)% fiducial interval.

The Bayesian solution to the Behrens–Fisher problem was offered by Harold Jeffreys in 1940. When uninformative uniform priors are used for the population parameters, the Bayesian solution to the Behrens–Fisher problem is identical to that of Fisher's in 1939. The Bayesian highest posterior density interval that contains the population mean difference with the probability of 1 − α is identical to the 100(1 − α)% fiducial interval.

There are many solutions to the Behrens–Fisher problem based on the frequentist approach of Jerzy Neyman and Egon S. Pearson's sampling theory. Among the methods, Welch's approximate t test and the Welch–Aspin t test are the most important ones from the frequentist perspective. The critical values and the confidence intervals from various methods under the frequentist approach are in general different from those of the fiducial or the Bayesian approach. For the one-sided alternative hypothesis, however, it is interesting to note that the generalized extreme region to obtain the generalized p developed by Kam-Wah Tsui and Samaradasa Weerahandi in 1989 is identical to the extreme area from the Behrens–Fisher t′ statistic.

The critical values for the two-sided alternative hypothesis at α = .05 for the example data are 2.776 for the smaller degrees of freedom t test, 2.767 for the Behrens–Fisher t′ test, 2.745 for Welch's approximate t test, 2.715 for the Welch–Aspin t test, and 2.145 for the Student's t test. The respective 95% fiducial and confidence intervals are [−0.427, 3.314] for the smaller degrees of freedom test, [−0.421, 3.308] for the Behrens–Fisher t′ test, [−0.406, 3.293] for Welch's approximate t test, [−0.386, 3.273] for the Welch–Aspin t test, and [0.482, 2.405] for the Student's t test. The smaller degrees of freedom t test yielded the most conservative result, with the largest critical value and the widest confidence interval. The Student's t test yielded the smallest critical value and the shortest confidence interval. All other intervals lie between these two intervals. The differences between many solutions to the Behrens–Fisher problem might be less than their differences from the Student's t test when sample sizes are greater than 10.

The popular statistical software programs SPSS and SAS produce results from Welch's approximate t test and the Student's t test, as well as the respective confidence intervals. It is essential to have a table that contains the percentage points of the Behrens–Fisher distribution, or computer programs that can calculate the tail areas and percentage values, in order to use the Behrens–Fisher t′ test or to obtain the fiducial interval. Note that Welch's approximate t test may not be as effective as the Welch–Aspin t test. Note also that the sequential testing of the population means on the basis of the result from either Levene's test of the equal population variances from SPSS or the folded F test from SAS is not recommended in general.
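Welch's approximate degrees of freedom for the example can be checked with a short sketch (illustrative code, not from the entry; it uses the rounded variances reported above):

```python
# Sketch (not from the entry): Welch's approximate degrees of freedom nu'
# for the driving-time example, using the rounded variances reported there.
s1_sq, n1 = 2.237, 5
s2_sq, n2 = 0.073, 11

c = (s1_sq / n1) / (s1_sq / n1 + s2_sq / n2)                  # about .985
nu_prime = 1 / (c ** 2 / (n1 - 1) + (1 - c) ** 2 / (n2 - 1))  # about 4.118
```

Because the first sample's variance dominates, c is close to 1 and ν′ stays close to ν1 = 4 rather than the pooled n1 + n2 − 2 = 14.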
The difficulty lies in the complicated nature of control of the Type I error (rejecting a true null hypothesis) in such sequential testing.

Seock-Ho Kim

See also Mean Comparisons; Student's t Test; t Test, Independent Samples

Further Readings

Behrens, W. U. (1929). Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen [A contribution to error estimation with few observations]. Landwirtschaftliche Jahrbücher, 68, 807–837.
Fisher, R. A. (1939). The comparison of samples with possibly unequal variances. Annals of Eugenics, 9, 174–180.
Fisher, R. A., & Yates, F. (1957). Statistical tables for biological, agricultural and medical research (4th ed.). Edinburgh, UK: Oliver and Boyd.
Jeffreys, H. (1940). Note on the Behrens-Fisher formula. Annals of Eugenics, 10, 48–51.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (Vol. 2, 2nd ed.). New York: Wiley.
Kendall, M., & Stuart, A. (1979). The advanced theory of statistics (Vol. 2, 4th ed.). New York: Oxford University Press.
Kim, S.-H., & Cohen, A. S. (1998). On the Behrens-Fisher problem: A review. Journal of Educational and Behavioral Statistics, 23, 356–377.
Tsui, K.-H., & Weerahandi, S. (1989). Generalized p-values in significance testing of hypotheses in the presence of nuisance parameters. Journal of the American Statistical Association, 84, 602–607; Correction, 86, 256.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362.

BERNOULLI DISTRIBUTION

The Bernoulli distribution is a discrete probability distribution for a random variable that takes only two possible values, 0 and 1. Examples of events that lead to such a random variable include coin tossing (head or tail), answers to a test item (correct or incorrect), outcomes of a medical treatment (recovered or not recovered), and so on. Although it is the simplest probability distribution, it provides a basis for other important probability distributions, such as the binomial distribution and the negative binomial distribution.

Definition and Properties

An experiment of chance whose result has only two possibilities is called a Bernoulli trial (or Bernoulli experiment). Let p denote the probability of success in a Bernoulli trial (0 < p < 1). Then, a random variable X that assigns value 1 for a success with probability p and value 0 for a failure with probability 1 − p is called a Bernoulli random variable, and it follows the Bernoulli distribution with probability p, which is denoted by X ∼ Ber(p). The probability mass function of Ber(p) is given by

P(X = x) = p^x (1 − p)^(1−x), x = 0, 1.

The mean of X is p, and the variance is p(1 − p). Figure 1 shows the probability mass function of Ber(.7). The horizontal axis represents values of X, and the vertical axis represents the corresponding probabilities. Thus, the height is .7 at X = 1 and .3 at X = 0. The mean of Ber(0.7) is 0.7, and the variance is .21.

Figure 1. Probability Mass Function of the Bernoulli Distribution With p = .7

Suppose that a Bernoulli trial with probability p is independently repeated n times, and we obtain a random sample X1, X2, ..., Xn. Then, the number of successes Y = X1 + X2 + ... + Xn follows the binomial distribution with probability p and the number of trials n.
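As a small illustration of the definition above, the Ber(.7) probability mass function and its moments can be sketched as follows (hypothetical code, not from the entry):

```python
# Sketch (not from the entry): the Ber(p) probability mass function and its
# mean and variance, with p = .7 as in Figure 1.
p = 0.7
pmf = {x: p ** x * (1 - p) ** (1 - x) for x in (0, 1)}    # heights .3 and .7

mean = sum(x * pr for x, pr in pmf.items())               # E[X] = p = .7
var = sum((x - mean) ** 2 * pr for x, pr in pmf.items())  # p(1 - p) = .21
```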
The binomial distribution with probability p and the number of trials n is denoted by Y ∼ Bin(n, p). Stated in the opposite way, the Bernoulli distribution is a special case of the binomial distribution in which the number of trials n is 1. The probability mass function of Bin(n, p) is given by

P(Y = y) = [n! / (y!(n − y)!)] p^y (1 − p)^(n−y), y = 0, 1, ..., n,

where n! is the factorial of n, which equals the product n(n − 1)···2·1. The mean of Y is np, and the variance is np(1 − p). Figure 2 shows the probability mass function of Bin(10, .7), which is obtained as the distribution of the sum of 10 independent random variables, each of which follows Ber(.7). The height of each bar represents the probability that Y takes the corresponding value; for example, the probability of Y = 7 is about .27. The mean is 7 and the variance is 2.1. In general, the distribution is skewed to the right when p < .5, skewed to the left when p > .5, and symmetric when p = .5.

Figure 2. Probability Mass Function of the Binomial Distribution With p = .7 and n = 10

Relationship to Other Probability Distributions

The Bernoulli distribution is a basis for many probability distributions, as well as for the binomial distribution. The number of failures before observing a success t times in independent Bernoulli trials follows the negative binomial distribution with probability p and the number of successes t. The geometric distribution is a special case of the negative binomial distribution in which the number of failures is counted before observing the first success (i.e., t = 1).

Assume a finite Bernoulli population in which individual members are denoted by either 0 or 1. If sampling is done by randomly selecting one member at each time with replacement (i.e., each selected member is returned to the population before the next selection is made), then the resulting sequence constitutes independent Bernoulli trials, and the number of successes follows the binomial distribution. If sampling is done at random but without replacement, then each of the individual selections is still a Bernoulli trial, but they are no longer independent of each other. In this case, the number of successes follows the hypergeometric distribution, which is specified by the population probability p, the number of trials n, and the population size m.

Various approximations are available for the binomial distribution. These approximations are extremely useful when n is large, because in that case the factorials in the binomial probability mass function become prohibitively large and make probability calculations tedious. For example, by the central limit theorem, Z = (Y − np)/√(np(1 − p)) approximately follows the standard normal distribution N(0, 1) when Y ∼ Bin(n, p). The constant 0.5 is often added in the numerator to improve the approximation (called continuity correction). As a rule of thumb, the normal approximation works well when either (a) np(1 − p) > 9 or (b) np > 9 for 0 < p ≤ .5. The Poisson distribution with parameter np also approximates Bin(n, p) well when n is large and p is small. The Poisson approximation works well if n^0.31 p < .47; for example, p < .19, .14, and .11 when n = 20, 50, and 100, respectively. If n^0.31 p ≥ .47, then the normal distribution gives better approximations.

Estimation

Inferences regarding the population proportion p can be made from a random sample X1, X2, ..., Xn from Ber(p), whose sum follows Bin(n, p). The population proportion p can be estimated by the sample mean (or the sample proportion).
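The binomial quantities described above can be sketched in Python (not from the entry; math.comb requires Python 3.8 or later):

```python
# Sketch (not from the entry): the Bin(n, p) probability mass function,
# checked against the Bin(10, .7) values quoted above.
from math import comb   # Python 3.8+

def binom_pmf(y, n, p):
    # comb(n, y) is n! / (y! (n - y)!), so the factorials never have to be
    # formed explicitly.
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 10, 0.7
mean = sum(y * binom_pmf(y, n, p) for y in range(n + 1))               # np = 7
var = sum((y - mean) ** 2 * binom_pmf(y, n, p) for y in range(n + 1))  # 2.1
```

Evaluating binom_pmf(7, 10, 0.7) reproduces the "about .27" bar height noted for Figure 2.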
The sample proportion is p̂ = X̄ = (X1 + X2 + ... + Xn)/n, which is an unbiased estimator of p.

Interval estimation is usually made by the normal approximation. If n is large enough (e.g., n > 100), a 100(1 − α)% confidence interval is given by

p̂ ± zα/2 √(p̂(1 − p̂)/n),

where p̂ is the sample proportion and zα/2 is the value of the standard normal variable that gives the probability α/2 in the right tail. For smaller ns, the quadratic approximation gives better results:

[1 / (1 + z²α/2/n)] [ p̂ + z²α/2/(2n) ± zα/2 √( p̂(1 − p̂)/n + z²α/2/(4n²) ) ].

The quadratic approximation works well if .1 < p < .9 and n is as large as 25.

There are often cases in which one is interested in comparing two population proportions. Suppose that we obtained sample proportions p̂1 and p̂2 with sample sizes n1 and n2, respectively. Then, the difference between the population proportions is estimated by the difference between the sample proportions p̂1 − p̂2. Its standard error is given by

SE(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ),

from which one can construct a 100(1 − α)% confidence interval as

(p̂1 − p̂2) ± zα/2 SE(p̂1 − p̂2).

Applications

Logistic Regression

Logistic regression is a regression model about the Bernoulli probability and is used when the dependent variable takes only two possible values. Logistic regression models are formulated as generalized linear models in which the canonical link function is the logit link and the Bernoulli distribution is assumed for the dependent variable.

In the standard case in which there are K linear predictors x1, x2, ..., xK and the dependent variable Y, which represents a Bernoulli random variable (i.e., Y = 0, 1), the logistic regression model is expressed by the equation

ln[ p(x) / (1 − p(x)) ] = b0 + b1x1 + ... + bKxK,

where ln is the natural logarithm; p(x) is the probability of Y = 1 (or the expected value of Y) given x1, x2, ..., xK; and b0, b1, ..., bK are the regression coefficients. The left-hand side of the above equation is called the logit, or the log-odds ratio, of proportion p. The logit is symmetric about zero; it is positive (negative) if p > .5 (p < .5), and zero if p = .5. It approaches positive (negative) infinity as p approaches 1 (0). Another representation equivalent to the above is

p(x) = exp(b0 + b1x1 + ... + bKxK) / [1 + exp(b0 + b1x1 + ... + bKxK)].

The right-hand side is called the logistic regression function. In either case, the model states that the distribution of Y given predictors x1, x2, ..., xK is Ber[p(x)], where the logit of p(x) is determined by a linear combination of predictors x1, x2, ..., xK. The regression coefficients are estimated from N sets of observed data (Yi, xi1, xi2, ..., xiK), i = 1, 2, ..., N.

The Binomial Error Model

The binomial error model is one of the measurement models in the classical test theory. Suppose that there are n test items, each of which is scored either 1 (correct) or 0 (incorrect). The binomial error model assumes that the distribution of person i's total score Xi given his or her "proportion-corrected" true score ζi (0 < ζi < 1) is Bin(n, ζi):

P(Xi = x | ζi) = [n! / (x!(n − x)!)] ζi^x (1 − ζi)^(n−x), x = 0, 1, ..., n.

This model builds on a simple assumption that for all items, the probability of a correct response for a person with true score ζi is equal to ζi, but the error variance, nζi(1 − ζi), varies as a function of ζi, unlike the standard classical test model. The observed total score Xi = xi serves as an estimate of nζi, and the associated error variance can be estimated as well.
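Returning to the estimation of a single proportion, the two confidence intervals given earlier in this entry can be sketched numerically (the values p̂ = .6 and n = 25 are hypothetical, chosen only for illustration):

```python
# Sketch (not from the entry): the normal-approximation interval and the
# quadratic approximation for a proportion; p_hat = .6 and n = 25 are
# hypothetical values.
from math import sqrt

p_hat, n = 0.6, 25
z = 1.96                                   # z_{alpha/2} for a 95% interval

# Large-sample (normal approximation) interval.
half = z * sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - half, p_hat + half)

# Quadratic approximation, better for smaller n.
denom = 1 + z ** 2 / n
center = (p_hat + z ** 2 / (2 * n)) / denom
spread = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
quad = (center - spread, center + spread)
```

For these numbers the quadratic interval is pulled toward .5 and is shorter on the upper side, consistent with its better small-sample behavior.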
This error variance can also be estimated, as σ̂i² = xi(n − xi)/(n − 1). Averaging this error variance over N persons gives the overall error variance σ̂² = [x̄(n − x̄) − s²]/(n − 1), where x̄ is the sample mean of observed total scores over the N persons and s² is the sample variance. It turns out that by substituting σ̂² and s² in the definition of reliability, the reliability of the n-item test equals the Kuder–Richardson formula 21 under the binomial error model.

History

The name Bernoulli was taken from Jakob Bernoulli, a Swiss mathematician in the 17th century. He made many contributions to mathematics, especially in calculus and probability theory. He is the first person who expressed the idea of the law of large numbers, along with its mathematical proof (thus, the law is also called Bernoulli's theorem). Bernoulli derived the binomial distribution in the case in which the probability p is a rational number, and his result was published in 1713. Later in the 18th century, Thomas Bayes generalized Bernoulli's binomial distribution by removing its rational restriction on p in his formulation of a statistical theory that is now known as Bayesian statistics.

Kentaro Kato and William M. Bart

See also Logistic Regression; Normal Distribution; Odds Ratio; Poisson Distribution; Probability, Laws of

Further Readings

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Johnson, N. L., Kemp, A. W., & Kotz, S. (2005). Univariate discrete distributions (3rd ed.). Hoboken, NJ: Wiley.
Lindgren, B. W. (1993). Statistical theory (4th ed.). Boca Raton, FL: Chapman & Hall/CRC.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

BETA

Beta (β) refers to the probability of Type II error in a statistical hypothesis test. Frequently, the power of a test, equal to 1 − β rather than β itself, is referred to as a measure of quality for a hypothesis test. This entry discusses the role of β in hypothesis testing and its relationship with significance (α).

Hypothesis Testing and Beta

Hypothesis testing is a very important part of statistical inference: the formal process of deciding whether a particular contention (called the null hypothesis) is supported by the data, or whether a second contention (called the alternative hypothesis) is preferred. In this context, one can represent the situation in a simple 2 × 2 decision table in which the columns reflect the true (unobservable) situation and the rows reflect the inference made based on a set of data:

                                     Null Hypothesis Is    Alternative Hypothesis
Decision                             True/Preferred        Is True/Preferred
Fail to reject null hypothesis       Correct decision      Type II error
Reject null hypothesis in favor
  of alternative hypothesis          Type I error          Correct decision

The language used in the decision table is subtle but deliberate. Although people commonly speak of accepting hypotheses, under the maxim that scientific theories are not so much proven as supported by evidence, we might more properly speak of failing to reject a hypothesis rather than of accepting it. Note also that it may be the case that neither the null nor the alternative hypothesis is, in fact, true, but generally we might think of one as preferable over the other on the basis of evidence. Semantics notwithstanding, the decision table makes clear that there exist two distinct possible types of error: that in which the null hypothesis is rejected when it is, in fact, true, and that in which the null hypothesis is not rejected when it is, in fact, false. A simple example that helps one in thinking about the difference between these two types of error is a criminal trial in the U.S. judicial system.
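In simple settings, the two error probabilities can be computed exactly. A sketch (the test, σ = 1, n = 25, and true mean 0.5 are all hypothetical choices, not from the entry) for a two-sided z test of H0: μ = 0 at α = .05:

```python
# Sketch (not from the entry): computing beta exactly for a two-sided z test
# of H0: mu = 0 with known sigma; all numbers here are hypothetical.
from statistics import NormalDist

norm = NormalDist()
sigma, n, alpha = 1.0, 25, 0.05
mu_true = 0.5                             # the alternative actually in force

se = sigma / n ** 0.5                     # standard error of the sample mean
crit = norm.inv_cdf(1 - alpha / 2) * se   # reject H0 when |xbar| > crit

# beta is the probability that xbar falls inside (-crit, crit) when the
# true mean is mu_true, i.e., the probability of failing to reject H0.
beta = norm.cdf((crit - mu_true) / se) - norm.cdf((-crit - mu_true) / se)
power = 1 - beta
```

Here β is about .29, so even a moderately large true effect goes undetected roughly three times in ten at this sample size.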
In that system, there is an initial presumption of innocence (null hypothesis), and evidence is presented in order to reach a decision to convict (reject the null hypothesis) or acquit (fail to reject the null). In this context, a Type I error is committed if an innocent person is convicted, while a Type II error is committed if a guilty person is acquitted. Clearly, both types of error cannot occur in a single trial; after all, a person cannot be both innocent and guilty of a particular crime. However, a priori we can conceive of the probability of each type of error, with the probability of a Type I error called the significance level of a test and denoted by α, and the probability of a Type II error denoted by β, with 1 − β, the probability of not committing a Type II error, called the power of the test.

Relationship With Significance

Just as it is impossible to realize both types of error in a single test, it is also not possible to minimize both α and β in a particular experiment with fixed sample size. In this sense, in a given experiment, there is a trade-off between α and β, meaning that both cannot be specified or guaranteed to be low. For example, a simple way to guarantee no chance of a Type I error would be to never reject the null hypothesis regardless of the data, but such a strategy would typically result in a very large β. Hence, it is common practice in statistical inference to fix the significance level at some nominal, low value (usually .05) and to compute and report β in communicating the result of the test. Note the implied asymmetry between the two types of error possible from a hypothesis test: α is held at some prespecified value, while β is not constrained. The preference for controlling α rather than β also has an analogue in the judicial example above, in which the concept of "beyond reasonable doubt" captures the idea of setting α at some low level, and where there is an oft-stated preference for setting a guilty person free over convicting an innocent person, thereby preferring to commit a Type II error over a Type I error. The common choice of .05 for α most likely stems from Sir Ronald Fisher's 1926 statement that he "prefers to set a low standard of significance at the 5% point, and ignore entirely all results that fail to reach that level." He went on to say that "a scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance" (Fisher, 1926, p. 504).

Although it is not generally possible to control both α and β for a test with a fixed sample size, it is typically possible to decrease β while holding α constant if the sample size is increased. As a result, a simple way to conduct tests with high power (low β) is to select a sample size sufficiently large to guarantee a specified power for the test. Of course, such a sample size may be prohibitively large or even impossible, depending on the nature and cost of the experiment. From a research design perspective, sample size is the most critical aspect of ensuring that a test has sufficient power, and a priori sample size calculations designed to produce a specified power level are common when designing an experiment or survey. For example, if one wished to test the null hypothesis that a mean μ was equal to μ0 versus the alternative that μ was equal to μ1 > μ0, the sample size required to ensure a Type II error of β if α = .05 is n = {σ(1.645 − Φ⁻¹(β))/(μ1 − μ0)}², where Φ is the standard normal cumulative distribution function and σ is the underlying standard deviation, an estimate of which (usually the sample standard deviation) is used to compute the required sample size.

The value of β for a test is also dependent on the effect size, that is, the measure of how different the null and alternative hypotheses are, or the size of the effect that the test is designed to detect. The larger the effect size, the lower β will typically be at fixed sample size; in other words, the more easily the effect will be detected.

Michael A. Martin and Steven Roberts

See also Hypothesis; Power; p Value; Type I Error; Type II Error

Further Readings

Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 23, 503–513.
Lehmann, E. L. (1986). Testing statistical hypotheses. New York: Wiley.
Moore, D. (1979). Statistics: Concepts and controversies. San Francisco: W. H. Freeman.
Bias

[…] randomly assign participants to the intervention or the control group. If the collaborators sometimes broke with random assignment and assigned the juveniles who were most in need (e.g., had the worst criminal records) to the intervention group, then when both groups were subsequently followed to determine whether they continued to break the law (or were caught doing so), the selection bias would make it difficult to find a difference between the two groups. The preintervention differences in criminal behavior between the intervention and control groups might mask any effect of the intervention or even make it appear as if the intervention increased criminal behavior.

Experimenter Expectancy Effects

Researchers usually have hypotheses about how subjects will perform under different experimental conditions. When a researcher knows which experimental group a subject is assigned to, the researcher may unintentionally behave differently toward the participant. The different treatment, which systematically varies with the experimental condition, may cause the participant to behave in a way that confirms the researcher's hypothesis or expectancy, making it impossible to determine whether it is the difference in the experimenter's behavior or in the experimental conditions that causes the change in the subject's behavior. Robert Rosenthal and his colleagues were among the first to establish experimenter expectancy effects when they told teachers that some of their students had been identified as "late bloomers" whose academic performance was expected to improve over the course of the school year. Although the students chosen to be designated as late bloomers had in fact been selected randomly, the teachers' expectations about their performance appeared to cause these students to improve.

Response Bias

Another source of systematic error comes from participant response sets, such as the tendency for participants to answer questions in an agreeable manner (e.g., "yes" and "agree"), known as an acquiescent response set. If all the dependent measures are constructed such that agreement with an item means the same thing (e.g., agreement always indicates liberal political attitudes rather than sometimes indicating conservative attitudes and sometimes indicating liberal attitudes), it is possible for researchers to over- or underestimate the favorability of participants' attitudes, whether participants possess a particular trait, or the likelihood that they will engage in a particular behavior.

Avoiding Bias

Careful research design can minimize systematic errors in collected data. Random sampling reduces sample bias. Random assignment to condition minimizes or eliminates selection bias. Ensuring that experimenters are blind to experimental conditions eliminates the possibility that experimenter expectancies will influence participant behavior or bias the data collected. Bias reduction improves researchers' ability to generalize findings and to draw causal conclusions from the data.

Margaret Bull Kovera

See also Experimenter Expectancy Effect; Response Bias; Sampling; Selection; Systematic Error

Further Readings

Larzelere, R. E., Kuhn, B. R., & Johnson, B. (2004). The intervention selection bias: An unrecognized confound in intervention research. Psychological Bulletin, 130, 289–303.
Rosenthal, R. (2002). Covert communication in classrooms, clinics, courtrooms, and cubicles. American Psychologist, 57, 839–849.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research methods and data analysis (Chapter 10, pp. 205–230). New York: McGraw-Hill.
Welkenhuysen-Gybels, J., Billiet, J., & Cambré, B. (2003). Adjustment for acquiescence in the assessment of the construct equivalence of Likert-type score items. Journal of Cross-Cultural Psychology, 34, 702–722.

BIASED ESTIMATOR

In many scientific research fields, statistical models are used to describe a system or a population, to interpret a phenomenon, or to investigate the […]
[…] currently enrolled in this university, and the population mean of the amount of credit card debt of these undergraduate students, denoted by θ, is the parameter of interest. To estimate θ, a random sample is collected from the university, and the sample mean of the amount of credit card debt is calculated. Denote this sample mean by θ̂1. Then E(θ̂1) = θ; that is, θ̂1 is an unbiased estimator. If the largest amount of credit card debt from the sample, call it θ̂2, is used to estimate θ, then obviously θ̂2 is biased. In other words, E(θ̂2) ≠ θ.

Example 2

In this example a more abstract scenario is examined. Consider a statistical model in which a random variable X follows a normal distribution with mean μ and variance σ², and suppose a random sample X1, ..., Xn is observed. Let the parameter θ be μ. It is seen in Example 1 that X̄ = (1/n)(X1 + ... + Xn), the sample mean of X1, ..., Xn, is an unbiased estimator for θ. But X̄² is a biased estimator for μ² (or θ²). This is because X̄ follows a normal […]

[…] would yield a biased estimator. A heuristic argument is given here. If μ were known, (1/n) Σ (Xi − μ)² could be calculated, which would be an unbiased estimator for σ². But since μ is not known, it has to be replaced by X̄. This replacement actually makes the numerator smaller. That is, Σ (Xi − X̄)² ≤ Σ (Xi − μ)² regardless of the value of μ. Therefore, the denominator has to be reduced a little bit (from n to n − 1) accordingly.

A closely related concept is the bias of an estimator, which is defined as E(θ̂) − θ. Therefore, an unbiased estimator can also be defined as an estimator whose bias is zero, while a biased estimator is one whose bias is nonzero. A biased estimator is said to underestimate the parameter if the bias is negative or overestimate the parameter if the bias is positive.

Biased estimators are usually not preferred in estimation problems, because in the long run, they do not provide an accurate "guess" of the parameter. Sometimes, however, cleverly constructed biased estimators are useful because although their expectation does not equal the parameter under estimation, they may have a small variance.
86 Bivariate Regression
variance. To this end, a criterion that is quite com- See also Distribution; Estimation; Expected Value
monly used in statistical science for judging the
quality of an estimator needs to be introduced.
Further Readings
The mean square error (MSE) of an estimatorh θ^
2 i Rice, J. A. (1994). Mathematical statistics and data
for the parameter θ is defined as E θ^ θ . analysis (2nd ed.). Belmont, CA: Duxbury Press.
Apparently, one should seek estimators that make the MSE small, which means that θ̂ is "close" to θ. Notice that

E[(θ̂ − θ)²] = E[(θ̂ − E θ̂)²] + (E θ̂ − θ)²
             = Var(θ̂) + Bias²,

meaning that the magnitude of the MSE, which is always nonnegative, is determined by two components: the variance and the bias of the estimator. Therefore, an unbiased estimator (for which the bias would be zero), if possessing a large variance, may be inferior to a biased estimator whose variance and bias are both small. One of the most prominent examples is the shrinkage estimator, in which a small amount of bias for the estimator gains a great reduction of variance. Example 4 is a more straightforward example of the usage of a biased estimator.

Example 4

Let X be a Poisson random variable, that is, P(X = x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, . . . . Suppose the parameter θ = e^(−2λ), which is essentially [P(X = 0)]², is of interest and needs to be estimated. If an unbiased estimator, say θ̂₁(X), for θ is desired, then by the definition of unbiasedness, it must satisfy Σ_{x=0}^{∞} θ̂₁(x) e^(−λ) λ^x / x! = e^(−2λ) or, equivalently, Σ_{x=0}^{∞} θ̂₁(x) λ^x / x! = e^(−λ) for all positive values of λ. Clearly, the only solution is that θ̂₁(x) = (−1)^x. But this unbiased estimator is rather absurd. For example, if X = 10, then the estimator θ̂₁ takes the value of 1, whereas if X = 11, then θ̂₁ is −1. As a matter of fact, a much more reasonable estimator would be θ̂₂(X) = e^(−2X), based on the maximum likelihood approach. This estimator is biased but always has a smaller MSE than θ̂₁(X).

Zhigang Zhang and Qianxing Mo

BIVARIATE REGRESSION

Regression is a statistical technique used to help investigate how variation in one or more variables predicts or explains variation in another variable. This popular statistical technique is flexible in that it can be used to analyze experimental or nonexperimental data with multiple categorical and continuous independent variables. If only one variable is used to predict or explain the variation in another variable, the technique is referred to as bivariate regression. When more than one variable is used to predict or explain variation in another variable, the technique is referred to as multiple regression. Bivariate regression is the focus of this entry.

Various terms are used to describe the independent variable in regression, namely, predictor variable, explanatory variable, or presumed cause. The dependent variable is often referred to as an outcome variable, criterion variable, or presumed effect. The choice of independent variable term will likely depend on the preference of the researcher or the purpose of the research. Bivariate regression may be used solely for predictive purposes. For example, do scores on a college entrance exam predict college grade point average? Or it may be used for explanation. Do differences in IQ scores explain differences in achievement scores? It is often the case that although the term predictor is used by researchers, the purpose of the research is, in fact, explanatory.

Suppose a researcher is interested in how well reading in first grade predicts or explains fifth-grade science achievement scores. The researcher hypothesizes that those who read well in first grade will also have high science achievement in fifth grade. An example bivariate regression will be performed to test this hypothesis. The data used in this example are a random sample of students (10%) with first-grade reading and fifth-grade science scores and are taken from the Early Childhood Longitudinal Study public database.
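The claim in Example 4 is easy to check numerically. The following is a minimal sketch, not from the entry itself: the Poisson sampler, the choice λ = 1, and the number of draws are all arbitrary illustration choices. It estimates the Monte Carlo MSE of the unbiased estimator θ̂₁(X) = (−1)^X and of the biased estimator θ̂₂(X) = e^(−2X).

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate via Knuth's product method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def mse_comparison(lam=1.0, n_draws=200_000, seed=1):
    """Monte Carlo MSE of the two estimators of theta = e^(-2*lam)."""
    rng = random.Random(seed)
    theta = math.exp(-2 * lam)  # true parameter value
    se_unbiased = se_mle = 0.0
    for _ in range(n_draws):
        x = sample_poisson(lam, rng)
        se_unbiased += ((-1.0) ** x - theta) ** 2  # theta_hat_1 = (-1)^X
        se_mle += (math.exp(-2 * x) - theta) ** 2  # theta_hat_2 = e^(-2X)
    return se_unbiased / n_draws, se_mle / n_draws

mse_unbiased, mse_mle = mse_comparison()
```

For λ = 1 the unbiased estimator's MSE is pure variance, 1 − e^(−4) ≈ 0.98, while the biased estimator's MSE works out to roughly 0.28, a concrete instance of trading a little bias for a large drop in variance.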
Variation in reading scores will be used to explain variation in science achievement scores, so first-grade reading achievement is the explanatory variable and fifth-grade science achievement is the outcome variable. Before the analysis is conducted, however, it should be noted that bivariate regression is rarely used in published research. For example, intelligence is likely an important common cause of both reading and science achievement. If a researcher was interested in explaining fifth-grade science achievement, then potential important common causes, such as intelligence, would need to be included in the research.

Regression Equation

The simple equation for bivariate linear regression is Y = a + bX + e. The science achievement score, Y, for a student equals the intercept or constant (a), plus the slope (b) times the reading score (X) for that student, plus error (e). Error, or the residual component (e), represents the error in prediction, or what is not explained in the outcome variable. The error term is not necessary and may be dropped so that the following equation is used: Y′ = a + bX. Y′ is the expected (or predicted) score. The intercept is the predicted fifth-grade science score for someone whose first-grade reading score is zero. The slope (b, also referred to as the unstandardized regression coefficient) represents the predicted unit increase in science scores associated with a one-unit increase in reading scores. X is the observed score for that person. The two parameters (a and b) that describe the linear relation between the predictor and outcome are thus the intercept and the regression coefficient. These parameters are often referred to as least squares estimators and will be estimated using the two sets of scores. That is, they represent the optimal estimates that will provide the least error in prediction.

Returning to the example, the data used in the analysis were first-grade reading scores and fifth-grade science scores obtained from a sample of 1,027 school-age children. T-scores, which have a mean of 50 and standard deviation of 10, were used. The means for the scores in the sample were 51.31 for reading and 51.83 for science.

Because science scores are the outcome, the science scores are regressed on first-grade reading scores. The easiest way to conduct such analysis is to use a statistical program. The estimates from the output may then be plugged into the equation. For these data, the prediction equation is Y′ = 21.99 + (.58)X. Therefore, if a student's first-grade reading score was 60, the predicted fifth-grade science achievement score for that student would be 21.99 + (.58)60, which equals 56.79. One might ask, why even conduct a regression analysis to obtain a predicted science score when Johnny's science score was already available? There are a few possible reasons. First, perhaps a researcher wants to use the information to predict later science performance, either for a new group of students or for an individual student, based on current first-grade reading scores. Second, a researcher may want to know the relation between the two variables, and a regression provides a nice summary of the relation between the scores for all the students. For example, do those students who tend to do well in reading in first grade also do well in science in fifth grade? Last, a researcher might be interested in different outcomes related to early reading ability when considering the possibility of implementing an early reading intervention program. Of course a bivariate relation is not very informative. A much more thoughtfully developed causal model would need to be developed if a researcher was serious about this type of research.

Scatterplot and Regression Line

The regression equation describes the linear relation between variables; more specifically, it describes science scores as a function of reading scores. A scatterplot could be used to represent the relation between these two variables, and the use of a scatterplot may assist one in understanding regression. In a scatterplot, the science scores (outcome variable) are on the y-axis, and the reading scores (explanatory variable) are on the x-axis.

A scatterplot is shown in Figure 1. Each person's reading and science scores in the sample are plotted. The scores are clustered fairly closely together, and the general direction looks to be positive. Higher scores in reading are generally associated with higher scores in science. The next step is to fit a regression line.
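The least squares estimates can be computed directly from paired scores: b is the sum of cross-products of deviations divided by the sum of squared deviations of X, and a = Ȳ − bX̄. A minimal sketch follows; the five pairs of T-scores are made up for illustration and are not the Early Childhood Longitudinal Study data, while the prediction at the end uses the entry's published estimates (a = 21.99, b = .58).

```python
def fit_bivariate(x, y):
    """Least squares intercept and slope for the prediction equation y' = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of cross-products
    sxx = sum((xi - mx) ** 2 for xi in x)                     # sum of squares of x
    b = sxy / sxx
    a = my - b * mx
    return a, b

def predict(a, b, x):
    """Predicted score Y' = a + b*x."""
    return a + b * x

# Hypothetical reading (x) and science (y) T-scores for five students.
reading = [42.0, 48.0, 50.0, 55.0, 63.0]
science = [40.0, 47.0, 52.0, 55.0, 60.0]
a, b = fit_bivariate(reading, science)

# With the entry's reported estimates, a reading score of 60 gives
# the predicted science score stated in the text (56.79).
y_pred = predict(21.99, 0.58, 60)
```

Note that the fitted line always passes through the point of means (X̄, Ȳ), which is why the entry can plot the line through (51.31, 51.83).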
[Figure 1: Scatterplot and Regression Line. x-axis: First-Grade Reading T-Scores; R² linear = .311]

The regression line is plotted so that it minimizes errors in prediction, or simply, the regression line is the line that is closest to all the data points. The line is fitted automatically in many computer programs, but information obtained in the regression analysis output can also be used to plot two data points that the line should be drawn through. For example, the intercept (where the line crosses the y-axis) represents the predicted science score when reading equals zero. Because the value of the intercept was 21.99, the first data point would be found at 0 on the x-axis and at 21.99 on the y-axis. The second point on the line may be located at the mean reading score (51.31) and mean science score (51.83). A line can then be drawn through those two points. The line is shown in Figure 1. Points that are found along this regression line represent the predicted science achievement score for a person with a reading score of X.

Unstandardized and Standardized Coefficients

For a more thorough understanding of bivariate regression, it is useful to examine in more detail the output obtained after running the regression. First, the intercept has no important substantive meaning. It is unlikely that anyone would score a zero on the reading test, so it does not make much sense. It is useful in the unstandardized solution in that it is used to obtain predicted scores. Second is the regression coefficient. It was statistically significant, indicating that reading has a statistically significant influence on fifth-grade science. A 1-point T-score increase in reading is associated with a .58 T-score point increase in science scores. The bs are interpreted in the metric of the original variable. In the example, all the scores were T-scores. Unstandardized coefficients are especially useful for interpretation when the metric of the variables is meaningful. Sometimes, however, the metric of the variables is not meaningful.

Two equations were generated in the regression analysis. The first, as discussed in the example above, is referred to as the unstandardized solution. In addition to the unstandardized solution, there is a standardized solution. In this equation, the constant is dropped, and z scores (mean = 0, standard deviation = 1), rather than the T-scores (or raw scores), are used. The standardized regression coefficient is referred to as a beta weight (β). In the example, the beta weight was .56. Therefore, a one-standard-deviation increase in reading was associated with a .56-standard-deviation increase in science achievement. The unstandardized and standardized coefficients were similar in this example because T-scores are standardized scores, and the sample statistics for the T-scores were fairly close to the population mean of 50 and standard deviation of 10.

It is easy to convert back and forth from standardized to unstandardized regression coefficients:

β = b × (standard deviation of reading scores / standard deviation of science scores)

or

b = β × (standard deviation of science scores / standard deviation of reading scores).
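The two conversion formulas are inverses of each other, as a quick sketch confirms. The standard deviations below are hypothetical stand-ins, not the sample values from the example.

```python
def to_beta(b, sd_x, sd_y):
    """Standardized from unstandardized coefficient: beta = b * (SD of X / SD of Y)."""
    return b * sd_x / sd_y

def to_b(beta, sd_x, sd_y):
    """Unstandardized from standardized coefficient: b = beta * (SD of Y / SD of X)."""
    return beta * sd_y / sd_x

b = 0.58                      # unstandardized coefficient from the example
sd_read, sd_sci = 9.9, 10.2   # hypothetical sample standard deviations
beta = to_beta(b, sd_read, sd_sci)
round_trip = to_b(beta, sd_read, sd_sci)
```

When the two standard deviations are equal, β = b, which is consistent with the entry's remark that the coefficients were similar because both variables were T-scores with standard deviations near 10.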
From an interpretative standpoint, should someone interpret the unstandardized or the standardized coefficient? There is some debate over which one to use for interpretative statements, but in a bivariate regression, the easiest answer is that if both variables are in metrics that are easily interpretable, then it would make sense to use the unstandardized coefficients. If the metrics are not meaningful, then it may make more sense to use the standardized coefficient. Take, for example, number of books read per week. If number of books read per week was represented by the actual number of books read per week, the variable is in a meaningful metric. If the number of books read per week variable were coded so that 0 = no books read per week, 1 = one to three books read per week, and 2 = four or more books read per week, then the variable is not coded in a meaningful metric, and the standardized coefficient would be the better one to use for interpretation.

R and R²

In bivariate regression, typically the regression coefficient is of greatest interest. Additional information is provided in the output, however. R is used in multiple regression output and represents a multiple correlation. Because there is only one explanatory variable, R (.56) is equal to the correlation coefficient (r = .56) between reading and science scores. Note that this value is also identical to the β. Although the values of β and r are the same, the interpretation differs. The researcher is not proposing an agnostic relation between reading scores and science scores. Rather the researcher is positing that early reading explains later science achievement. Hence, there is a clear direction in the relation, and this direction is not specified in a correlation.

R² is the variance in science scores explained by reading scores. In the current example, R² = .31. First-grade reading scores explained 31% of the variance in fifth-grade science achievement scores.

Statistical Significance

R and R² are typically used to evaluate the statistical significance of the overall regression equation (the tests for the two will result in the same answer). The null hypothesis is that R² equals zero in the population. One way of calculating the statistical significance of the overall regression is to use an F test associated with the value obtained with the formula

F = (R²/k) / [(1 − R²)/(N − k − 1)].

In this formula, R² equals the variance explained, 1 − R² is the variance unexplained, and k equals the degrees of freedom (df) for the regression (which is 1 because one explanatory variable was used). With the numbers plugged in, the formula would look like

(.31/1) / [.69/(1027 − 1 − 1)]

and results in F = 462.17. An F table indicates that reading did have a statistically significant effect on science achievement, R² = .31, F(1, 1025) = 462.17, p < .01.

In standard multiple regression, a researcher typically interprets the statistical significance of R² (the statistical significance of the overall equation) and the statistical significance of the unique effects of each individual explanatory variable. Because this is bivariate regression, however, the statistical significance test of the overall regression and the regression coefficient (b) will yield the same results, and typically the statistical significance tests for each are not reported.

The statistical significance of the regression coefficient (b) is evaluated with a t test. The null hypothesis is that the slope equals zero, that is, the regression line is parallel with the x-axis. The t-value is obtained by

t = b / (standard error of b).

In this example, b = .58, and its associated standard error was .027. The t-value was 21.50. A t-table could be consulted to determine whether 21.50 is statistically significant. Or a rule of thumb may be used that given the large sample size and with a two-tailed significance test, a t-value greater than 2 will be statistically significant at the p < .05 level. Clearly, the regression coefficient was statistically significant. Earlier it was mentioned that because this is a bivariate regression, the significance of the overall regression and b provide redundant information.
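Both test statistics can be evaluated directly from the quantities the entry reports. A sketch:

```python
def f_from_r2(r2, n, k):
    """Overall-regression F test: (R^2 / k) / ((1 - R^2) / (N - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

F = f_from_r2(0.31, 1027, 1)   # about 460.5 using the rounded R-squared
t = 0.58 / 0.027               # b / SE(b), about 21.48
residual_variance = 1 - 0.31   # proportion of variance unexplained, .69
```

With the rounded inputs the formulas give F ≈ 460.5 and t ≈ 21.48 rather than the reported 462.17 and 21.50; the small discrepancies come from rounding in the reported R² and standard error. The key relationship, F = t² in the bivariate case, still holds to within that rounding.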
The use of F and t tests may thus be confusing, but note that F (462.17) equals t² (21.50²) in this bivariate case. A word of caution: This finding does not generalize to multiple regression. In fact, in a multiple regression, the overall regression might be significant, and some of the bs may or may not be significant. In a multiple regression, both the overall regression equation and the individual coefficients are examined for statistical significance.

Residuals

Before completing this explanation of bivariate regression, it will be instructive to discuss a topic that has been for the most part avoided until now: the residuals. Earlier it was mentioned that e (residual) was also included in the regression equation. Remember that regression parameter estimates minimize the prediction errors, but the prediction is unlikely to be perfect. The residuals represent the error in prediction. Or the residual variance represents the variance that is left unexplained by the explanatory variable. Returning to the example, if reading scores were used to predict science scores for those 1,027 students, each student would have a prediction equation in which his or her reading score would be used to calculate a predicted science score. Because the actual score for each person is also known, the residual for each person would represent the observed fifth-grade science score minus the predicted score obtained from the regression equation. Residuals are thus observed scores minus predicted scores, or conceptually they may be thought of as the fifth-grade science scores with effects of first-grade reading removed.

Another way to think of the residuals is to revert back to the scatterplot in Figure 1. The x-axis represents the observed reading scores, and the y-axis represents the science scores. Both predicted and actual scores are already plotted on this scatterplot. That is, the predicted scores are found on the regression line. If a person's reading score was 40, the predicted science score may be obtained by first finding 40 on the x-axis, and then moving up in a straight line until reaching the regression line. The observed science scores for this sample are also shown on the plot, represented by the dots scattered about the regression line. Some are very close to the line whereas others are farther away. Each person's residual is thus represented by the distance between the observed score and the regression line. Because the regression line represents the predicted scores, the residuals are the difference between predicted and observed scores. Again, the regression line minimizes the distance of these residuals from the regression line. Much as residuals are thought of as science scores with the effects of reading scores removed, the residual variance is the proportion of variance in science scores left unexplained by reading scores. In the example, the residual variance was .69, or 1 − R².

Regression Interpretation

An example interpretation for the reading and science example concludes this entry on bivariate regression. The purpose of this study was to determine how well first-grade reading scores explained fifth-grade science achievement scores. The regression of fifth-grade science scores on first-grade reading scores was statistically significant, R² = .31, F(1, 1025) = 462.17, p < .01. Reading accounted for 31% of the variance in science achievement. The unstandardized regression coefficient was .58, meaning that for each T-score point increase in reading, there was a .58 T-score increase in science achievement. Children who are better readers in first grade also tend to be higher achievers in fifth-grade science.

Matthew R. Reynolds

See also Correlation; Multiple Regression; Path Analysis; Scatterplot; Variance

Further Readings

Bobko, P. (2001). Correlation and regression: Principles and applications for industrial/organizational psychology and management (2nd ed.). Thousand Oaks, CA: Sage.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Keith, T. Z. (2006). Multiple regression and beyond. Boston: Pearson.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. Thousand Oaks, CA: Sage.
Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (1986). Understanding regression analysis: An introductory guide. Beverly Hills, CA: Sage.
Weisberg, S. (2005). Applied linear regression (3rd ed.). Hoboken, NJ: Wiley.

BLOCK DESIGN

Sir Ronald Fisher, the father of modern experimental design, extolled the advantages of block designs in his classic book, The Design of Experiments. He observed that block designs enable researchers to reduce error variation and thereby obtain more powerful tests of false null hypotheses. In the behavioral sciences, a significant source of error variation is the nuisance variable of individual differences. This nuisance variable can be isolated by assigning participants or experimental units to blocks so that at the beginning of an experiment, the participants within a block are more homogeneous with respect to the dependent variable than are participants in different blocks. Three procedures are used to form homogeneous blocks.

1. Match participants on a variable that is correlated with the dependent variable. Each block consists of a set of matched participants.

2. Observe each participant under all or a portion of the treatment levels or treatment combinations. Each block consists of a single participant who is observed two or more times. Depending on the nature of the treatment, a period of time between treatment level administrations may be necessary in order for the effects of one treatment level to dissipate before the participant is observed under other levels.

3. Use identical twins or litter mates. Each block consists of participants who have identical or similar genetic characteristics.

Block designs also can be used to isolate other nuisance variables, such as the effects of administering treatments at different times of day, on different days of the week, or in different testing facilities. The salient features of the five most often used block designs are described next.

Block Designs With One Treatment

Dependent Samples t-Statistic Design

The simplest block design is the randomization and analysis plan that is used with a t statistic for dependent samples. Consider an experiment to compare two ways of memorizing Spanish vocabulary. The dependent variable is the number of trials required to learn the vocabulary list to the criterion of three correct recitations. The null and alternative hypotheses for the experiment are, respectively,

H0: μ1 − μ2 = 0

and

H1: μ1 − μ2 ≠ 0,

where μ1 and μ2 denote the population means for the two memorization approaches. It is reasonable to believe that IQ is negatively correlated with the number of trials required to memorize Spanish vocabulary. To isolate this nuisance variable, n blocks of participants can be formed so that the two participants in each block have similar IQs. A simple way to form blocks of matched participants is to rank the participants in terms of IQ. The participants ranked 1 and 2 are assigned to Block 1, those ranked 3 and 4 are assigned to Block 2, and so on. Suppose that 20 participants have volunteered for the memorization experiment. In this case, n = 10 blocks of dependent samples can be formed. The two participants in each block are randomly assigned to the memorization approaches. The layout for the experiment is shown in Figure 1.

The null hypothesis is tested using a t statistic for dependent samples. If the researcher's hunch is correct, that IQ is correlated with the number of trials to learn, the design should result in a more powerful test of a false null hypothesis than would a t-statistic design for independent samples. The increased power results from isolating the nuisance variable of IQ so that it does not appear in the estimates of the error effects.
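The dependent samples t statistic is computed from the within-block difference scores: t = d̄ / (s_d/√n), with n − 1 degrees of freedom. A minimal sketch follows; the trials-to-criterion scores are hypothetical, invented here to show the computation for n = 10 IQ-matched blocks.

```python
import math

def dependent_t(x1, x2):
    """Dependent samples t statistic from paired scores (one pair per block)."""
    d = [a - b for a, b in zip(x1, x2)]            # within-block differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)  # unbiased variance
    return mean_d / math.sqrt(var_d / n)           # compare to t with n - 1 df

# Hypothetical trials to criterion for the two memorization approaches,
# one participant per approach within each of 10 IQ-matched blocks.
approach_1 = [12, 10, 15, 9, 14, 11, 13, 10, 16, 12]
approach_2 = [10, 9, 12, 8, 13, 10, 11, 9, 13, 11]
t = dependent_t(approach_1, approach_2)
```

Because the two scores within a block share the block's IQ level, that nuisance variation cancels in the difference scores, which is exactly the source of the design's extra power.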
Figure 1 Layout for a Dependent Samples t-Statistic Design
Notes: aj denotes a treatment level (Treat. Level); Yij denotes a measure of the dependent variable (Dep. Var.). Each block in the memorization experiment contains two matched participants. The participants in each block are randomly assigned to the treatment levels. The means of the treatment levels are denoted by Ȳ·1 and Ȳ·2.

Randomized Block Design

The randomized block analysis of variance design can be thought of as an extension of a dependent samples t-statistic design for the case in which the treatment has two or more levels. The layout for a randomized block design with p = 3 levels of Treatment A and n = 10 blocks is shown in Figure 2. A comparison of this layout with that in Figure 1 for the dependent samples t-statistic design reveals that the layouts are the same except that the randomized block design has three treatment levels.

In a randomized block design, a block might contain a single participant who is observed under all p treatment levels or p participants who are similar with respect to a variable that is correlated with the dependent variable. If each block contains one participant, the order in which the treatment levels are administered is randomized independently for each block, assuming that the nature of the research hypothesis permits this. If a block contains p matched participants, the participants in each block are randomly assigned to the treatment levels.

The statistical analysis of the data is the same whether repeated measures or matched participants are used. The total SS and total degrees of freedom are partitioned as follows:

SSTOTAL = SSA + SSBLOCKS + SSRESIDUAL
np − 1 = (p − 1) + (n − 1) + (n − 1)(p − 1),

where SSA denotes the Treatment A SS and SSBLOCKS denotes the blocks SS. The SSRESIDUAL is the interaction between Treatment A and blocks; it is used to estimate error effects. Many test statistics can be thought of as a ratio of error effects and treatment effects as follows:

Test statistic = [f(error effects) + f(treatment effects)] / f(error effects),

where f( ) denotes a function of the effects in parentheses. The use of a block design enables a researcher to isolate variation attributable to the blocks variable so that it does not appear in estimates of error effects. By removing this nuisance variable from the numerator and denominator of the test statistic, a researcher is rewarded with a more powerful test of a false null hypothesis.

Two null hypotheses can be tested in a randomized block design. One hypothesis concerns the equality of the Treatment A population means; the other hypothesis concerns the equality of the blocks population means. For this design and those described later, assume that the treatment represents a fixed effect and the nuisance variable, blocks, represents a random effect. For this mixed model, the null hypotheses are

H0: μ·1 = μ·2 = ··· = μ·p (Treatment A population means are equal)
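The partition SSTOTAL = SSA + SSBLOCKS + SSRESIDUAL can be verified on a small data matrix. The sketch below uses hypothetical scores for n = 4 blocks and p = 3 treatment levels; each residual term is computed directly from the cell, block, treatment, and grand means rather than by subtraction, so the identity is a genuine check rather than true by construction.

```python
def rb_partition(data):
    """Sums-of-squares partition for a randomized block design.
    data[i][j] = score for block i observed under treatment level j."""
    n, p = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * p)
    block_means = [sum(row) / p for row in data]
    treat_means = [sum(data[i][j] for i in range(n)) / n for j in range(p)]
    ss_a = n * sum((m - grand) ** 2 for m in treat_means)
    ss_blocks = p * sum((m - grand) ** 2 for m in block_means)
    ss_resid = sum((data[i][j] - block_means[i] - treat_means[j] + grand) ** 2
                   for i in range(n) for j in range(p))   # A x blocks interaction
    ss_total = sum((data[i][j] - grand) ** 2
                   for i in range(n) for j in range(p))
    return ss_total, ss_a, ss_blocks, ss_resid

scores = [[10, 12, 15],   # hypothetical scores, one row per block
          [8, 11, 13],
          [12, 13, 17],
          [9, 10, 14]]
ss_total, ss_a, ss_blocks, ss_resid = rb_partition(scores)
```

The degrees of freedom partition the same way: np − 1 = (p − 1) + (n − 1) + (n − 1)(p − 1), here 11 = 2 + 3 + 6.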
Figure 2 Layout for a Randomized Block Design With p = 3 Treatment Levels and n = 10 Blocks
Notes: aj denotes a treatment level (Treat. Level); Yij denotes a measure of the dependent variable (Dep. Var.). Each block contains three matched participants. The participants in each block are randomly assigned to the treatment levels. The means of Treatment A are denoted by Ȳ·1, Ȳ·2, and Ȳ·3, and the means of the blocks are denoted by Ȳ1·, . . . , Ȳ10·.

Figure 3 Generalized Randomized Block Design With N = 30 Participants, p = 3 Treatment Levels, and w = 5 Groups of np = (2)(3) = 6 Homogeneous Participants

In the memorization experiment described earlier, suppose that 30 volunteers are available. The 30 participants are ranked with respect to IQ. The np = (2)(3) = 6 participants with the highest IQs are assigned to Group 1, the next 6 participants are assigned to Group 2, and so on. The np = 6 participants in each group are then randomly assigned to the p = 3 treatment levels with the restriction that n = 2 participants are assigned to each level.

The total SS and total degrees of freedom are partitioned as follows:

SSTOTAL = SSA + SSG + SSA×G + SSWCELL
npw − 1 = (p − 1) + (w − 1) + (p − 1)(w − 1) + pw(n − 1),

where SSG denotes the groups SS and SSA×G denotes the interaction of Treatment A and groups. The within-cells SS, SSWCELL, is used to estimate error effects. Three null hypotheses can be tested:

1. H0: μ1· = μ2· = ··· = μp· (Treatment A population means are equal);

2. H0: σ²G = 0 (variance of the groups, G, population means is equal to zero);

3. H0: σ²A×G = 0 (variance of the A × G interaction is equal to zero);

where μijz denotes a population mean for the ith participant in the jth treatment level and zth group.
           a1b1      a1b2      a2b1      a2b2
Block 1    Y111      Y112      Y121      Y122      Ȳ1··
Block 2    Y211      Y212      Y221      Y222      Ȳ2··
Block 3    Y311      Y312      Y321      Y322      Ȳ3··
  ⋮
Block 10   Y10,11    Y10,12    Y10,21    Y10,22    Ȳ10··

Figure 4 Layout for a Two-Treatment, Randomized Block Factorial Design in Which Four Homogeneous Participants Are Randomly Assigned to the pq = 2 × 2 = 4 Treatment Combinations in Each Block
Note: ajbk denotes a treatment combination (Treat. Comb.); Yijk denotes a measure of the dependent variable (Dep. Var.).

The three null hypotheses are tested using the following F statistics:

1. F = [SSA/(p − 1)] / [SSWCELL/pw(n − 1)] = MSA/MSWCELL,

2. F = [SSG/(w − 1)] / [SSWCELL/pw(n − 1)] = MSG/MSWCELL,

3. F = [SSA×G/(p − 1)(w − 1)] / [SSWCELL/pw(n − 1)] = MSA×G/MSWCELL.

The generalized randomized block design enables a researcher to isolate one nuisance variable, an advantage that it shares with the randomized block design. Furthermore, the design uses the within-cell variation in the pw = (3)(5) = 15 cells to estimate error effects rather than an interaction, as in the randomized block design. Hence, the restrictive sphericity assumption of the randomized block design is replaced with the assumption of homogeneity of within-cell population variances.

Block Designs With Two or More Treatments

The blocking procedure that is used with a randomized block design can be extended to experiments that have two or more treatments, denoted by the letters A, B, C, and so on.

Randomized Block Factorial Design

A randomized block factorial design with two treatments, denoted by A and B, is constructed by crossing the p levels of Treatment A with the q levels of Treatment B. The design's n blocks each contain p × q treatment combinations: a1b1, a1b2, . . . , apbq. The design enables a researcher to isolate variation attributable to one nuisance variable while simultaneously evaluating two treatments and associated interaction.

The layout for the design with p = 2 levels of Treatment A and q = 2 levels of Treatment B is shown in Figure 4. It is apparent from Figure 4 that all the participants are used in simultaneously evaluating the effects of each treatment. Hence, the design permits efficient use of resources because each treatment is evaluated with the same precision as if the entire experiment had been devoted to that treatment alone.

The total SS and total degrees of freedom for a two-treatment randomized block factorial design are partitioned as follows:

SSTOTAL = SSBL + SSA + SSB + SSA×B + SSRESIDUAL
npq − 1 = (n − 1) + (p − 1) + (q − 1) + (p − 1)(q − 1) + (n − 1)(pq − 1).
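The randomized block factorial partition can be checked numerically in the same way as the simpler designs. The sketch below uses hypothetical scores for n = 3 blocks and p = q = 2, so each block contains the pq = 4 treatment combinations; every component is computed directly from the relevant means, and the check confirms that the five components sum to SSTOTAL.

```python
def rbf_partition(data):
    """SS partition for a two-treatment randomized block factorial design.
    data[i][j][k] = score for block i under treatment combination aj bk."""
    n, p, q = len(data), len(data[0]), len(data[0][0])
    all_ijk = [(i, j, k) for i in range(n) for j in range(p) for k in range(q)]
    grand = sum(data[i][j][k] for i, j, k in all_ijk) / (n * p * q)
    bl = [sum(data[i][j][k] for j in range(p) for k in range(q)) / (p * q)
          for i in range(n)]                                         # block means
    a_m = [sum(data[i][j][k] for i in range(n) for k in range(q)) / (n * q)
           for j in range(p)]                                        # A means
    b_m = [sum(data[i][j][k] for i in range(n) for j in range(p)) / (n * p)
           for k in range(q)]                                        # B means
    cell = [[sum(data[i][j][k] for i in range(n)) / n for k in range(q)]
            for j in range(p)]                                       # ajbk means
    ss_bl = p * q * sum((m - grand) ** 2 for m in bl)
    ss_a = n * q * sum((m - grand) ** 2 for m in a_m)
    ss_b = n * p * sum((m - grand) ** 2 for m in b_m)
    ss_ab = n * sum((cell[j][k] - a_m[j] - b_m[k] + grand) ** 2
                    for j in range(p) for k in range(q))
    ss_res = sum((data[i][j][k] - bl[i] - cell[j][k] + grand) ** 2
                 for i, j, k in all_ijk)        # blocks x treatment combinations
    ss_tot = sum((data[i][j][k] - grand) ** 2 for i, j, k in all_ijk)
    return ss_tot, ss_bl, ss_a, ss_b, ss_ab, ss_res

scores = [[[10, 12], [11, 15]],   # hypothetical scores, one sublist per block
          [[9, 11], [10, 13]],
          [[12, 13], [12, 16]]]
ss_tot, ss_bl, ss_a, ss_b, ss_ab, ss_res = rbf_partition(scores)
```

The degrees of freedom partition correspondingly: npq − 1 = (n − 1) + (p − 1) + (q − 1) + (p − 1)(q − 1) + (n − 1)(pq − 1), here 11 = 2 + 1 + 1 + 1 + 6.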
Four null hypotheses can be tested:

1. H0: σ²BL = 0 (variance of the blocks, BL, population means is equal to zero),

2. H0: μ·1· = μ·2· = ··· = μ·p· (Treatment A population means are equal),

3. H0: μ··1 = μ··2 = ··· = μ··q (Treatment B population means are equal),

… are present. The design has another disadvantage: If Treatment A or B has numerous levels, say four or five, the block size becomes prohibitively large. For example, if p = 4 and q = 3, the design has blocks of size 4 × 3 = 12. Obtaining n blocks with 12 matched participants or observing n participants on 12 occasions is often not feasible. A design that reduces the size of the blocks is described next.

Figure 5 Layout for a Two-Treatment, Split-Plot Factorial Design in Which 10 + 10 = 20 Homogeneous Blocks Are Randomly Assigned to the Two Groups
Notes: ajbk denotes a treatment combination (Treat. Comb.); Yijk denotes a measure of the dependent variable (Dep. Var.). Treatment A is confounded with groups. Treatment B and the A × B interaction are not confounded.

… where μijk denotes the ith block, jth level of Treatment A, and kth level of Treatment B. The F statistics are

F = [SSB/(q − 1)] / [SSRESIDUAL/p(n − 1)(q − 1)] = MSB/MSRESIDUAL

and

F = [SSA×B/(p − 1)(q − 1)] / [SSRESIDUAL/p(n − 1)(q − 1)] = MSA×B/MSRESIDUAL.

The split-plot factorial design uses two error terms: MSBL(A) is used to test Treatment A; a different and usually much smaller error term, MSRESIDUAL, is used to test Treatment B and the A × B interaction. Because MSRESIDUAL is generally smaller than MSBL(A), the power of the tests of Treatment B and the A × B interaction is greater than that for Treatment A.

Roger E. Kirk

Further Readings

Dean, A., & Voss, D. (1999). Design and analysis of experiments. New York: Springer-Verlag.
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Kirk, R. E. (2002). Experimental design. In I. B. Weiner (Series Ed.) & J. Schinka & W. F. Velicer (Vol. Eds.), Handbook of psychology: Vol. 2. Research methods in psychology (pp. 3–32). New York: Wiley.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Myers, J. L., & Well, A. D. (2003). Research design and statistical analysis (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
example. This entry also presents applications for the procedure and examines recent research.

Background

The Bonferroni procedure is named after the Italian mathematician Carlo Emilio Bonferroni. Bonferroni's inequality states that the probability of the union of several events can be no greater than the sum of their individual probabilities. In the multiple testing context, the union of the events A1, A2, …, Ak represents the probability that at least one Type I error occurs in the k hypothesis tests. P(Ai) represents the probability of a Type I error in the ith test, and we can label this probability as αi = P(Ai). So Bonferroni's inequality implies that the probability of at least one Type I error occurring in k hypothesis tests is less than or equal to $\sum_{i=1}^{k} \alpha_i$.
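Bonferroni's inequality and the kα bound that follows from it are easy to verify numerically. The short Python sketch below (the per-test α values are illustrative) compares the Bonferroni upper bound with the exact experiment-wise error rate in the special case in which the k tests happen to be independent:

```python
# Bonferroni's inequality: P(at least one Type I error) <= sum of alpha_i.
# Illustrative sketch; the per-test error rates below are hypothetical.

def bonferroni_bound(alphas):
    """Upper bound on the experiment-wise Type I error rate."""
    return sum(alphas)

def fwer_if_independent(alphas):
    """Exact experiment-wise error rate when the k tests are independent:
    1 minus the probability that no test commits a Type I error."""
    p_no_error = 1.0
    for a in alphas:
        p_no_error *= (1.0 - a)
    return 1.0 - p_no_error

alphas = [0.05] * 10                      # k = 10 tests, each at alpha = .05
bound = bonferroni_bound(alphas)          # kα = 0.5
actual = fwer_if_independent(alphas)      # 1 − 0.95**10 ≈ 0.401

print(f"Bonferroni bound: {bound:.3f}")
print(f"Exact rate under independence: {actual:.3f}")
```

Even under independence the exact rate, 1 − (1 − α)^k ≈ .401, already falls below the bound of .50; dependence among the tests pulls the actual rate lower still, which is the source of the procedure's conservatism discussed next.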
If, as is often assumed, all k tests have the same probability of a Type I error, α, then we can conclude that the probability of at least one Type I error occurring in k hypothesis tests is ≤ kα.

Consider an illustration of Bonferroni's inequality in the simple case in which k = 2. Let the two events A1 and A2 have probabilities P(A1) and P(A2), respectively. The sum of the probabilities of the two events is clearly greater than the probability of the union of the two events because the former counts the probability of the intersection of the two events twice, as shown in Figure 1.

The Bonferroni procedure is simple in the sense that a researcher need only know the number of tests to be performed and the probability of a Type I error for those tests in order to construct this upper bound on the experiment-wise error rate. However, as mentioned earlier, the Bonferroni procedure is often criticized for being too conservative. Consider that the researcher does not typically know what the actual Type I error rate is for a given test. Rather, the researcher constructs the test so that the maximum allowable Type I error rate is α. Then the actual Type I error rate may be considerably less than α for any given test.

For example, suppose a test is constructed with a nominal α = .05. Suppose the researcher conducts k = 10 such tests on a given set of data, and the actual Type I error rate for each of the tests is .04. Using the Bonferroni procedure, the researcher concludes that the experiment-wise error rate is at most kα = .50. Yet if the 10 tests were equivalent, the actual probability of at least one Type I error would be only .04; the procedure would thus suggest an upper bound on this experiment-wise probability as .50—overly conservative by 10-fold! It would be unusual for a researcher to conduct k equivalent tests on the same data. However, it would not be unusual for a researcher to conduct k tests and for many of those tests, if not all, to be partially interdependent. The more interdependent the tests are, the smaller the experiment-wise error rate and the more overly conservative the Bonferroni procedure is.

Other procedures have sought to correct for inflation in experiment-wise error rates without being as conservative as the Bonferroni procedure. However, none are as simple to use. These other procedures include the Student–Newman–Keuls, Tukey, and Scheffé procedures, to name a few. Descriptions of these other procedures and their uses can be found in many basic statistical methods textbooks, as well as this encyclopedia.

Example

Consider the case of a researcher studying the effect of three different teaching methods on the average words per minute (μ1, μ2, μ3) at which a student can read. The researcher tests three hypotheses: μ1 = μ2 (vs. μ1 ≠ μ2); μ1 = μ3 (vs. μ1 ≠ μ3); and μ2 = μ3 (vs. μ2 ≠ μ3). Each test is conducted at a nominal level, α0 = .05, resulting in a comparison-wise error rate of αc = .05 for each test. Denote A1,
A2, and A3 as the events of falsely rejecting the null hypotheses 1, 2, and 3, respectively, and denote p1, p2, and p3 the probabilities of events A1, A2, and A3, respectively. These would be the individual p values for these tests. It may be assumed that some dependence exists among the three events, A1, A2, and A3, principally because the events are all based on data collected from a single study. Consequently, the experiment-wise error rate, the probability of falsely rejecting any of the three null hypotheses, is at least equal to αe = .05 but potentially as large as .05 × 3 = .15. For this reason, we may apply the Bonferroni procedure by dividing our nominal level of α0 = .05 by k = 3 to obtain α0 = .0167. Then, rather than comparing the p values p1, p2, and p3 to α0 = .05, we compare them to α0 = .0167. The experiment-wise error rate is therefore adjusted down so that it is less than or equal to the original intended nominal level of α0 = .05.

It should be noted that although the Bonferroni procedure is often used in the comparison of multiple means, because the adjustment is made to the nominal level, α0, or to the test's resulting p value, the multiple tests could be hypothesis tests of any population parameters based on any probability distributions. So, for example, one experiment could involve a hypothesis test regarding a mean and another hypothesis test regarding a variance, and an adjustment based on k = 2 could be made to the two tests to maintain the experiment-wise error rate at the nominal level.

Applications

As noted above, the Bonferroni procedure is used primarily to control the overall α level (i.e., the experiment-wise level) when multiple tests are being performed. Many statistical procedures have been developed at least partially for this purpose; however, most of those procedures have applications exclusively in the context of making multiple comparisons of group means after finding a significant ANOVA result. While the Bonferroni procedure can also be used in this context, one of its advantages over other such procedures is that it can also be used in other multiple testing situations that do not initially entail an omnibus test such as ANOVA.

For example, although most statistics texts do not advocate using a Bonferroni adjustment when testing beta coefficients in a multiple regression analysis, it has been shown that the overall Type I error rate in such an analysis involving as few as eight regression coefficients can exceed .30, resulting in almost a 1 in 3 chance of falsely rejecting a null hypothesis. Using a Bonferroni adjustment when one is conducting these tests would control that overall Type I error rate. Similar adjustments can be used to test for main effects and interactions in ANOVA and multivariate ANOVA designs because all that is required to make the adjustment is that the researcher knows the number of tests being performed. The Bonferroni adjustment has been used to adjust the experiment-wise Type I error rate for multiple tests in a variety of disciplines, such as medical, educational, and psychological research, to name a few.

Recent Research

One of the main criticisms of the Bonferroni procedure is the fact that it overcorrects the overall Type I error rate, which results in lower statistical power. Many modifications to this procedure have been proposed over the years to try to alleviate this problem. Most of these proposed alternatives can be classified either as step-down procedures (e.g., the Holm method), which test the most significant (and, therefore, smallest) p value first, or step-up procedures (e.g., the Hochberg method), which begin testing with the least significant (and largest) p value. With each of these procedures, although the tests are all being conducted concurrently, each hypothesis is not tested at the same time or at the same level of significance.

More recent research has attempted to find a divisor between 1 and k that would protect the overall Type I error rate at or below the nominal .05 level but closer to that nominal level so as to have a lesser effect on the power to detect actual differences. This attempt was based on the premise that making no adjustment to the α level is too liberal an approach (inflating the experiment-wise error rate), and dividing by the number of tests, k, is too conservative (overadjusting that error rate). It was shown that the optimal divisor is directly determined by the proportion of nonsignificant differences or relationships in the multiple tests being performed. Based on this result, a divisor of
k(1 − q), where q = the proportion of nonsignificant tests, did the best job of protecting against Type I errors without sacrificing as much power. Unfortunately, researchers often do not know, a priori, the number of nonsignificant tests that will occur in the collection of tests being performed. Consequently, research has also shown that a practical choice of the divisor is k/1.5 (rounded to the nearest integer) when the number of tests is greater than three. This modified Bonferroni adjustment will outperform alternatives in keeping the experiment-wise error rate at or below the nominal .05 level and will have higher power than other commonly used adjustments.

Jamis J. Perrett and Daniel J. Mundfrom

See also Analysis of Variance (ANOVA); Hypothesis; Multiple Comparison Tests; Newman–Keuls Test and Tukey Test; Scheffé Test

Further Readings

Bain, L. J., & Engelhardt, M. (1992). Introduction to probability and mathematical statistics (2nd ed.). Boston: PWS-Kent.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4), 800–802.
Holland, B. S., & Copenhaver, M. D. (1987). An improved sequentially rejective Bonferroni test procedure. Biometrics, 43, 417–423.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75(2), 383–386.
Mundfrom, D. J., Perrett, J., Schaffer, J., Piccone, A., & Roozeboom, M. A. (2006). Bonferroni adjustments in tests for regression coefficients. Multiple Linear Regression Viewpoints, 32(1), 1–6.
Rom, D. M. (1990). A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77(3), 663–665.
Roozeboom, M. A., Mundfrom, D. J., & Perrett, J. (2008, August). A single-step modified Bonferroni procedure for multiple tests. Paper presented at the Joint Statistical Meetings, Denver, CO.
Shaffer, J. P. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395), 826–831.
Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3), 751–754.
Westfall, P. H., Tobias, R. D., Rom, D., Wolfinger, R. D., & Hochberg, Y. (1999). Multiple comparisons and multiple tests using SAS. Cary, NC: SAS Institute.

BOOTSTRAPPING

The bootstrap is a computer-based statistical technique that is used to obtain measures of precision of parameter estimates. Although the technique is sufficiently general to be used in time-series analysis, permutation tests, cross-validation, nonlinear regression, and cluster analysis, its most common use is to compute standard errors and confidence intervals. Introduced by Bradley Efron in 1979, the procedure itself belongs in a broader class of estimators that use sampling techniques to create empirical distributions by resampling from the original data set. The goal of the procedure is to produce estimates for quantities whose analytic expressions are difficult to calculate mathematically. The name itself derives from the popular story in which Baron von Munchausen (after whom Munchausen syndrome is also named) was stuck at the bottom of a lake with no alternative but to grab his own bootstraps and pull himself to the surface. In a similar sense, when a closed-form mathematical solution is not easy to calculate, the researcher has no alternative but to "pull himself or herself up by the bootstraps" by employing such resampling techniques. This entry explores the basic principles and procedures of bootstrapping and examines its other applications and limitations.

Basic Principles and Estimation Procedures

The fundamental principle on which the procedure is based is the belief that under certain general conditions, the relationship between a bootstrapped estimator and a parameter estimate should be similar to the relationship between the parameter estimate and the unknown population parameter of interest. As a means of better understanding the origins of this belief, Peter Hall suggested a valuable visual: a nested Russian doll. According to Hall's thought experiment, a researcher is interested in
determining the number of freckles present on the outermost doll. However, the researcher is not able to directly observe the outermost doll and instead can only directly observe the inner dolls, all of which resemble the outer doll, but because of their successively smaller size, each possesses successively fewer freckles. The question facing the researcher then is how best to use information from the observable inner dolls to draw conclusions about the likely number of freckles present on the outermost doll. To see how this works, assume for simplicity that the Russian doll set consists of three parts, the outermost doll and two inner dolls. In this case, the outermost doll can be thought of as the population, which is assumed to possess n0 freckles; the second doll can be thought of as the original sample, which is assumed to possess n1 freckles; and the third doll can be thought of as the bootstrap sample, which is assumed to possess n2 freckles. A first guess in this situation might be to use the observed number of freckles on the second doll as the best estimate of the likely number of freckles on the outermost doll. Such an estimator will necessarily be biased, however, because the second doll is smaller than the outermost doll and necessarily possesses a smaller number of freckles. In other words, employing n1 as an estimate of n0 necessarily underestimates the true number of freckles on the outermost doll. This is where the bootstrapped estimator, n2, reveals its true value. Because the third doll is smaller than the second doll by an amount similar to that by which the second doll is smaller than the outermost doll, the ratio of the number of freckles on the two inner dolls, n1 : n2, should be a close approximation of the ratio of the number of freckles on the second doll to the number on the outer doll, n0 : n1. This, in a nutshell, is the principle underlying the bootstrap procedure.

More formally, the nonparametric bootstrap derives from the empirical distribution function, F̂, based on a random sample of size n from a probability distribution F. The estimator, θ̂, of the population parameter θ is defined as some function of the random sample (X1, X2, …, Xn). The objective of the bootstrap is to assess the accuracy of the estimator, θ̂. The bootstrap principle described above states that the relationship between θ̂ and θ should be mimicked by that between θb and θ̂, where θb is the bootstrap estimator from bootstrap samples. In practice, bootstrap samples are obtained by a Monte Carlo procedure that draws (with replacement) multiple random samples of size n from the initial sample data set, calculating the parameter of interest for the sample drawn, say θb, and repeating the process k times. Hence, the bootstrap technique allows researchers to generate an estimated sampling distribution in cases in which they have access to only a single sample rather than the entire population. A minimum value for k is typically assumed to be 100 and can be as many as 10,000, depending on the application.

Peter Bickel and David Freedman defined the following three necessary conditions if the bootstrap is to provide consistent estimates of the asymptotic distribution of a parameter: (1) The statistic being bootstrapped must converge weakly to an asymptotic distribution whenever the data-generating distribution is in a neighborhood of the truth; in other words, the convergence still occurs if the truth is allowed to change within the neighborhood as the sample size grows. (2) The convergence to the asymptotic distribution must be uniform in that neighborhood. (3) The asymptotic distribution must depend on the data-generating process in a continuous way. If all three conditions hold, then the bootstrap should provide reliable estimates in many different applications.

As a concrete example, assume that we wish to obtain the standard error of the median value for a sample of 30 incomes. The researcher needs to create 100 bootstrap samples because this is the generally agreed-on number of replications needed to compute a standard error. The easiest way to sample with replacement is to take the one data set and copy it 500 times, in order to guarantee that each observation has an equal likelihood of being chosen in each bootstrap sample. The researcher then assigns random numbers to each of the 15,000 observations (500 × 30) and sorts the observations by their random number assignments from lowest to highest. The next step is to make 100 bootstrap samples of 30 observations each and disregard the other 12,000 observations. After the 100 bootstrap samples have been made, the median is calculated from each of the samples, and the bootstrap estimate of the standard error is just the standard deviation of the 100 bootstrapped medians. Although this procedure may seem complicated, it
is actually relatively easy to write a bootstrapping program with the use of almost any modern statistical program, and in fact, many statistical programs include a bootstrap command.

Besides generating standard error estimates, the bootstrap is commonly used to directly estimate confidence intervals in cases in which they would otherwise be difficult to produce. Although a number of different bootstrapping approaches exist for computing confidence intervals, the following discussion focuses on two of the most popular. The first, called the percentile method, is straightforward and easy to implement. For illustration purposes, assume that the researcher wishes to obtain a 90% confidence interval. To do so, the researcher would (a) start by obtaining 1,000 bootstrap samples and the resulting 1,000 bootstrap estimates, θ̂b, and (b) order the 1,000 observed estimates from the smallest to the largest. The 90% confidence interval would then consist of the specific bootstrap estimates falling at the 5th and the 95th percentiles of the sorted distribution. This method typically works well for large sample sizes because the bootstrap mimics the sampling distribution, but it does not work well for small sample sizes. If the number of observations in the sample is small, Bradley Efron and Robert Tibshirani have suggested using a bias correction factor.

The second approach, called the bootstrap t confidence interval, is more complicated than the percentile method, but it is also more accurate. To understand this method, it is useful to review a standard confidence interval, which is defined as

$$[\hat{\theta} - t_{\alpha/2,df}\, s.e.(\hat{\theta}),\ \hat{\theta} + t_{\alpha/2,df}\, s.e.(\hat{\theta})],$$

where θ̂ is the estimate, t(α/2, df) is the critical value from the t table with df degrees of freedom for a (1 − α) confidence interval, and s.e.(θ̂) is the standard error of the estimate. The idea behind the bootstrap t interval is that the critical value is found through bootstrapping instead of simply reading the value contained in a published table. Specifically, the bootstrap t is defined as T^boot = (θ̂^boot − θ̂)/S^boot, where θ̂^boot is the estimate of θ from a bootstrap sample and S^boot is an estimate of the standard deviation of θ̂ from the bootstrap sample. The k values of T^boot are then ordered from lowest to highest, and, for a 90% confidence interval, the value at the 5th percentile is the lower critical value and the value at the 95th percentile is the higher critical value. Thus the bootstrapped t interval is

$$[\hat{\theta} - T^{boot}_{.95}\, s.e.(\hat{\theta}),\ \hat{\theta} - T^{boot}_{.05}\, s.e.(\hat{\theta})].$$

Michael Chernick has pointed out that the biggest drawback of this method is that it is not always obvious how to compute the standard errors, S^boot and s.e.(θ̂).

Other Applications

In addition to calculating such measures of precision, the bootstrap procedure has gained favor for a number of other applications. For one, the bootstrap is now popular as a method for performing bias reduction. Bias reduction can be explained as follows. The bias of an estimator is the difference between the expected value of the estimator, E(θ̂), and the true value of the parameter, θ, or E(θ̂ − θ). If an estimator is biased, then this value is nonzero, and the estimator is wrong on average. In the case of such a biased estimator, the bootstrap principle is employed such that the bias is estimated by taking the average of the difference between the bootstrap estimate, θ̂b, and the estimate from the initial sample, θ̂, over the k different bootstrap estimates. Efron defined the bias of the bootstrap as E(θ̂ − θ̂b) and suggested reducing the bias of the original estimator, θ̂, by adding the estimated bias. This technique produces an estimator that is close to unbiased.

Recently, the bootstrap has also become popular in different types of regression analysis, including linear regression, nonlinear regression, time-series analysis, and forecasting. With linear regression, the researcher can either bootstrap the residuals from the fitted model or bootstrap the vector of the dependent and independent variables. If the error terms are not normal and the sample size is small, then the researcher is able to obtain bootstrapped confidence intervals, like the one described above, instead of relying on asymptotic theory that likely does not apply. In nonlinear regression analysis, the bootstrap is a very useful tool because there is no need to differentiate and an analytic expression is not necessary.
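The standard-error and percentile-interval computations described earlier in this entry can be sketched in a few lines of Python. The income values below are simulated placeholders (the entry gives no actual data), and resampling is done directly with replacement rather than by the copy-and-shuffle device described in the text; 1,000 replications are used, as in the percentile-method illustration:

```python
import random
import statistics

def bootstrap_medians(data, k, rng):
    """Draw k bootstrap samples (with replacement, same size as the
    original data) and return the median of each."""
    n = len(data)
    return [statistics.median(rng.choices(data, k=n)) for _ in range(k)]

rng = random.Random(42)                                      # fixed seed for reproducibility
incomes = [rng.lognormvariate(10, 0.5) for _ in range(30)]   # hypothetical sample of 30 incomes

meds = bootstrap_medians(incomes, k=1000, rng=rng)

# Bootstrap estimate of the standard error: the standard deviation
# of the bootstrapped medians.
se = statistics.stdev(meds)

# Percentile-method 90% confidence interval: the bootstrap estimates
# at the 5th and 95th percentiles of the sorted distribution.
meds.sort()
lower = meds[int(0.05 * len(meds))]
upper = meds[int(0.95 * len(meds)) - 1]

print(f"bootstrap SE of the median: {se:.1f}")
print(f"90% percentile interval: ({lower:.1f}, {upper:.1f})")
```

Swapping `statistics.median` for any other statistic turns the same loop into a standard-error estimator for that statistic, which is the generality the entry emphasizes.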
104 Box-and-Whisker Plot
box plot, and a discussion of the appropriate uses of a box plot.

History

A box plot is one example of a graphical technique used within exploratory data analysis (EDA). EDA is a statistical method used to explore and understand data from several angles in social science research. EDA grew out of work by John Tukey and his associates in the 1960s and was developed to broadly understand the data, graphically represent data, generate hypotheses and build models to guide research, add robust measures to an analysis, and aid the researcher in finding the most appropriate method for analysis. EDA is especially helpful when the researcher is interested in identifying any unexpected or misleading patterns in the data. Although there are many forms of EDA, researchers must employ the most appropriate form given the specific procedure's purpose and use.

Definition and Construction

One of the first steps in any statistical analysis is to describe the central tendency and the variability of the values for each variable included in the analysis. The researcher seeks to understand the center of the distribution of values for a given variable (central tendency) and how the rest of the values fall in relation to the center (variability). Box plots are used to visually display variable distributions through the display of robust statistics, that is, statistics that are more resistant to the presence of outliers in the data set. Although there are somewhat different ways to construct box plots, depending on the way in which the researcher wants to display outliers, a box plot always provides a visual display of the five-number summary. The median is defined as the value that falls in the middle after the values for the selected variable are ordered from lowest to highest value, and it is represented as a line in the middle of the rectangle within a box plot. As it is the central value, 50% of the data lie above the median and 50% lie below it. When the distribution contains an odd number of values, the median represents an actual value in the distribution. When the distribution contains an even number of values, the median represents an average of the two middle values.

To create the rectangle (or box) associated with a box plot, one must determine the 1st and 3rd quartiles, which represent values (along with the median) that divide all the values into four sections, each including approximately 25% of the values. The 1st (lower) quartile (Q1) represents a value that divides the lower 50% of the values (those below the median) into two equal sections, and the 3rd (upper) quartile (Q3) represents a value that divides the upper 50% of the values (those above the median) into two equal sections. As with calculating the median, quartiles may represent the average of two values when the number of values below and above the median is even. The rectangle of a box plot is drawn such that it extends from the 1st quartile through the 3rd quartile and thereby represents the interquartile range (IQR; the distance between the 1st and 3rd quartiles). The rectangle includes the median.

In order to draw the "whiskers" (i.e., lines extending from the box), one must identify fences, or values that represent the minimum and maximum values that would not be considered outliers. Typically, fences are calculated as Q1 − 1.5 IQR (lower fence) and Q3 + 1.5 IQR (upper fence). Whiskers are lines drawn by connecting the most extreme values that fall within the fences to the lines representing Q1 and Q3. Any value that is greater than the upper fence or lower than the lower fence is considered an outlier and is displayed as a special symbol beyond the whiskers. Outliers that extend beyond the fences are typically considered mild outliers on the box plot. An extreme outlier (i.e., one that is located beyond 3 times the length of the IQR from the 1st quartile, if a low outlier, or the 3rd quartile, if a high outlier) may be indicated by a different symbol. Figure 1 provides an illustration of a box plot.

Box plots can be created in either a vertical or a horizontal direction. (In this entry, a vertical box plot is generally assumed for consistency.) They can often be very helpful when one is attempting to compare the distributions of two or more data sets or variables on the same scale, in which case they can be constructed side by side to facilitate comparison.
Figure 1 Box Plot Created With a Data Set and SPSS (an IBM company, formerly called PASW® Statistics)
Notes: Data set values: 2.0, 2.0, 2.0, 3.0, 3.0, 5.0, 6.0, 6.0, 7.0, 7.0, 8.0, 8.0, 9.0, 10.0, 22.0. Defining features of this box plot: Median = 6.0; First (lower) quartile = 3.0; Third (upper) quartile = 8.0; Interquartile range (IQR) = 5.0; Lower inner fence = −4.5; Upper inner fence = 15.5; Range = 20.0; Mild outlier = 22.0.
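The five-number quantities and fences reported in the figure note follow directly from the definitions in this entry. A minimal Python sketch, with quartiles computed as the median of each half of the ordered data, matching the description above:

```python
import statistics

def five_number_and_fences(values):
    """Median, quartiles (median of each half), IQR, and 1.5 * IQR fences."""
    xs = sorted(values)
    n = len(xs)
    median = statistics.median(xs)
    lower_half = xs[: n // 2]            # values below the median
    upper_half = xs[(n + 1) // 2 :]      # values above the median
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }

data = [2.0, 2.0, 2.0, 3.0, 3.0, 5.0, 6.0, 6.0, 7.0,
        7.0, 8.0, 8.0, 9.0, 10.0, 22.0]
summary = five_number_and_fences(data)

# Values beyond a fence are outliers; here only 22.0 qualifies, and it is
# mild because it lies within 3 * IQR of the upper quartile.
outliers = [x for x in data
            if x < summary["lower_fence"] or x > summary["upper_fence"]]
print(summary)   # median 6.0, q1 3.0, q3 8.0, fences -4.5 and 15.5
print(outliers)  # [22.0]
```

Note that other software may use slightly different quartile conventions (interpolated percentiles, for example), so fence values can differ at the margins from those shown here.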
Steps to Creating a Box Plot

The following six steps are used to create a vertical box plot:

1. Order the values within the data set from smallest to largest and calculate the median, lower quartile (Q1), upper quartile (Q3), and minimum and maximum values.
2. Calculate the IQR.
3. Determine the lower and upper fences.
4. Using a number line or graph, draw a box to mark the location of the 1st and 3rd quartiles.
when using a box plot to display distributions. Box plots provide a good visualization of the range and potential skewness of the data. A box plot may provide the first step in exploring unexpected patterns in the distribution because box plots give a good indication of how the data are distributed around the median. Box plots also clearly mark the location of mild and extreme outliers in the distribution. Other forms of graphical representation that graph individual values, such as dot plots, may not make this clear distinction. When used appropriately, box plots are useful in comparing more than one sample distribution side by side. In other forms of data analysis, a researcher may choose to compare data sets using a t test to compare means or an F test to compare variances. However, these methods are more vulnerable to skewness in the presence of extreme values. These methods must also meet normality and equal variance assumptions. Alternatively, box plots can compare the differences between variable distributions without the need to meet certain statistical assumptions.

However, unlike other forms of EDA, box plots show less detail than a researcher may need. For one, box plots may display only the five-number summary. They do not provide frequency measures or the quantitative measures of variance and standard deviation. Second, box plots are not used in a way that allows the researcher to compare the data with a normal distribution, which stem plots and histograms do allow. Finally, box plots would not be appropriate to use with a small sample size because of the difficulty in detecting outliers and finding patterns in the distribution.

Besides taking into account the advantages and disadvantages of using a box plot, one should consider a few precautions. In a 1990 study conducted by John T. Behrens and colleagues, participants frequently made judgment errors in determining the length of the box or whiskers of a box plot. In part of the study, participants were asked to judge the length of the box by using the whisker as a judgment standard. When the whisker length was longer than the box length, the participants tended to overestimate the length of the box. When the whisker length was shorter than the box length, the participants tended to underestimate the length of the box. The same result was found when the participants judged the length of the whisker by using the box length as a judgment standard. The study also found that compared with vertical box plots, box plots positioned horizontally were associated with fewer judgment errors.

Sara C. Lewandowski and Sara E. Bolt

See also Exploratory Data Analysis; Histogram; Outlier

Further Readings

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131–160.
Behrens, J. T., Stock, W. A., & Sedgwick, C. E. (1990). Judgment errors in elementary box-plot displays. Communications in Statistics B: Simulation and Computation, 19, 245–262.
Frigge, M., Hoaglin, D. C., & Iglewicz, B. (1989). Some implementations of the boxplot. American Statistician, 43, 50–54.
Massart, D. L., Smeyers-Verbeke, J., Capron, X., & Schlesier, K. (2005). Visual presentation of data by means of box plots. LCGC Europe, 18, 215–218.
Moore, D. S. (2001). Statistics: Concepts and controversies (5th ed.). New York: W. H. Freeman.
Moore, D. S., & McCabe, G. P. (1998). Introduction to the practice of statistics (3rd ed.). New York: W. H. Freeman.
Ott, R. L., & Longnecker, M. (2001). An introduction to statistical methods and data analysis (5th ed.). Pacific Grove, CA: Wadsworth.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

b PARAMETER

The b parameter is an item response theory (IRT)–based index of item difficulty. As IRT models have become an increasingly common way of modeling item response data, the b parameter has become a popular way of characterizing the difficulty of an individual item, as well as comparing the relative difficulty levels of different items. This entry addresses the b parameter with regard to different IRT models. Further, it discusses interpreting, estimating, and studying the b parameter.
Figure 1 Item Characteristic Curves for Example Items, One-Parameter Logistic (1PL or Rasch), Two-Parameter Logistic (2PL), and Three-Parameter Logistic (3PL) Models
where θj represents an ability-level (or trait-level) parameter of the examinee. An interpretation of the b parameter follows from its being attached to the same metric as that assigned to θ. Usually this metric is continuous and unbounded; the indeterminacy of the metric is often handled by assigning either the mean of θ (across examinees) or b (across items) to 0. Commonly b parameters will assume values between −3 and 3, with more extreme positive values representing more difficult (or infrequently endorsed) items, and more extreme negative values representing easier (or more frequently endorsed) items. While the same general interpretation of the b parameter as a difficulty parameter still applies under the 2PL and 3PL models, the discrimination and lower asymptote parameters also contribute to the likelihood of a correct response at a given ability level.

Interpretation of the b Parameter

Figure 1 provides an illustration of the b parameter with respect to the 1PL, 2PL, and 3PL models. In
110 b Parameter
this figure, item characteristic curves (ICCs) for a fundamental role in how important measure-
three example items are shown with respect to ment applications, such as item bias (differential
each model. Each curve represents the probability item functioning), test equating, and appropriate-
of a correct response as a function of the latent ness measurement, are conducted and evaluated in
ability level of the examinee. Across all three mod- an IRT framework.
els, it can be generally seen that as the b parameter
increases, the ICC tends to decrease, implying
a lower probability of correct response. Estimating the b Parameter
In the 1PL and 2PL models, the b parameter The b parameter is often characterized as a struc-
has the interpretation of representing the level of tural parameter within an IRT model and as such
the ability or trait at which the respondent has will generally be estimated in the process of fitting
a .50 probability of answering correctly (endorsing an IRT model to item response data. Various esti-
the item). For each of the models, the b mation strategies have been proposed and investi-
parameter also identifies the ability level that cor- gated, some being more appropriate for certain
responds to the inflection point of the ICC, and model types. Under the 1PL model, conditional
thus the b parameter can be viewed as determining maximum likelihood procedures are common. For
the ability level at which the item is maximally all three model types, marginal maximum likeli-
informative. Consequently, the b parameter is hood, joint maximum likelihood, and Bayesian
a critical element in determining where along the estimation procedures have been developed and
ability continuum an item provides its most effec- are also commonly used.
tive estimation of ability, and thus the parameter
has a strong influence on how items are selected
when administered adaptively, such as in a comput- Studying the b Parameter
erized adaptive testing environment.
Under the 1PL model, the b parameter effec- The b parameter can also be the focus of further
tively orders all items from easiest to hardest, and analysis. Models such as the linear logistic test
this ordering is the same regardless of the exam- model and its variants attempt to relate the
inee ability or trait level. This property is no longer b parameter to task components within an item
present in the 2PL and 3PL models, as the ICCs of that account for its difficulty. Such models also
items may cross, implying a different ordering of provide a way in which the b parameter’s estimates
item difficulties at different ability levels. This of items can ultimately be used to validate a test
property can also be seen in the example items in instrument. When the b parameter assumes the
Figure 1 in which the ICCs cross for the 2PL and value expected given an item’s known task compo-
3PL models, but not for the 1PL model. Conse- nents, the parameter provides evidence that the
quently, while the b parameter remains the key item is functioning as intended by the item writer.
factor in influencing the difficulty of the item, it is
Daniel Bolt
not the sole determinant.
An appealing aspect of the b parameter for all See also Differential Item Functioning; Item Analysis;
IRT models is that its interpretation is invariant Item Response Theory; Parameters; Validity of
with respect to examinee ability or trait level. That Measurement
is, its value provides a consistent indicator of item
difficulty whether considered for a population of
high, medium, or low ability. This property is not Further Readings
present in more classical measures of item diffi-
De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory
culty (e.g., ‘‘proportion correct’’), which are influ- item response models: A generalized linear and
enced not only by the difficulty of the item, but nonlinear approach. New York: Springer.
also by the distribution of ability in the population Embretson, S. E., & Reise, S. P. (2000). Item response
in which they are administered. This invariance theory for psychologists. Mahwah, NJ: Lawrence
property allows the b parameter to play Erlbaum.
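The properties discussed in this entry, b as the .50-probability point under the 1PL and 2PL, and crossing ICCs once discriminations differ, can be illustrated with a short numerical sketch. The function and the item parameter values below are illustrative assumptions, not taken from the entry:

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Three-parameter logistic ICC; a=1, c=0 reduces it to the 1PL (Rasch) curve."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With c = 0 (1PL/2PL), an examinee whose theta equals b has exactly a .50
# probability of a correct response, whatever the discrimination.
print(icc(theta=0.5, b=0.5, a=1.7))  # 0.5

# Two 2PL items with different discriminations can cross: the "easier" steep
# item (b = -0.5) falls below the "harder" flat item (b = 0.5) at low theta,
# so the difficulty ordering differs across ability levels.
easy_steep = lambda t: icc(t, b=-0.5, a=2.0)
hard_flat = lambda t: icc(t, b=0.5, a=0.5)
print(easy_steep(-2.0) < hard_flat(-2.0))  # True
print(easy_steep(2.0) > hard_flat(2.0))    # True
```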
C
CANONICAL CORRELATION ANALYSIS

Canonical correlation analysis (CCA) is a multivariate statistical method that analyzes the relationship between two sets of variables, in which each set contains at least two variables. It is the most general type of the general linear model, with multiple regression, multivariate analysis of variance, analysis of variance, and discriminant function analysis all being special cases of CCA.

Although the method has been available for more than 70 years, its use has been somewhat limited until fairly recently due to its lack of inclusion in common statistical programs and its rather labor-intensive calculations. Currently, however, many computer programs do include CCA, and thus the method has become somewhat more widely used.

This entry begins by explaining the basic logic of and defining important terms associated with CCA. Next, this entry discusses the interpretation of CCA results, statistical assumptions, and limitations of CCA. Last, it provides an example from the literature.

Basic Logic

The logic of CCA is fairly straightforward and can be explained best by likening it to a "multiple-multiple regression." That is, in multiple regression a researcher is interested in discovering the variables (among a set of variables) that best predict a single variable. The set of variables may be termed the independent, or predictor, variables; the single variable may be considered the dependent, or criterion, variable. CCA is similar, except that there are multiple dependent variables, as well as multiple independent variables. The goal is to discover the pattern of variables (on both sides of the equation) that combine to produce the highest predictive values for both sets. The resulting combination of variables for each side, then, may be thought of as a kind of latent or underlying variable that describes the relation between the two sets of variables.

A simple example from the literature illustrates its use: A researcher is interested in investigating the relationships among gender, social dominance orientation, right wing authoritarianism, and three forms of prejudice (stereotyping, opposition to equality, and negative affect). Gender, social dominance orientation, and right wing authoritarianism constitute the predictor set; the three forms of prejudice are the criterion set. Rather than computing three separate multiple regression analyses (viz., the three predictor variables regressing onto one criterion variable, one at a time), the researcher instead computes a CCA on the two sets of variables to discern the most important predictor(s) of the three forms of prejudice. In this example, the CCA revealed that social dominance orientation emerged as the overall most important dimension that underlies all three forms of prejudice.
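This "multiple-multiple regression" logic can be sketched numerically: the first canonical correlation is the largest correlation attainable between a linear combination of one set and a linear combination of the other. The sketch below uses simulated data and assumes NumPy; the QR-then-SVD computation is one standard way to obtain canonical correlations, not the entry's own procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two variable sets measured on the same 200 cases; both sets share one
# underlying signal, so a sizable canonical correlation should emerge.
n = 200
signal = rng.normal(size=n)
X = np.column_stack([signal + rng.normal(size=n) for _ in range(3)])
Y = np.column_stack([signal + rng.normal(size=n) for _ in range(2)])

def first_canonical_correlation(X, Y):
    """Largest correlation between linear combinations a'X and b'Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormalize each centered set; the singular values of Qx'Qy are
    # then the canonical correlations, largest first.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return s[0]

rc = first_canonical_correlation(X, Y)
print(round(rc, 2))  # sizable, since the sets share one latent signal
```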
Outliers (that is, data points that are well "outside" a particular distribution of scores) can also significantly attenuate correlation, and their occurrence should be minimized or, if possible, eliminated.

For conventional CCA, linear relationships among variables are required, although CCA algorithms for nonlinear relationships are currently available. And, as with other correlation-based analyses, low multicollinearity among variables is assumed. Multicollinearity occurs when variables in a correlation matrix are highly correlated with each other, a condition that reflects too much redundancy among the variables. As with measurement error, high multicollinearity also reduces the magnitudes of correlation coefficients.

Stable (i.e., reliable) canonical correlations are more likely to be obtained if the sample size is large. It is generally recommended that adequate samples should range from 10 to 20 cases per variable. So, for example, if there are 10 variables, one should strive for a minimum of 100 cases (or participants). In general, the smaller the sample size, the more unstable the CCA.

Limitation

The main limitation associated with CCA is that it is often difficult to interpret the meaning of the resulting canonical variates. As others have noted, a mathematical procedure that maximizes correlations may not necessarily yield a solution that is maximally interpretable. This is a serious limitation and may be the most important reason CCA is not used more often. Moreover, given that it is a descriptive technique, the problems with interpreting the meaning of the canonical correlation and associated variates are particularly troublesome. For example, suppose a medical sociologist found that low unemployment level, high educational level, and high crime rate (the predictor variables) are associated with good medical outcomes, low medical compliance, and high medical expenses (the criterion variables). What might this pattern mean? Perhaps people who are employed, educated, and live in high crime areas have expensive, but good, health care, although they do not comply with doctors' orders. However, compared with other multivariate techniques, with CCA there appears to be greater difficulty in extracting the meaning (i.e., the latent variable) from the obtained results.

Example in the Literature

In the following example, drawn from the adolescent psychopathology literature, a CCA was performed on two sets of variables, one (the predictor set) consisting of personality pattern scales of the Millon Adolescent Clinical Inventory, the other (the criterion set) consisting of various mental disorder scales from the Adolescent Psychopathology Scale. The goal of the study was to discover whether the two sets of variables were significantly related (i.e., would there exist a multivariate relationship?), and if so, how might these two sets be related (i.e., how might certain personality styles be related to certain types of mental disorders?).

The first step in interpretation of a CCA is to present the significant variates. In the above example, four significant canonical variates emerged; however, only the first two substantially contributed to the total amount of variance (together accounting for 86% of the total variance), and thus only those two were interpreted. Second, the structure coefficients (canonical loadings) for the two significant variates were described. Structure coefficients are presented in order of absolute magnitude and are always interpreted as a pair. By convention, only coefficients greater than .40 are interpreted. In this case, for the first canonical variate, the predictor set accounted for 48% (Rc²) of the variance in the criterion set. Further examination of the canonical loadings for this variate showed that the predictor set was represented mostly by the Conformity subscale (factor loading of .92) and was related to lower levels of mental disorder symptoms. The structure coefficients for the second canonical variate were then interpreted in a similar manner.

The last, and often most difficult, step in CCA is to interpret the overall meaning of the analysis. Somewhat similar to factor analysis interpretation, latent variables are inferred from the pattern of the structure coefficients for the variates. A possible interpretation of the above example might be that an outgoing conformist personality style is predictive of overall better mental health for adolescents at risk for psychopathology.
CASE-ONLY DESIGN
more efficient, precise, and powerful compared with a traditional case–control method.

Although the case-only design was originally created to improve the efficiency, power, and precision of the study of the gene–environment interactions by examining the prevalence of a specific genotype among case subjects only, it is now used to investigate how some other basic characteristics that vary slightly (or never vary) over time (e.g., gender, ethnicity, marital status, social and economic status) modify the effect of a time-dependent exposure (e.g., air pollution, extreme temperatures) on the outcome (e.g., myocardial infarction, death) in a group of cases only (e.g., decedents).

To avoid misinterpretation and bias, some technical assumptions should be taken into account in case-only studies. The assumption of independence between the susceptibility genotypes and the environmental exposures of interest in the population is the most important one that must be considered in conducting these studies. In practice, this assumption may be violated by some confounding factors (e.g., age, ethnic groups) if both exposure and genotype are affected. This assumption can be tested by some statistical methods. Some other technical considerations also must be assumed in the application of the case-only model in various studies of genetic factors. More details of these assumptions and the assessment of the gene–environment interaction in case-only studies can be found elsewhere.

Saeed Dastgiri

See also Ethics in the Research Process

Further Readings

Begg, C. B., & Zhang, Z. F. (1994). Statistical analysis of molecular epidemiology studies employing case series. Cancer Epidemiology, Biomarkers & Prevention, 3, 173–175.
Chatterjee, N., Kalaylioglu, Z., Shih, J. H., & Gail, M. H. (2006). Case-control and case-only designs with genotype and family history data: Estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics, 62, 36–48.
Cheng, K. F. (2007). Analysis of case-only studies accounting for genotyping error. Annals of Human Genetics, 71, 238–248.
Greenland, S. (1993). Basic problems in interaction assessment. Environmental Health Perspectives, 101(Suppl. 4), 59–66.
Khoury, M. J., & Flanders, W. D. (1996). Nontraditional epidemiologic approaches in the analysis of gene–environment interaction: Case-control studies with no controls! American Journal of Epidemiology, 144, 207–213.
Smeeth, L., Donnan, P. T., & Cook, D. G. (2006). The use of primary care databases: Case-control and case-only designs. Family Practice, 23, 597–604.

CASE STUDY

Case study research is a versatile approach to research in social and behavioral sciences. Case studies consist of detailed inquiry into a bounded entity or unit (or entities) in which the researcher either examines a relevant issue or reveals phenomena through the process of examining the entity within its social and cultural context. Case study has gained in popularity in recent years. However, it is difficult to define because researchers view it alternatively as a research design, an approach, a method, or even an outcome. This entry examines case study through different lenses to uncover the versatility in this type of research approach.

Overview

Case study researchers have conducted studies in traditional disciplines such as anthropology, economics, history, political science, psychology, and sociology. Case studies have also emerged in areas such as medicine, law, nursing, business, administration, public policy, social work, and education. Case studies may be used as part of a larger study or as a stand-alone design. Case study research may be considered a method for inquiry or an evaluation of a bounded entity, program, or system. Case studies may consist of more than one entity (unit, thing) or of several cases within one entity, but care must be taken to limit the number of cases in order to allow for in-depth analysis and description of each case.

Researchers who have written about case study research have addressed it in different ways, depending on their perspectives and points of view.
Some researchers regard case study as a research process used to investigate a phenomenon in its real-world setting. Some have considered case study to be a design—a particular logic for setting up the study. Others think of case study as a qualitative approach to research that includes particular qualitative methods. Others have depicted it in terms of the final product, a written holistic examination, interpretation, and analysis of one or more entities or social units. Case study also has been defined in terms of the unit of study itself, or the entity being studied. Still other researchers consider that case study research encompasses all these notions taken together in relation to the research questions.

Case Study as a Bounded System

Although scholars in various fields and disciplines have given case study research many forms, what these various definitions and perspectives have in common is the notion from Louis Smith that case study is inquiry about a bounded system. Case studies may be conducted about such entities as a single person or several persons, a single classroom or classrooms, a school, a program within a school, a business, an administrator, or a specific policy, and so on. The case study also may be about a complex, integrated system, as long as researchers are able to put boundaries or limits around the system being researched.

The notion of boundedness may be understood in more than one way. An entity is naturally bounded if the participants have come together by their own means for their own purposes having nothing to do with the research. An example would be a group of students in a particular classroom, a social club that meets on a regular basis, or the staff members in a department of a local business. The entity is naturally bounded because it consists of participants who are together for their own common purposes. A researcher is able to study the entity in its entirety for a time frame consistent with the research questions.

In other instances, an entity may be artificially bounded through the criteria set by a researcher. In this instance, the boundary is suggested by selecting from among participants to study an issue particular to some, but not all, the participants. For example, the researcher might study an issue that pertains only to a particular group of students, such as the study habits of sixth-grade students who are reading above grade level. In this situation, the case is artificially bounded because the researcher establishes the criteria for selection of the participants of the study as it relates to the issue being studied and the research questions being asked.

Selecting case study as a research design is appropriate for particular kinds of questions being asked. For example, if a researcher wants to know how a program works or why a program has been carried out in a particular way, case study can be of benefit. In other words, when a study is of an exploratory or explanatory nature, case study is well suited. In addition, case study works well for understanding processes because the researcher is able to get close to the participants within their local contexts. Case study design helps the researcher understand the complexity of a program or a policy, as well as its implementation and effects on the participants.

Design Decisions

When a researcher begins case study research, a series of decisions must be made concerning the rationale, the design, the purpose, and the type of case study. In terms of the rationale for conducting the study, Robert Stake indicated three specific forms. Case study researchers may choose to learn about a case because of their inherent interest in the case itself. A teacher who is interested in following the work of a student who presents particular learning behaviors would conduct an intrinsic case study. In other words, the case is self-selected due to the inquirer's interest in the particular entity. Another example could be an evaluation researcher who may conduct an intrinsic case to evaluate a particular program. The program itself is of interest, and the evaluator is not attempting to compare that program to others.

On the other hand, an instrumental case is one that lends itself to the understanding of an issue or phenomenon beyond the case itself. For example, examining the change in teacher practices due to an educational reform may lead a researcher to select one teacher's classroom as a case, but with the intent of gaining a general understanding of the reform's effects on classrooms. In this instance, the
case selection is made for further understanding of a larger issue that may be instrumental in informing policy. For this purpose, one case may not be enough to fully understand the reform issue. The researcher may decide to have more than one case in order to see how the reform plays out in different classroom settings or in more than one school. If the intent is to study the reform in more than one setting, then the researcher would conduct a collective case study. Once the researcher has made a decision whether to understand the inherent nature of one case (intrinsic) or to understand a broader issue that may be represented by one or more cases (instrumental or collective), the researcher needs to decide the type of case that will illuminate the entity or issue in question.

Single-Case Design

In terms of design, a researcher needs to decide whether to examine a single case, multiple cases, or several cases embedded within a larger system. In his work with case study research, Robert Yin suggests that one way to think about conducting a study of a single case is to consider it as a holistic examination of an entity that may demonstrate the tenets of a theory (i.e., critical case). If the case is highly unusual and warrants in-depth explanation, it would be considered an extreme or unique case. Another use for a single-case design would be a typical or representative case, in which the researcher is highlighting an everyday situation. The caution here is for the researcher to be well informed enough to know what is typical or commonplace. A revelatory case is one in which researchers are able to observe a phenomenon that had been previously inaccessible. One other type of single case is the longitudinal case, meaning one in which researchers can examine the same entity over time to see changes that occur.

Whether to conduct a holistic design or an embedded design depends on whether researchers are looking at issues related globally to one entity or whether subcases within that entity must be considered. For example, in studying the effects of an educational policy on a large, urban public school district, researchers could use either type of design, depending on what they deem important to examine. If researchers focused on the global aspect of the policy and how it may have changed the way the district structures its financial expenditures, the design would be holistic because the researcher would be exploring the entire school district to see effects. However, if the researcher examined the constraints placed on each regional subdivision within the district, each regional subdivision would become a subunit of the study as an embedded case design. Each regional office would be an embedded case and the district itself the major case. The researcher would procedurally examine each subunit and then reexamine the major case to see how the embedded subunits inform the whole case.

Multiple Case Design

Multiple cases have been considered by some to be a separate type of study, but Yin considers them a variant of single-case design and thus similar in methodological procedures. This design also may be called a collective case, cross-case, or comparative case study. From the previous example, a team of researchers might consider studying the financial decisions made by the five largest school districts in the United States since the No Child Left Behind policy went into effect. Under this design, the research team would study each of the largest districts as an individual case, either alone or with embedded cases within each district.

After conducting the analysis for each district, the team would further conduct analyses across the five cases to see what elements they may have in common. In this way, researchers would potentially add to the understanding of the effects of policy implementation among these large school districts. By selecting only the five largest school districts, the researchers have limited the findings to cases that are relatively similar. If researchers wanted to compare the implementation under highly varied circumstances, they could use maximum variation selection, which entails finding cases with the greatest variation. For example, they may establish criteria for selecting a large urban school district, a small rural district, a medium-sized district in a small city, and perhaps a school district that serves a Native American population on a reservation. By deliberately selecting from among cases that had potential for differing implementation due to their distinct contexts,
the researchers expect to find as much variation as possible.

No matter what studies that involve more than a single case are called, what case studies have in common is that findings are presented as individual portraits that contribute to our understanding of the issues, first individually and then collectively. One question that often arises is whether the use of multiple cases can represent a form of generalizability, in that researchers may be able to show similarities of issues across the cases. The notion of generalizability in an approach that tends to be qualitative in nature may be of concern because of the way in which one selects the cases and collects and analyzes the data. For example, in statistical studies, the notion of generalizability comes from the form of sampling (e.g., randomized sampling to represent a population and a control group) and the types of measurement tools used (e.g., surveys that use Likert-type scales and therefore result in numerical data). In case study research, however, Yin has offered the notion that the different cases are similar to multiple experiments in which the researcher selects among similar and sometimes different situations to verify results. In this way, the cases become a form of generalizing to a theory, either taken from the literature or uncovered and grounded in the data.

Nature of the Case Study

Another decision to be made in the design of a study is whether the purpose is primarily descriptive, exploratory, or explanatory. The nature of a descriptive case study is one in which the researcher uses thick description about the entity being studied so that the reader has a sense of having "been there, done that," in terms of the phenomenon studied within the context of the research setting. Exploratory case studies are those in which the research questions tend to be of the "what can be learned about this issue" type. The goal of this kind of study is to develop working hypotheses about the issue and perhaps to propose further research. An explanatory study is more suitable for delving into how and why things are happening as they are, especially if the events and people involved are to be observed over time.

Case Study Approaches to Data

Data Collection

One of the reasons that case study is such a versatile approach to research is that both quantitative and qualitative data may be used in the study, depending on the research questions asked. While many case studies have qualitative tendencies, due to the nature of exploring phenomena in context, some case study researchers use surveys to find out demographic and self-report information as part of the study. Researchers conducting case studies tend also to use interviews, direct observations, participant observation, and source documents as part of the data analysis and interpretation. The documents may consist of archival records, artifacts, and websites that provide information about the phenomenon in the context of what people make and use as resources in the setting.

Data Analysis

The analytic role of the researcher is to systematically review these data, first making a detailed description of the case and the setting. It may be helpful for the researcher to outline a chronology of actions and events, although in further analysis, the chronology may not be as critical to the study as the thematic interpretations. However, it may be useful in terms of organizing what may otherwise be an unwieldy amount of data.

In further analysis, the researcher examines the data, one case at a time if multiple cases are involved, for patterns of actions and instances of issues. The researcher first notes what patterns are constructed from one set of data within the case and then examines subsequent data collected within the first case to see whether the patterns are consistent. At times, the patterns may be evident in the data alone, and at other times, the patterns may be related to relevant studies from the literature. If multiple cases are involved, a cross-case analysis is then conducted to find what patterns are consistent and under what conditions other patterns are apparent.

Reporting the Case

While case study research has no predetermined reporting format, Stake has proposed an approach
for constructing an outline of the report. He suggests that the researcher start the report with a vignette from the case to draw the reader into time and place. In the next section, the researcher identifies the issue studied and the methods for conducting the research. The next part is a full description of the case and context. Next, in describing the issues of the case, the researcher can build the complexity of the study for the reader. The researcher uses evidence from the case and may relate that evidence to other relevant research. In the next section, the researcher presents the "so what" of the case—a summary of claims made from the interpretation of data. At this point, the researcher may end with another vignette that reminds the reader of the complexity of the case in terms of realistic scenarios that readers may then use as a form of transference to their own settings and experiences.

Versatility

Case study research demonstrates its utility in that the researcher can explore single or multiple phenomena or multiple examples of one phenomenon. The researcher can study one bounded entity in a holistic fashion or multiple subunits embedded within that entity. Through a case study approach, researchers may explore particular ways that participants conduct themselves in their localized contexts, or the researchers may choose to study processes involving program participants. Case study research can shed light on the particularity of a phenomenon or process while opening up avenues of understanding about the entity or entities involved in those processes.

LeAnn Grogan Putney

Further Readings

Creswell, J. W. (2007). Qualitative inquiry and research design: Choosing among five approaches. Thousand Oaks, CA: Sage.
Creswell, J. W., & Maietta, R. C. (2002). Qualitative research. In D. C. Miller & N. J. Salkind (Eds.), Handbook of research design and social measurement (pp. 162–163). Thousand Oaks, CA: Sage.
Gomm, R., Hammersley, M., & Foster, P. (2000). Case study method: Key issues, key texts. Thousand Oaks, CA: Sage.
Merriam, S. B. (1998). Qualitative research and case study applications in education. San Francisco: Jossey-Bass.
Stake, R. E. (1995). The art of case study research. Thousand Oaks, CA: Sage.
Stake, R. E. (2005). Multiple case study analysis. New York: Guilford.
Yin, R. K. (2003). Case study research. Thousand Oaks, CA: Sage.

CATEGORICAL DATA ANALYSIS

A categorical variable consists of a set of non-overlapping categories. Categorical data are counts for those categories. The measurement scale of a categorical variable is ordinal if the categories exhibit a natural ordering, such as opinion variables with categories from "strongly disagree" to "strongly agree." The measurement scale is nominal if there is no inherent ordering. The types of possible analysis for categorical data depend on the measurement scale.

Types of Analysis

When the subjects measured are cross-classified on two or more categorical variables, the table of counts for the various combinations of categories is a contingency table. The information in a contingency table can be summarized and further analyzed through appropriate measures of association and models as discussed below. These measures and models differentiate according to the nature of the classification variables (nominal or ordinal).

Most studies distinguish between one or more response variables and a set of explanatory variables. When the main focus is on the association and interaction structure among a set of response variables, such as whether two variables are conditionally independent given values for the other variables, loglinear models are useful, as described in a later section. More commonly, research questions focus on effects of explanatory variables on a categorical response variable. Those explanatory variables might be categorical, quantitative, or of both types. Logistic regression models are then of
particular interest. Initially such models were developed for binary (success–failure) response variables. They describe the logit, which is log[P(Y = 1)/P(Y = 2)], using the equation

log[P(Y = 1)/P(Y = 2)] = α + β_1 x_1 + β_2 x_2 + ... + β_p x_p,

where Y is the binary response variable and x_1, ..., x_p the set of explanatory variables. The logistic regression model was later extended to nominal and ordinal response variables. For a nominal response Y with J categories, the model simultaneously describes

log[P(Y = 1)/P(Y = J)], log[P(Y = 2)/P(Y = J)], ..., log[P(Y = J − 1)/P(Y = J)].

For ordinal responses, a popular model uses explanatory variables to predict a logit defined in terms of a cumulative probability,

log[P(Y ≤ j)/P(Y > j)], j = 1, 2, ..., J − 1.

For categorical data, the binomial and multinomial distributions play the central role that the normal does for quantitative data. Models for categorical data assuming the binomial or multinomial were unified with standard regression and analysis of variance (ANOVA) models for quantitative data assuming normality through the introduction of the generalized linear model (GLM). This very wide class of models can incorporate data assumed to come from any of a variety of standard distributions (such as the normal, binomial, and Poisson). The GLM relates a function of the mean (such as the log or logit of the mean) to explanatory variables with a linear predictor. Certain GLMs for counts, such as Poisson regression models, relate naturally to loglinear and logistic models for binomial and multinomial responses.

More recently, methods for categorical data have been extended to include clustered data, for which observations within each cluster are allowed to be correlated. A very important special case is that of repeated measurements, such as in a longitudinal study in which each subject provides a cluster of observations taken at different times. One way this is done is to introduce a random effect in the model to represent each cluster, thus extending the GLM to a generalized linear mixed model, the mixed referring to the model's containing both random effects and the usual sorts of fixed effects.

Two-Way Contingency Tables

Two categorical variables are independent if the probability of response in any particular category of one variable is the same for each category of the other variable. The most well-known result on two-way contingency tables is the test of the null hypothesis of independence, introduced by Karl Pearson in 1900. If X and Y are two categorical variables with I and J categories, respectively, then their cross-classification leads to an I × J table of observed frequencies n = (n_ij). Under this hypothesis, the expected cell frequencies are values that have the same marginal totals as the observed counts but perfectly satisfy the hypothesis. They equal m_ij = n π_{i+} π_{+j}, i = 1, ..., I; j = 1, ..., J, where n is the total sample size (n = Σ_{i,j} n_ij) and π_{i+} (π_{+j}) is the ith row (jth column) marginal of the underlying probability matrix π = (π_ij). The corresponding maximum likelihood (ML) estimates equal m̂_ij = n p_{i+} p_{+j} = n_{i+} n_{+j}/n, where p_ij denotes the sample proportion in cell (i, j). The hypothesis of independence is tested through Pearson's chi-square statistic,

X^2 = Σ_{i,j} (n_ij − m̂_ij)^2 / m̂_ij.  (1)

The p value is the right-tail probability above the observed X^2 value. The distribution of X^2 under the null hypothesis is approximated by a chi-square distribution with (I − 1)(J − 1) degrees of freedom, provided that the individual expected cell frequencies are not too small. In fact, Pearson claimed that the associated degrees of freedom (df) were IJ − 1, and R. A. Fisher corrected this in 1922. Fisher later proposed a small-sample test of independence for 2 × 2 tables, now referred to as Fisher's exact test. This test was later extended to I × J tables as well as to more complex hypotheses in both two-way and multiway tables. When a contingency table has ordered row or column categories (ordinal variables), specialized methods can take advantage of that ordering.
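The computation behind Equation 1 can be carried out in a few lines. The following sketch is an editorial addition, not part of the original entry, and the counts in it are invented purely for illustration:

```python
# Pearson's chi-square test of independence for an I x J table (counts invented).
# Expected frequencies under independence: m_ij = n_i+ * n_+j / n;
# X^2 = sum over cells of (n_ij - m_ij)^2 / m_ij, with df = (I - 1)(J - 1).
from math import fsum

table = [[30, 15, 5],
         [20, 25, 15]]

row_totals = [fsum(row) for row in table]
col_totals = [fsum(col) for col in zip(*table)]
n = fsum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
x2 = fsum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
          for i in range(len(table)) for j in range(len(table[0])))
df = (len(table) - 1) * (len(table[0]) - 1)  # here (2 - 1)(3 - 1) = 2
```

The p value is then the right-tail probability of X^2 under a chi-square distribution with df degrees of freedom (e.g., `scipy.stats.chi2.sf(x2, df)`); SciPy's `chi2_contingency` bundles the entire procedure, and `fisher_exact` covers the small-sample 2 × 2 case mentioned above.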
Ultimately more important than mere testing of significance is the estimation of the strength of the association. For ordinal data, measures can incorporate information about the direction (positive or negative) of the association as well.

More generally, models can be formulated that are more complex than independence, and expected frequencies m_ij can be estimated under the constraint that the model holds. If m̂_ij are the corresponding maximum likelihood estimates, then, to test the hypothesis that the model holds, one can use the Pearson statistic (Equation 1) or the statistic that results from the standard statistical approach of conducting a likelihood-ratio test, which is

G^2 = 2 Σ_{i,j} n_ij ln(n_ij / m̂_ij).  (2)

Under the null hypothesis, both statistics have the same large-sample chi-square distribution.

The special case of the 2 × 2 table occurs commonly in practice, for instance for comparing two groups on a success/fail-type outcome. In a 2 × 2 table, the basic measure of association is the odds ratio. For the probability table

π_11  π_12
π_21  π_22

the odds ratio is defined as θ = (π_11 π_22)/(π_12 π_21). Independence corresponds to θ = 1. Inference about the odds ratio can be based on the fact that for large samples, log(θ̂) is approximately distributed as

N(log(θ), 1/n_11 + 1/n_12 + 1/n_21 + 1/n_22).

The odds ratio relates to the relative risk r. In particular, if we assume that the rows of the above 2 × 2 table represent two independent groups of subjects (A and B) and the columns correspond to presence/absence of a disease, then the relative risk for this disease is defined as r = π_A/π_B, where π_A = π_11/π_{1+} is the probability of disease for the first group and π_B is defined analogously. Since θ = r(1 − π_B)/(1 − π_A), it follows that θ ≈ r whenever π_A and π_B are close to 0.

Models for Two-Way Contingency Tables

Independence between the classification variables X and Y (i.e., m_ij = n π_{i+} π_{+j} for all i and j) can equivalently be expressed in terms of a loglinear model as

log(m_ij) = λ + λ_i^X + λ_j^Y,  i = 1, ..., I; j = 1, ..., J.

The more general model that allows association between the variables is

log(m_ij) = λ + λ_i^X + λ_j^Y + λ_ij^XY,  i = 1, ..., I; j = 1, ..., J.  (3)

Loglinear models describe the way the categorical variables and their association influence the count in each cell of the contingency table. They can be considered as a discrete analogue of ANOVA. The two-factor interaction terms relate to odds ratios describing the association. As in ANOVA models, some parameters are redundant in these specifications, and software reports estimates by assuming certain constraints.

The general model (Equation 3) does not impose any structure on the underlying association, and so it fits the data perfectly. Associations can be modeled through association models. The simplest such model, the linear-by-linear association model, is relevant when both classification variables are ordinal. It replaces the interaction term λ_ij^XY by the product φ μ_i ν_j, where μ_i and ν_j are known scores assigned to the row and column categories, respectively. This model,

log(m_ij) = λ + λ_i^X + λ_j^Y + φ μ_i ν_j,  i = 1, ..., I; j = 1, ..., J,  (4)

has only one parameter more than the independence model, namely φ. Consequently, the associated df are (I − 1)(J − 1) − 1, and once it holds, independence can be tested conditionally on it by testing φ = 0 via a more powerful test with df = 1. The linear-by-linear association model (Equation 4) can equivalently be expressed in terms of the (I − 1)(J − 1) local odds ratios

θ_ij = (π_ij π_{i+1,j+1})/(π_{i,j+1} π_{i+1,j}),  i = 1, ..., I − 1; j = 1, ..., J − 1,

defined by adjacent rows and columns of the table:
θ_ij = exp[φ(μ_{i+1} − μ_i)(ν_{j+1} − ν_j)],  i = 1, ..., I − 1; j = 1, ..., J − 1.  (5)

With equally spaced scores, all the local odds ratios are identical, and the model is referred to as uniform association. More general models treat one or both sets of scores as parameters. Association models have been mainly developed by L. Goodman.

Another popular method for studying the pattern of association between the row and column categories of a two-way contingency table is correspondence analysis (CA). It is mainly a descriptive method. CA assigns optimal scores to the row and column categories and plots these scores in two or three dimensions, thus providing a reduced-rank display of the underlying association.

The special case of square I × I contingency tables with the same categories for the rows and the columns occurs with matched-pairs data. For example, such tables occur in the study of rater agreement and in the analysis of social mobility. A condition of particular interest for such data is marginal homogeneity, that π_{i+} = π_{+i}, i = 1, ..., I. For the 2 × 2 case of binary matched pairs, the test comparing the margins using the chi-square statistic (n_12 − n_21)^2/(n_12 + n_21) is called McNemar's test.

methods that are available for ordinary regression models, such as stepwise selection methods and fit indices such as the Akaike Information Criterion. Loglinear models for multiway tables can include higher order interactions up to the order equal to the dimension of the table. Two-factor terms describe conditional association between two variables, three-factor terms describe how the conditional association varies among categories of a third variable, and so forth. CA has also been extended to higher dimensional tables, leading to multiple CA.

Historically, a common way to analyze higher-way contingency tables was to analyze all the two-way tables obtained by collapsing the table over the other variables. However, the two-way associations can be quite different from conditional associations in which other variables are controlled. The association can even change direction, a phenomenon known as Simpson's paradox. Conditions under which tables can be collapsed are most easily expressed and visualized using graphical models that portray each variable as a node and a conditional association as a connection between two nodes. The patterns of associations and their strengths in two-way or multiway tables can also be illustrated through special plots called mosaic plots.
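The 2 × 2 summaries introduced earlier, the odds ratio with its large-sample Wald interval and McNemar's statistic for binary matched pairs, can be computed directly. This sketch is an editorial addition with invented counts:

```python
# Odds ratio for a 2 x 2 table and a 95% Wald interval on the log scale
# (counts invented for illustration).
from math import exp, log, sqrt

n11, n12, n21, n22 = 40, 10, 20, 30

theta_hat = (n11 * n22) / (n12 * n21)          # sample odds ratio
se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)       # SE of log(theta_hat)
lower = exp(log(theta_hat) - 1.96 * se)
upper = exp(log(theta_hat) + 1.96 * se)

# McNemar's statistic for binary matched pairs uses only the discordant
# counts: (n12 - n21)^2 / (n12 + n21), referred to chi-square with df = 1.
mcnemar = (n12 - n21) ** 2 / (n12 + n21)
```

Because independence corresponds to θ = 1, an interval that excludes 1 is evidence of association between the two classifications.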
have facility for fitting GLMs, and most of the standard methods for categorical data can be viewed as special cases of such modeling. Bayesian analysis of categorical data can be carried out through WINBUGS. Specialized software, such as the programs StatXact and LogXact developed by Cytel Software, is available for small-sample exact methods of inference for contingency tables and for logistic regression parameters.

Maria Kateri and Alan Agresti

See also Categorical Variable; Correspondence Analysis; General Linear Model; R; SAS; Simpson's Paradox; SPSS

Further Readings

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). New York: Wiley.
Agresti, A. (2010). Analysis of ordinal categorical data. New York: Wiley.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge: MIT Press.
Congdon, P. (2005). Bayesian models for categorical data. New York: Wiley.
Goodman, L. A. (1986). Some useful extensions of the usual correspondence analysis and the usual log-linear models approach in the analysis of contingency tables with or without missing entries. International Statistical Review, 54, 243–309.
Kateri, M. (2008). Categorical data. In S. Kotz (Ed.), Encyclopedia of statistical sciences (2nd ed.). Hoboken, NJ: Wiley-Interscience.

Websites

Cytel Software: http://www.cytel.com
WINBUGS: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

CATEGORICAL VARIABLE

Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership. Instead, categorical variables often provide valuable social-oriented information that is not quantitative by nature (e.g., hair color, religion, ethnic group).

In the hierarchy of measurement levels, categorical variables are associated with the two lowest variable classification orders, nominal or ordinal scales, depending on whether the variable groups exhibit an intrinsic ranking. A nominal measurement level consists purely of categorical variables that have no ordered structure for intergroup comparison. If the categories can be ranked according to a collectively accepted protocol (e.g., from lowest to highest), then these variables are ordered categorical, a subset of the ordinal level of measurement.

Categorical variables at the nominal level of measurement have two properties. First, the categories are mutually exclusive. That is, an object can belong to only one category. Second, the data categories have no logical order. For example, researchers can measure research participants' religious backgrounds, such as Jewish, Protestant, Muslim, and so on, but they cannot order these variables from lowest to highest. It should be noted that when categories get numeric labels such as male = 0 and female = 1 or control group = 0 and treatment group = 1, the numbers are merely labels and do not indicate that one category is "better" on some aspect than another. The numbers are used as symbols (codes) and do not reflect either quantities or a rank ordering. Dummy coding is the quantification of a variable with two categories (e.g., boys, girls). Dummy coding allows the researcher to conduct specific analyses such as the point-biserial correlation coefficient, in which a dichotomous categorical variable is related to a variable that is continuous. One example of the use of point-biserial correlation is to compare males with females on a measure of mathematical ability.

Categorical variables at the ordinal level of measurement have the following properties: (a) the data categories are mutually exclusive, (b) the data categories have some logical order, and (c) the data categories are scaled according to the amount of a particular characteristic. Grades in courses (i.e., A, B, C, D, and F) are an example. The person who earns an A in a course has a higher level of achievement than one who gets a B, according to
the criteria used for measurement by the course instructor. However, one cannot assume that the difference between an A and a B is the same as the difference between a B and a C. Similarly, researchers might set up a Likert-type scale to measure level of satisfaction with one's job and assign a 5 to indicate extremely satisfied, 4 to indicate very satisfied, 3 to indicate moderately satisfied, and so on. A person who gives a rating of 5 feels more job satisfaction than a person who gives a rating of 3, but it has no meaning to say that one person has 2 units more satisfaction with a job than another has or exactly how much more satisfied one is with a job than another person is.

In addition to verbal descriptions, categorical variables are often presented visually using tables and charts that indicate the group frequency (i.e., the number of values in a given category). Contingency tables show the number of counts in each category and increase in complexity as more attributes are examined for the same object. For example, a car can be classified according to color, manufacturer, and model. This information can be displayed in a contingency table showing the number of cars that meet each of these characteristics (e.g., the number of cars that are white and manufactured by General Motors). This same information can be expressed graphically using a bar chart or pie chart. Bar charts display the data as elongated bars with lengths proportional to category frequency, with the category labels typically on the x-axis and the number of values on the y-axis. Pie charts, on the other hand, show categorical data as proportions of the total value or as a percentage or fraction. Each category constitutes a section of a circular graph, or "pie," and represents a subset of the 100% or fractional total. In the car example, if 25 cars out of a sample of 100 cars were white, then 25%, or one quarter, of the circular pie chart would be shaded, and the remaining portion of the chart would be shaded alternative colors based on the remaining categorical data (i.e., cars in colors other than white).

Specific statistical tests that differ from other quantitative approaches are designed to account for data at the categorical level. The only measure of central tendency appropriate for categorical variables at the nominal level is the mode (the most frequent category, or categories if there is more than one mode), but at the ordinal level, the median, or point below which 50% of the scores fall, is also used. The chi-square distribution is used for categorical data at the nominal level: observed frequencies in each category are compared with the theoretical or expected frequencies. Types of correlation coefficients that use categorical data include the point biserial; Spearman rho, in which both variables are at the ordinal level; and phi, in which both variables are dichotomous (e.g., boys vs. girls on a yes–no question). Categorical variables can also be used in various statistical analyses such as t tests, analysis of variance, multivariate analysis of variance, simple and multiple regression analysis, and discriminant analysis.

Karen D. Multon and Jill S. M. Coleman

See also Bar Chart; Categorical Data Analysis; Chi-Square Test; Levels of Measurement; Likert Scaling; Nominal Scale; Ordinal Scale; Pie Chart; Variable

Further Readings

Siegel, A. F., & Morgan, C. J. (1996). Statistics and data analysis: An introduction (2nd ed.). New York: Wiley.
Simonoff, J. S. (2003). Analyzing categorical data. New York: Springer.

CAUSAL-COMPARATIVE DESIGN

A causal-comparative design is a research design that seeks to find relationships between independent and dependent variables after an action or event has already occurred. The researcher's goal is to determine whether the independent variable affected the outcome, or dependent variable, by comparing two or more groups of individuals. There are similarities and differences between causal-comparative research, also referred to as ex post facto research, and both correlational and experimental research. This entry discusses these differences, as well as the benefits, process, limitations, and criticism of this type of research design. To demonstrate how to use causal-comparative research, examples in education are presented.
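To make the basic comparison concrete, the following sketch is an editorial illustration with invented scores, echoing the ACT-program example used in this entry: two preexisting groups are compared on an outcome with a two-sample significance test.

```python
# Comparing two preexisting groups on an outcome (scores invented):
# students who completed a computer-based ACT program vs. students who
# did not. Group membership was not assigned by the researcher, so a
# significant difference indicates association, not proven causation.
from statistics import mean, stdev

completed = [24, 26, 22, 25, 27, 23, 26]
not_completed = [21, 23, 20, 24, 22, 21, 23]

m1, m2 = mean(completed), mean(not_completed)
s1, s2 = stdev(completed), stdev(not_completed)
n1, n2 = len(completed), len(not_completed)

# Welch's t statistic for the difference in group means.
t = (m1 - m2) / ((s1 ** 2 / n1) + (s2 ** 2 / n2)) ** 0.5
```

A p value would come from a t distribution with Welch-adjusted degrees of freedom; `scipy.stats.ttest_ind(completed, not_completed, equal_var=False)` performs the same test in one call.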
Comparisons With Correlational Research

Many similarities exist between causal-comparative research and correlational research. Both methods are useful when experimental research has been deemed impossible or unethical as the research design for a particular question. Both causal-comparative and correlational research designs attempt to determine relationships among variables, but neither allows for the actual manipulation of these variables. Thus, neither can definitively state that a true cause-and-effect relationship occurred between these variables. Finally, neither type of design randomly places subjects into control and experimental groups, which limits the generalizability of the results.

Despite similarities, there are distinct differences between causal-comparative and correlational research designs. In causal-comparative research, the researcher investigates the effect of an independent variable on a dependent variable by comparing two or more groups of individuals. For example, an educational researcher may want to determine whether a computer-based ACT program has a positive effect on ACT test scores. In this example, the researcher would compare the ACT scores from a group of students that completed the program with scores from a group that did not complete the program. In correlational research, the researcher works with only one group of individuals. Instead of comparing two groups, the correlational researcher examines the effect of one or more independent variables on the dependent variable within the same group of subjects. Using the same example as above, the correlational researcher would select one group of subjects who have completed the computer-based ACT program. The researcher would use statistical measures to determine whether there was a positive relationship between completion of the ACT program and the students' ACT scores.

Comparisons With Experimental Research

A few aspects of causal-comparative research parallel experimental research designs. Unlike correlational research, both experimental research and causal-comparative research typically compare two or more groups of subjects. Research subjects are generally split into groups on the basis of the independent variable that is the focus of the study. Another similarity is that the goal of both types of research is to determine what effect the independent variable may or may not have on the dependent variable or variables.

While the premises of the two research designs are comparable, there are vast differences between causal-comparative research and experimental research. First and foremost, causal-comparative research occurs after the event or action has been completed. It is a retrospective way of determining what may have caused something to occur. In true experimental research designs, the researcher manipulates the independent variable in the experimental group. Because the researcher has more control over the variables in an experimental research study, the argument that the independent variable caused the change in the dependent variable is much stronger. Another major distinction between the two types of research is random sampling. In causal-comparative research, the research subjects are already in groups because the action or event has already occurred, whereas subjects in experimental research designs are randomly selected prior to the manipulation of the variables. This allows for wider generalizations to be made from the results of the study.

Table 1 breaks down the causal-comparative, correlational, and experimental methods in reference to whether each investigates cause-effect and whether the variables can be manipulated. In addition, it notes whether groups are randomly assigned and whether the methods study groups or individuals.

When to Use Causal-Comparative Research Designs

Although experimental research results in more compelling arguments for causation, there are many times when such research cannot, or should not, be conducted. Causal-comparative research provides a viable form of research that can be conducted when other methods will not work. There are particular independent variables that are not capable of being manipulated, including gender, ethnicity, socioeconomic level, education level, and religious preferences. For instance, if researchers intend to examine whether ethnicity affects
self-esteem in a rural high school, they cannot manipulate a subject's ethnicity. This independent variable has already been decided, so the researchers must look to another method of determining cause. In this case, the researchers would group students according to their ethnicity and then administer self-esteem assessments. Although the researchers may find that one ethnic group has higher scores than another, they must proceed with caution when interpreting the results. In this example, it might be possible that one ethnic group is also from a higher socioeconomic demographic, which may mean that the socioeconomic variable affected the assessment scores.

Some independent variables should not be manipulated. In educational research, for example, ethical considerations require that the research method not deny potentially useful services to students. For instance, if a guidance counselor wanted to determine whether advanced placement course selection affected college choice, the counselor could not ethically force some students to take certain classes and prevent others from taking the same classes. In this case, the counselor could still compare students who had completed advanced placement courses with those who had not, but causal conclusions are more difficult than with an experimental design.

Furthermore, causal-comparative research may prove to be the design of choice even when experimental research is possible. Experimental research is both time-consuming and costly. Many school districts do not have the resources to conduct a full-scale experimental research study, so educational leaders may choose to do a causal-comparative study. For example, the leadership might want to determine whether a particular math curriculum would improve math ACT scores more effectively than the curriculum already in place in the school district. Before implementing the new curriculum throughout the district, the school leaders might conduct a causal-comparative study, comparing their district's math ACT scores with those from a school district that has already used the curriculum. In addition, causal-comparative research is often selected as a precursor to experimental research. In the math curriculum example, if the causal-comparative study demonstrates that the curriculum has a positive effect on student math ACT scores, the school leaders may then choose to conduct a full experimental research study by piloting the curriculum in one of the schools in the district.

Figure 1 (flowchart): Collect Data; analyze with Inferential Statistics or Descriptive Statistics; Report Findings

Conducting Causal-Comparative Research

The basic outline for conducting causal-comparative research is similar to that of other research
designs. Once the researcher determines the focus of the research and develops hypotheses, he or she selects a sample of participants for both an experimental and a control group. Depending on the type of sample and the research question, the researcher may measure potentially confounding variables to include them in eventual analyses. The next step is to collect data. The researcher then analyzes the data, interprets the results, and reports the findings. Figure 1 illustrates this process.

Determine the Focus of Research

As in other research designs, the first step in conducting a causal-comparative research study is to identify a specific research question and generate a hypothesis. In doing so, the researcher identifies a dependent variable, such as high dropout rates in high schools. The next step is to explore reasons the dependent variable has occurred or is occurring. In this example, several issues may affect dropout rates, including such elements as parental support, socioeconomic level, gender, ethnicity, and teacher support. The researcher will need to select which issue is of importance to his or her research goals. One hypothesis might be, "Students from lower socioeconomic levels drop out of high school at higher rates than students from higher socioeconomic levels." Thus, the independent variable in this scenario would be the socioeconomic levels of high school students.

It is important to remember that many factors affect dropout rates. Controlling for such factors in causal-comparative research is discussed later in this entry. Once the researcher has identified the main research problem, he or she operationally defines the variables. In the above hypothesis, the dependent variable of high school dropout rates is fairly self-explanatory. However, the researcher would need to establish what constitutes lower socioeconomic levels and higher socioeconomic levels. The researcher may also wish to clarify the target population, such as what specific type of high school will be the focus of the study. Using the above example, the final research question might be, "Does socioeconomic status affect dropout rates in the Appalachian rural high schools in East Tennessee?" In this case, causal-comparative would be the most appropriate method of research because the independent variable of socioeconomic status cannot be manipulated.

Because many factors may influence the dependent variable, the researcher should be aware of, and possibly test for, a variety of independent variables. For instance, if the researcher wishes to determine whether socioeconomic level affects a student's decision to drop out of high school, the researcher may also want to test for other potential causes, such as parental support, academic ability, disciplinary issues, and other viable options. If other variables can be ruled out, the case for socioeconomic level's influencing the dropout rate will be much stronger.

Participant Sampling and Threats to Internal Validity

In causal-comparative research, two or more groups of participants are compared. These groups are defined by the different levels of the independent variable(s). In the previous example, the researcher compares a group of high school dropouts with a group of high school students who have not dropped out of school. Although this is not an experimental design, causal-comparative researchers may still randomly select participants within each group. For example, a researcher may select every fifth dropout and every fifth high school student. However, because the participants are not randomly selected and placed into groups, internal validity is threatened. To strengthen the research design and counter threats to internal validity, the researcher might choose to impose the selection techniques of matching, using homogeneous subgroups, or analysis of covariance (ANCOVA), or both.

Matching

One method of strengthening the research sample is to select participants by matching. Using this technique, the researcher identifies one or more characteristics and selects participants who have these characteristics for both the control and the experimental groups. For example, if the researcher wishes to control for gender and grade level, he or she would ensure that both groups matched on these characteristics. If a male 12th-grade student is selected for the experimental
group, then a male 12th-grade student must be selected for the control group. In this way the researcher is able to control these two extraneous variables.

Comparing Homogeneous Subgroups

Another control technique used in causal-comparative research is to compare subgroups that are clustered according to a particular variable. For example, the researcher may choose to group and compare students by grade level. He or she would then categorize the sample into subgroups, comparing 9th-grade students with other 9th-grade students, 10th-grade students with other 10th-grade students, and so forth. Thus, the researcher has controlled the sample for grade level.

Analysis of Covariance

Using the ANCOVA statistical method, the researcher is able to adjust previously disproportionate scores on a pretest in order to equalize the groups on some covariate (control variable). The researcher may want to control for ACT scores and their impact on high school dropout rates. In comparing the groups, if one group's ACT scores are much higher or lower than the other's, the researcher may use the technique of ANCOVA to balance the two groups. This technique is particularly useful when the research design includes a pretest, which assesses the dependent variable before any manipulation or treatment has occurred. For example, to determine the effect of an ACT curriculum on students, a researcher would determine the students' baseline ACT scores. If the control group had scores that were much higher to begin with than the experimental group's scores, the researcher might use the ANCOVA technique to balance the two groups.

Instrumentation and Data Collection

The methods of collecting data for a causal-comparative research study do not differ from those of any other method of research. Questionnaires, pretests and posttests, various assessments, and behavior observation are common methods for collecting data in any research study. It is important, however, to also gather as much demographic information as possible, especially if the researcher is planning to use the control method of matching.

Data Analysis and Interpretation

Once the data have been collected, the researcher analyzes and interprets the results. Although causal-comparative research is not true experimental research, there are many methods of analyzing the resulting data, depending on the research design. It is important to remember that no matter what methods are used, causal-comparative research does not definitively prove cause-and-effect results. Nevertheless, the results will provide insights into causal relationships between the variables.

When using inferential statistics in causal-comparative research, the researcher hopes to demonstrate that a relationship exists between the independent and dependent variables. Again, the appropriate method of analyzing data using this type of statistics is determined by the design of the research study. The three most commonly used methods for causal-comparative research are the chi-square test, paired-samples and independent t tests, and analysis of variance (ANOVA) or ANCOVA.

Pearson's chi-square, the most commonly used chi-square test, allows the researcher to determine whether there is a statistically significant relationship between the experimental and control groups based on frequency counts. This test is useful when the researcher is working with nominal data, that is, different categories of treatment or participant characteristics, such as gender. For example, if a researcher wants to determine whether males and females learn more efficiently from different teaching styles, the researcher may compare a group of male students with a group of female students. Both groups may be asked whether they learn better from audiovisual aids, group discussion, or lecture. The researcher could use chi-square testing to analyze the data for evidence of a relationship.
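The chi-square analysis just described can be sketched in a few lines of Python. The counts below are invented for illustration: how many male and female students report learning best from audiovisual aids, group discussion, or lecture. Because the table is 2 x 3, it has 2 degrees of freedom, and the chi-square survival function with 2 degrees of freedom has the closed form exp(-x/2), so no statistics library is needed.

```python
# Hypothetical counts (not from the entry): preferred teaching style by gender.
from math import exp

observed = {
    "male":   [20, 30, 50],   # audiovisual, discussion, lecture
    "female": [40, 30, 30],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
n = sum(row_totals)

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected counts assume gender and preferred style are independent.
chi2 = sum(
    (obs - rt * ct / n) ** 2 / (rt * ct / n)
    for rt, row in zip(row_totals, rows)
    for ct, obs in zip(col_totals, row)
)

# (2 - 1) * (3 - 1) = 2 degrees of freedom, so the p value is exp(-chi2 / 2).
p_value = exp(-chi2 / 2)
print(chi2, p_value)  # chi2 is about 11.67, p is about 0.003
```

With these made-up counts the small p value would be taken as evidence of a relationship between gender and preferred teaching style, in the sense described above.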
Another method of testing relationships in causal-comparative research is to use independent or dependent t tests. When the researcher is comparing the mean scores of two groups, these tests can determine whether there is a significant difference between the control and experimental groups. The independent t test is used in research designs when no controls have been applied to the samples, while the dependent t test is appropriate for designs in which matching has been applied to the samples. One example of the use of t testing in causal-comparative research is to determine whether there is a significant difference in math course grades between two groups of elementary school students when one group has completed a math tutoring course. If the two samples were matched on certain variables such as gender and parental support, the dependent t test would be used. If no matching was involved, the independent t test would be the test of choice. The results of the t test allow the researcher to determine whether there is a statistically significant relationship between the independent variable of the math tutoring course and the dependent variable of math course grade.

To test for relationships between three or more groups and a continuous dependent variable, a researcher might select the statistical technique of one-way ANOVA. Like the independent t test, this test determines whether there is a significant difference between groups based on their mean scores. In the example of the math tutoring course, the researcher may want to determine the effects of the course for students who attended daily sessions and students who attended weekly sessions, while also assessing students who never attended sessions. The researcher could compare the average math grades of the three groups to determine whether the tutoring course had a significant impact on the students' overall math grades.

Limitations

Although causal-comparative research is effective in establishing relationships between variables, there are many limitations to this type of research. Because causal-comparative research occurs ex post facto, the researcher has no control over the variables and thus cannot manipulate them. In addition, there are often variables other than the independent variable(s) that may impact the dependent variable(s). Thus, the researcher cannot be certain that the independent variable caused the changes in the dependent variable. In order to counter this issue, the researcher must test several different theories to establish whether other variables affect the dependent variable. The researcher can reinforce the research hypothesis if he or she can demonstrate that other variables do not have a significant impact on the dependent variable.

Reversal causation is another issue that may arise in causal-comparative research. This problem occurs when it is not clear whether the independent variable caused the changes in the dependent variable or the dependent variable caused the independent variable to occur. For example, if a researcher hoped to determine the success rate of an advanced English program on students' grades, he or she would have to determine whether the English program had a positive effect on the students or, in the case of reversal causation, whether students who make higher grades do better in the English program. In this scenario, the researcher could establish which event occurred first. If the students had lower grades before taking the course, then the argument that the course impacted the grades would be stronger.

The inability to construct random samples is another limitation in causal-comparative research. There is no opportunity to randomly choose participants for the experimental and control groups because the events or actions have already occurred. Without random assignment, the results cannot be generalized to the public, and thus the researcher's results are limited to the population that has been included in the research study. Despite this problem, researchers may strengthen their argument by randomly selecting participants from the previously established groups. For example, if there were 100 students who had completed a computer-based learning course, the researcher would randomly choose 20 students to compare with 20 randomly chosen students who had not completed the course. Another method of reinforcing the study would be to test the hypothesis with several different population samples. If the results are the same in all or most of the samples, the argument will be more convincing.
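The random selection from preexisting groups described above can be sketched as follows. The student identifiers are hypothetical; the point is that the groups already exist, and the researcher controls only which 20 members of each group enter the comparison.

```python
# Sketch of random selection from preexisting groups: 100 students who
# completed a computer-based learning course and 100 who did not.
import random

completers = [f"C{i:03d}" for i in range(100)]      # hypothetical IDs
non_completers = [f"N{i:03d}" for i in range(100)]

rng = random.Random(42)  # fixed seed so the selection is reproducible
experimental_sample = rng.sample(completers, 20)
control_sample = rng.sample(non_completers, 20)

print(len(experimental_sample), len(control_sample))  # 20 20
```

Note that this random selection improves the representativeness of each sample within its own group, but, as the entry stresses, it does not create the random assignment to conditions that true experiments rely on.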
Criticisms

There have been many criticisms of causal-comparative research. For the most part, critics reject the idea that causal-comparative research results should be interpreted as evidence of causal relationships. These critics believe that there are too many limitations in this type of research to allow for a suggestion of cause and effect. Some critics are frustrated with researchers who hold that causal-comparative research provides stronger causal evidence than correlational research does. Instead, they maintain that neither type of research can produce evidence of a causal relationship, so neither is better than the other. Most of these critics argue that experimental research designs are the only method of research that can illustrate any type of causal relationship between variables. Almost all agree, however, that experimental designs potentially provide the strongest evidence for causation.

Ernest W. Brewer and Jennifer Kuhn

See also Cause and Effect; Correlation; Experimental Design; Ex Post Facto Study; Quasi-Experimental Designs

Further Readings

Fraenkel, J. R., & Wallen, N. E. (2009). How to design and evaluate research in education. New York: McGraw-Hill.
Gay, L. R., Mills, G. E., & Airasian, P. (2009). Educational research: Competencies for analysis and applications. Upper Saddle River, NJ: Pearson Education.
Lodico, M. G., Spaulding, D. T., & Voegtle, K. H. (2006). Methods in educational research: From theory to practice. San Francisco: Jossey-Bass.
Mertler, C. A., & Charles, C. M. (2005). Introduction to educational research. Boston: Pearson.
Suter, W. N. (1998). Primer of educational research. Boston: Allyn and Bacon.

CAUSE AND EFFECT

Cause and effect refers to a relationship between two phenomena in which one phenomenon is the reason behind the other. For example, eating too much fast food without any physical activity leads to weight gain. Here eating without any physical activity is the "cause" and weight gain is the "effect." Another popular example in the discussion of cause and effect is that of smoking and lung cancer. A question that has surfaced in cancer research in the past several decades is, What is the effect of smoking on an individual's health? Also asked is the question, Does smoking cause lung cancer? Using data from observational studies, researchers have long established the relationship between smoking and the incidence of lung cancer; however, it took compelling evidence from several studies over several decades to establish smoking as a "cause" of lung cancer.

The term effect has been used frequently in scientific research. Most of the time, a statistically significant result from a linear regression or correlation analysis between two variables X and Y is described as an effect. Does X really cause Y or just relate to Y? The association (correlation) of two variables with each other in the statistical sense does not imply that one is the cause and the other is the effect. There needs to be a mechanism that explains the relationship in order for the association to be a causal one. For example, without the discovery of the substance nicotine in tobacco, it would have been difficult to establish the causal relationship between smoking and lung cancer. Tobacco companies have claimed that since there is not a single randomized controlled trial that establishes the differences in death from lung cancer between smokers and nonsmokers, there was no causal relationship. However, a cause-and-effect relationship is established by observing the same phenomenon in a wide variety of settings while controlling for other suspected mechanisms.

Statistical correlation (that is, association) describes how the values of a variable Y of a specific population are associated with the values of another variable X from the same population. For example, the death rate from lung cancer increases with increased age in the general population. The association or correlation describes the situation that there is a relationship between age and the death rate from lung cancer. Randomized prospective studies are often used as a tool to establish a causal effect. Time is a key element in causality because the cause must happen prior to the effect. Causes are often referred to as treatments or exposures in a study.
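The point that correlation alone does not establish causation can be illustrated with a small sketch. The numbers below are invented: two "measures" are each deterministic functions of age plus small wiggles, so they are strongly correlated with each other even though neither causes the other.

```python
# Toy illustration: two variables that merely track a third (age) are
# strongly correlated without any causal link between them.
from math import sqrt

age = list(range(20, 80, 5))                       # 12 hypothetical ages
x = [2.0 * a + (1 if i % 2 else -1) for i, a in enumerate(age)]
y = [0.5 * a + (2 if i % 3 == 0 else -1) for i, a in enumerate(age)]

def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

r = pearson_r(x, y)
print(round(r, 3))  # close to 1, yet x does not cause y
```

As the entry argues, only a mechanism (or a design such as randomization) can turn such an association into a causal claim.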
Suppose a causal relationship between an investigational drug A and response Y needs to be established. Suppose Y_A represents the response when the participant is treated with A and Y_0 is the response when the subject is treated with placebo under the same conditions. The causal effect of the investigational drug is defined as the population average δ = E(Y_A − Y_0). However, a person cannot be treated with both placebo and Treatment A under the same conditions. Each participant in a randomized study will usually have an equal potential of receiving Treatment A or the placebo. The responses from the treatment group and the placebo group are collected at a specific time after exposure to the treatment or placebo. Since participants are randomized to the two groups, it is expected that the conditions (represented by covariates) are balanced between the two groups. Therefore, randomization controls for other possible causes that can affect the response Y, and hence the difference between the average responses from the two groups can be thought of as an estimated causal effect of Treatment A on Y.

Even though a randomized experiment is a powerful tool for establishing a causal relationship, a randomized study usually needs a lot of resources and time, and sometimes it cannot be implemented for ethical or practical reasons. Alternatively, an observational study may be a good tool for causal inference. In an observational study, the probability of receiving (or not receiving) treatment is assessed and accounted for. In the example of the effect of smoking on lung cancer, smoking and not smoking are the treatments. However, for ethical reasons, it is not practical to randomize subjects to treatments. Therefore, researchers had to rely on observational studies to establish the causal effect of smoking on lung cancer.

Causal inference plays a significant role in medicine, epidemiology, and social science. An issue about the average treatment effect is also worth mentioning. The average treatment effect between two treatments, δ = E(Y_1) − E(Y_2), is defined as the difference between the two outcomes, but, as mentioned previously, a subject can receive only one of the "rival" treatments. In other words, it is impossible for a subject to have two outcomes at the same time. Y_1 and Y_2 are called counterfactual outcomes. Therefore, the average treatment effect can never be observed. In the causal inference literature, several estimating methods for the average treatment effect have been proposed to deal with this obstacle. Also, for observational studies, estimators for the average treatment effect with confounders controlled have been proposed.

Abdus S. Wahed and Yen-Chih Hsu

See also Clinical Trial; Observational Research; Randomization Tests

Further Readings

Freedman, D. (2005). Statistical models: Theory and practice. Cambridge, UK: Cambridge University Press.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.

CEILING EFFECT

The term ceiling effect refers to a measurement limitation that occurs when the highest possible score, or a score close to the highest, on a test or measurement instrument is reached, thereby decreasing the likelihood that the testing instrument has accurately measured the intended domain. A ceiling effect can occur with questionnaires, standardized tests, or other measurements used in research studies. A person's reaching the ceiling, or scoring positively on all or nearly all the items on a measurement instrument, leaves few items to indicate whether the person's true level of functioning has been accurately measured. Therefore, whether a large percentage of individuals reach the ceiling on an instrument or an individual scores very high on an instrument, the researcher or interpreter has to consider that what has been measured may be more a reflection of the parameters of what the instrument is able to measure than of how the individuals may be ultimately functioning. In addition, when the upper limits of a measure are reached, discriminating between the functioning of individuals within the upper range is difficult. This entry focuses on the impact of ceiling effects on the interpretation of research results, especially the results of standardized tests.
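A minimal numerical sketch of the phenomenon, with invented scores: a test whose maximum score is 100 cannot separate examinees whose true performance exceeds that ceiling, so they pile up at the top and become indistinguishable.

```python
# Illustrative sketch (hypothetical numbers): capping scores at the test
# maximum discards all information above the ceiling.
true_scores = [62, 75, 88, 96, 104, 113, 127]   # hypothetical true levels
MAX_SCORE = 100

observed = [min(s, MAX_SCORE) for s in true_scores]   # the ceiling effect

at_ceiling = sum(1 for s in observed if s == MAX_SCORE)
print(observed)     # [62, 75, 88, 96, 100, 100, 100]
print(at_ceiling)   # 3 examinees are indistinguishable at the top
```

The three strongest examinees receive identical observed scores even though their underlying levels differ substantially, which is exactly the loss of discrimination described above.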
Interpretation of Research Results

When a ceiling effect occurs, the interpretation of the results attained is impacted. For example, a health survey may include a range of questions that focus on the low to moderate end of physical functioning (e.g., the individual is able to walk up a flight of stairs without difficulty) versus a range of questions that focus on higher levels of physical functioning (e.g., the individual is able to walk at a brisk pace for 1 mile without difficulty). Questions within the range of low to moderate physical functioning provide valid items for individuals on that end of the physical functioning spectrum rather than for those on the higher end. Therefore, if an instrument geared toward low to moderate physical functioning is administered to individuals with physical health on the upper end of the spectrum, a ceiling effect will likely be reached in a large portion of the cases, and interpretation of their ultimate physical functioning would be limited.

A ceiling effect can be present within the results of a research study. For example, a researcher may administer the health survey described in the previous paragraph to a treatment group in order to measure the impact of a treatment on overall physical health. If the treatment group represents the general population, the results may show a large portion of the treatment group to have benefited from the treatment because they have scored high on the measure. However, this high score may signify the presence of a ceiling effect, which calls for caution when one is interpreting the significance of the positive results. If a ceiling effect is suspected, an alternative would be to use another measure that provides items that target better physical functioning. This would allow participants to demonstrate a larger degree of differentiation in physical functioning and provide a measure that is more sensitive to change or growth from the treatment.

Standardized Tests

The impact of the ceiling effect is important when interpreting standardized test results. Standardized tests have higher rates of score reliability because the tests have been administered to a large sampling of the population. The large sampling provides a scale of standard scores that reliably and validly indicates how close a person who takes the same test performs compared with the mean performance on the original sampling. Standardized tests include aptitude tests such as the Wechsler Intelligence Scales, the Stanford-Binet Scales, and the Scholastic Aptitude Test, and achievement tests such as the Woodcock-Johnson Tests of Achievement and the Iowa Test of Basic Skills.

When individuals score at the upper end of a standardized test, especially 3 standard deviations above the mean, the ceiling effect is a factor. It is reasonable to conclude that such a person has exceptional abilities compared with the average person within the sampled population, but the high score is not necessarily a highly reliable measure of the person's true ability. What may have been measured is more a reflection of the test than of the person's true ability. In order to attain an indication of the person's true ability when the ceiling has been reached, an additional measure with an increased range of difficult items would be appropriate to administer. If such a test is not available, the test performance would be interpreted with these limitations stated.

The ceiling effect should also be considered when one is administering a standardized test to an individual who is at the top end of the age range for a test and who has elevated skills. In this situation, the likelihood of a ceiling effect is high. Therefore, if a test administrator is able to use a test that places the individual on the lower age end of a similar or companion test, the chance of a ceiling effect would most likely be eliminated. For example, the Wechsler Intelligence Scales have separate measures to allow for measurement of young children, children and adolescents, and older adolescents and adults. The upper end and lower end of the measures overlap, meaning that a 6- to 7-year-old could be administered the Wechsler Preschool and Primary Scale of Intelligence or the Wechsler Intelligence Scale for Children. In the event a 6-year-old is cognitively advanced, the Wechsler Intelligence Scale for Children would be a better choice in order to avoid a ceiling effect.

It is important for test developers to monitor ceiling effects on standardized instruments as they are used in the public. If a ceiling effect is noticed within areas of a test over time, those elements of the measure should be improved to provide better discrimination for high performers.
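One check of the kind implied here is how many examinees should reach the top of the scale at all: under a normal curve, only about 0.13% of scores fall more than 3 standard deviations above the mean. The sketch below computes that share from the standard normal CDF, built from math.erf; the 3-SD cutoff mirrors the one mentioned above.

```python
# Share of a normally distributed population scoring more than 3 standard
# deviations above the mean, via the standard normal CDF.
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

share_above_3sd = 1.0 - normal_cdf(3.0)
print(round(share_above_3sd, 5))  # about 0.00135, i.e., roughly 0.13%
```

If markedly more examinees than this hit the top of the scale, the excess is a sign that the instrument's ceiling, rather than the examinees' ability distribution, is shaping the scores.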
The rate of individuals scoring at the upper end of a measure should coincide with the standard scores and percentiles on the normal curve.

Tish Holub Taylor

See also Instrumentation; Standardized Score; Survey; Validity of Measurement

Further Readings

Austin, P. C., & Bruner, L. J. (2003). Type I error inflation in the presence of a ceiling effect. American Statistician, 57(2), 97–105.
Gunst, R. F., & Barry, T. E. (2003). One way to moderate ceiling effects. Quality Progress, 36(10), 84–86.
Kaplan, C. (1992). Ceiling effects in assessing high-IQ children with the WPPSI-R. Journal of Clinical Child Psychology, 21(4), 403–406.
Rifkin, B. (2005). A ceiling effect in traditional classroom foreign language instruction: Data from Russian. Modern Language Journal, 89(1), 3–18.
Sattler, J. M. (2001). Assessment of children: Cognitive applications (4th ed.). San Diego, CA: Jerome M. Sattler.
Taylor, R. L. (1997). Assessment of exceptional students: Educational and psychological procedures (4th ed.). Boston: Allyn and Bacon.
Uttl, B. (2005). Measurement of individual differences: Lessons from memory assessment in research and clinical practice. Psychological Science, 16(6), 460–467.

CENTRAL LIMIT THEOREM

The central limit theorem (CLT) is, along with the theorems known as laws of large numbers, the cornerstone of probability theory. In simple terms, the theorem describes the distribution of the sum of a large number of random numbers, all drawn independently from the same probability distribution. It predicts that, regardless of this distribution, as long as it has finite variance, the sum follows a precise law, or distribution, known as the normal distribution.

Let us describe the normal distribution with mean μ and variance σ². It is defined through its density function,

f(x) = (1/√(2πσ²)) e^(−(x − μ)²/(2σ²)),

where the variable x ranges from −∞ to +∞. This means that if a random variable follows this distribution, then the probability that it is larger than a and smaller than b is equal to the integral of the function f(x) (the area under the graph of the function) from x = a to x = b. The normal density is also known as the Gaussian density, named for Carl Friedrich Gauss, who used this function to describe astronomical data. If we put μ = 0 and σ² = 1 in the above formula, then we obtain the so-called standard normal density.

In precise mathematical language, the CLT states the following: Suppose that X_1, X_2, . . . are independent random variables with the same distribution, having mean μ and variance σ² but being otherwise arbitrary. Let S_n = X_1 + ⋯ + X_n be their sum. Then

P(a < (S_n − nμ)/√(nσ²) < b) → (1/√(2π)) ∫_a^b e^(−t²/2) dt,  as n → ∞.

It is more appropriate to define the standard normal density as the density of a random variable ζ with zero mean and variance 1 with the property that, for every a and b, there is a c such that if ζ_1, ζ_2 are independent copies of ζ, then aζ_1 + bζ_2 is a copy of cζ. It follows that a² + b² = c² holds and that there is only one choice for the density of ζ, namely, the standard normal density.

As an example, consider tossing a fair coin n = 1,000 times and determining the probability that at most 450 heads are obtained. The CLT can be used to give a good approximation of this probability. Indeed, if we let X_i be a random variable that takes value 1 if heads show up at the ith toss or value 0 if tails show up, then we see that the assumptions of the CLT are satisfied because the random variables have the same mean μ = 1/2 and variance σ² = 1/4. On the other hand, S_n = X_1 + ⋯ + X_n is the number of heads. Since S_n ≤ 450 if and only if (S_n − nμ)/√(nσ²) ≤ (450 − 500)/√250 = −3.162, we find, by the CLT, that the probability that we get at most 450 heads equals the integral of the standard normal density from −∞ to −3.162. This integral can be computed with the help of a computer (or tables in olden times) and found to be about 0.00078, which is a reasonable approximation.
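The coin-toss computation above can be reproduced directly, using the standard normal CDF built from math.erf in place of tables:

```python
# P(at most 450 heads in 1,000 fair-coin tosses), approximated via the CLT.
from math import erf, sqrt

n = 1000
mu, var = 0.5, 0.25                   # mean and variance of one fair-coin toss
z = (450 - n * mu) / sqrt(n * var)    # (450 - 500) / sqrt(250) = -3.162

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

approx = normal_cdf(z)
print(round(approx, 5))  # about 0.00078, matching the value in the text
```

The same few lines, with 450 replaced by 430, give the tail probability behind the hypothesis-testing remark that follows.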
Incidentally, this kind of reasoning leads to so-called statistical hypothesis testing: If we toss a coin and see 430 heads and 570 tails, then we should be suspicious that the coin is not fair.

Origins

The origins of the CLT can be traced to a paper by Abraham de Moivre (1733), who described the CLT for symmetric Bernoulli trials; that is, in tossing a fair coin n times, the number S_n of heads has a distribution that is approximately that of n/2 + (√n/2)ζ. The result fell into obscurity but was revived in 1812 by Pierre-Simon Laplace, who proved and generalized de Moivre's result to asymmetric Bernoulli trials (weighted coins). Nowadays, this particular case is known as the de Moivre–Laplace CLT and is usually proved using Stirling's approximation for the product n! of the first n positive integers: n! ≈ n^n e^(−n) √(2πn). The factor √(2π) here is the same as the one appearing in the normal density. Two 19th-century Russian mathematicians, Pafnuty Chebyshev and A. A. Markov, generalized the CLT of their French predecessors and proved it using the method of moments.

The Modern CLT

The modern formulation of the CLT gets rid of the assumption that the summands be identically distributed random variables. Its most general and useful form is that of the Finnish mathematician J. W. Lindeberg (1922). It is stated for triangular arrays, that is, random variables X_{n,1}, X_{n,2}, . . . , X_{n,k_n}, depending on two indices that, for fixed n, are independent with respect to the second index. Letting S_n = X_{n,1} + ⋯ + X_{n,k_n} be the sum with respect to the second index, we have that S_n minus its mean, ES_n, divided by its standard deviation, has a distribution that, as n tends to ∞, is standard normal, provided an asymptotic negligibility condition for the variances holds. A version of Lindeberg's theorem was formulated by A. Liapounoff in 1901. It is worth pointing out that Liapounoff introduced a new proof technique based on characteristic functions (also known as Fourier transforms), whereas Lindeberg's technique was an ingenious step-by-step replacement of the general summands by normal ones. Liapounoff's theorem has a condition that is more restrictive than that of Lindeberg but is quite good in practice. In 1935, Paul Lévy and William Feller established necessary conditions for the validity of the CLT. In 1951, H. F. Trotter gave an elementary analytical proof of the CLT.

The Functional CLT

The functional CLT is stated for summands that take values in a multidimensional space (one talks of random vectors in a Euclidean space) or even an infinite-dimensional space. A very important instance of the functional CLT concerns convergence to Brownian motion, which also provides a means of defining this most fundamental object of modern probability theory. Suppose that S_n is as stated in the beginning of this entry. Define a function s_n(t) of "time" as follows: At time t = k/n, where k is a positive integer, let s_n(t) have value (S_k − μk)/(σ√n); now join the points [k/n, s_n(k/n)] and [(k + 1)/n, s_n((k + 1)/n)] by a straight line segment, for each value of k, to obtain the graph of a continuous random function s_n(t). Donsker's theorem states that the probability distribution of the random function s_n(t) converges (in a certain sense) to the distribution of a random function, or stochastic process, which can be defined to be the standard Brownian motion.

Other Versions and Consequences of the CLT

Versions of the CLT for dependent random variables also exist and are very useful in practice when independence is either violated or not possible to establish. Such versions exist for Markov processes, for regenerative processes, and for martingales, as well as others.

The work on the CLT gave rise to the general area known as weak convergence (of stochastic processes), the origin of which is in Yuri Prokhorov (1956) and Lucien Le Cam (1957). Nowadays, every result of the CLT can (and should) be seen in the light of this general theory.

Finally, it should be mentioned that all flavors of the CLT rely heavily on the assumption that the probability that the summands be very large is small, which is often cast in terms of the finiteness of a moment such as the variance. In a variety of applications, heavy-tailed random variables do not satisfy this assumption. The study of limit theorems of sums of independent heavy-tailed
25 + 20 + 30 = 75.

Finally, divide the total summed scores by the total number of students:

5,675/75 = 75.67

This is the weighted mean for the test scores of the three classes taught by Carla.

Median

The median is the second measure of central tendency. It is defined as the score that cuts the distribution exactly in half. Much as the mean can be described as the balance point, where the values on each side are identical, the median is the point where the number of scores on each side is equal. As such, the median is influenced more by the number of scores in the distribution than by the values of the scores in the distribution. The median is also the same as the 50th percentile of any distribution. Generally the median is not abbreviated or symbolized, but occasionally Mdn is used.

The median is simple to identify. The method used to calculate the median is the same for both samples and populations, and it requires only two steps. In the first step, order the numbers in the sample from lowest to highest. So if one were to use the homework scores from the mean example, 10, 8, 5, 7, and 9, one would first order them 5, 7, 8, 9, 10. In the second step, find the middle score. In this case, there is an odd number of scores, and the score in the middle is 8.

5 7 8 9 10

Notice that the median that was calculated is not the same as the mean for the same sample of homework assignments. This is because of the median's sensitivity to the number of scores in the distribution rather than to their values.

5 7 8 | 8 9 10

In this example, the median falls between two identical scores, so one can still say that the median is 8. If the two middle numbers were different, one would find the middle number between the two numbers. For example, if one increased one of the student's homework scores from an 8 to a 9,

5 7 8 | 9 9 10

In this case, the middle falls halfway between 8 and 9, at a score of 8.5.

Statisticians disagree over the correct method for calculating the median when the distribution has multiple repeated scores in the center of the distribution. Some statisticians use the methods described above to find the median, whereas others believe the scores in the middle need to be reduced to fractions to find the exact midpoint of the distribution. So in a distribution with the following scores,

2 3 3 4 5 5 5 5 5 6,

some statisticians would say the median is 5, whereas others (using the fraction method) would report the median as 4.7.
Central Tendency, Measures of 141
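The two-step procedure can be sketched in a few lines of Python (an illustrative sketch, using the simple midpoint convention rather than the fraction method):

```python
def median(scores):
    """Step 1: order the scores. Step 2: take the middle one (odd n)
    or the halfway point between the two middle scores (even n)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([10, 8, 5, 7, 9]))     # odd n: the middle score, 8
print(median([5, 7, 8, 8, 9, 10]))  # even n, identical middle scores: 8.0
print(median([5, 7, 8, 9, 9, 10]))  # even n: halfway between 8 and 9, 8.5
```

Note that this sketch reports 5 (not 4.7) for the repeated-scores example above, since it does not reduce the middle scores to fractions.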
Mode

The mode is the last measure of central tendency. It is the value that occurs most frequently. It is the simplest and least precise measure of central tendency. Generally, in writing about the mode, scholars label it simply mode, although some books or papers use Mo as an abbreviation. The method for finding the mode is the same for both samples and populations. Although there are several ways one could find the mode, a simple method is to list each score that appears in the sample. The score that appears the most often is the mode. For example, given the following sample of numbers,

3, 4, 6, 2, 7, 4, 5, 3, 4, 7, 4, 2, 6, 4, 3, 5,

one could arrange them in numerical order:

2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7.

Once the numbers are arranged, it becomes apparent that the most frequently appearing number is 4:

2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7.

Thus 4 is the mode of that sample of numbers. Unlike the mean and the median, it is possible to have more than one mode. If one were to add two threes to the sample,

2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7,

then both 3 and 4 would be the most commonly occurring numbers, and the modes of the sample would be 3 and 4. The term used to describe a sample with two modes is bimodal. If there are more than two modes in a sample, one says the sample is multimodal.
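The counting method described above maps directly onto code (a sketch, not part of the entry); collecting every value tied for the highest frequency also handles the bimodal case:

```python
from collections import Counter

def modes(scores):
    """List each score with its frequency, then return all scores
    tied for the highest frequency (one mode, or several)."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(value for value, freq in counts.items() if freq == top)

sample = [3, 4, 6, 2, 7, 4, 5, 3, 4, 7, 4, 2, 6, 4, 3, 5]
print(modes(sample))           # [4]: the single mode
print(modes(sample + [3, 3]))  # [3, 4]: adding two threes makes it bimodal
```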
When to Use Each Measure

Because each measure of central tendency is calculated with a different method, each measure is different in its precision of measuring the middle, as well as in which numbers it is best suited for.

Mean

The mean is often used as the default measure of central tendency. As most people understand the concept of average, they tend to use the mean whenever a measure of central tendency is needed, including times when it is not appropriate to use the mean. Many statisticians would argue that the mean should not be calculated for numbers that are measured at the nominal or ordinal levels of measurement, due to difficulty in the interpretation of the results. The mean is also used in other statistical analyses, such as calculations of standard deviation. If the numbers in the sample are fairly normally distributed, and there is no specific reason that one would want to use a different measure of central tendency, then the mean should be the best measure to use.

Median

There are several reasons to use the median instead of the mean. Many statisticians believe that it is inappropriate to use the mean to measure central tendency if the distribution was measured at the ordinal level. Because variables measured at the ordinal level contain information about direction but not distance, and because the mean is measured in terms of distance, using the mean to calculate central tendency would provide information that is difficult to interpret.

Another occasion to use the median is when the distribution contains an outlier. An outlier is a value that is very different from the other values. Outliers tend to be located at the far extreme of the distribution, either high or low. As the mean is so sensitive to the value of the scores, using the mean as a measure of central tendency in a distribution with an outlier would result in a nonrepresentative score. For example, looking again at the five homework assignment scores, if one were to replace the nine with a score of 30,

5 7 8 9 10 becomes 5 7 8 30 10
5 + 7 + 8 + 30 + 10 = 60
60 / 5 = 12.

By replacing just one value with an outlier, the newly calculated mean is not a good representation of the average values of our distribution. The same would occur if one replaced the score of 9 with a very low number:

5 7 8 9 10 becomes 5 7 8 −4 10
5 + 7 + 8 + (−4) + 10 = 26
26 / 5 = 5.2.

Since the mean is so sensitive to outliers, it is best to use the median for calculating central tendency in such cases. Examining the previous example, but using the median,

5 7 8 −4 10
−4 5 7 8 10.
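The sensitivity of the mean, and the stability of the median, under a single outlier can be checked directly (an illustrative sketch; the low replacement value is taken as −4 so that the totals 26 and 5.2 in the entry work out):

```python
def mean(scores):
    return sum(scores) / len(scores)

def median(scores):
    ordered = sorted(scores)
    n = len(ordered)
    return ordered[n // 2] if n % 2 else (ordered[n // 2 - 1] + ordered[n // 2]) / 2

scores = [5, 7, 8, 9, 10]
with_high_outlier = [5, 7, 8, 30, 10]  # replace the 9 with 30
with_low_outlier = [5, 7, 8, -4, 10]   # replace the 9 with -4

print(mean(scores), median(scores))                        # 7.8 8
print(mean(with_high_outlier), median(with_high_outlier))  # 12.0 8
print(mean(with_low_outlier), median(with_low_outlier))    # 5.2 7
```

In both outlier cases the mean swings away from the bulk of the scores while the median barely moves.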
The middle number is 7; therefore the median is 7, a much more representative number than the mean of that sample, 5.2.

If the numbers in the distribution were measured on an item that had an open-ended option, one should use the median as the measure of central tendency. For example, a question that asks for demographic information such as age or salary may include either a lower or upper category that is open ended:

Number of Cars Owned    Frequency
0                        4
1                       10
2                       16
3 or more                3

The last answer option is open ended because an individual with three cars would be in the same category as someone who owned 50 cars. As such, it is impossible to accurately calculate the mean number of cars owned. However, it is possible to calculate the median response. For the above example, the median would be 2 cars owned.

A final condition in which one should use the median instead of the mean is when one has incomplete information. If one were collecting survey information, and some of the participants refused or forgot to answer a question, one would have responses from some participants but not others. It would not be possible to calculate the mean in this instance, as one is missing important information that could change the mean that would be calculated if the missing information were known.

Mode

The mode is well suited to measuring the central tendency of variables measured at the nominal level. Because variables measured at the nominal level are given labels, any number assigned to these variables does not measure quantity. As such, it would be inappropriate to use the mean or the median with variables measured at this level, as both of those measures of central tendency require calculations involving quantity. The mode should also be used when finding the middle of the distribution of a discrete variable. As these variables exist only in whole numbers, using other methods of central tendency may result in fractions of numbers, making it difficult to interpret the results. A common example of this is the saying, "The average family has a mom, a dad, and 2.5 kids." People are discrete variables, and as such, they should never be measured in such a way as to obtain decimal results. The mode can also be used to provide additional information, along with other calculations of central tendency. Information about the location of the mode compared with the mean can help determine whether the distribution they are both calculated from is skewed.

Carol A. Carman

See also Descriptive Statistics; Levels of Measurement; Mean; Median; Mode; "On the Theory of Scales of Measurement"; Results Section; Sensitivity; Standard Deviation; Variability, Measure of

Further Readings

Coladarci, T., Cobb, C. D., Minium, E. W., & Clarke, R. C. (2004). Fundamentals of statistical reasoning in education. Hoboken, NJ: Wiley.

Gravetter, F. J., & Wallnau, L. B. (2004). Statistics for the behavioral sciences (6th ed.). Belmont, CA: Thomson Wadsworth.

Salkind, N. J. (2008). Statistics for people who (think they) hate statistics. Thousand Oaks, CA: Sage.

Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Pearson Education.

CHANGE SCORES

The measurement of change is fundamental in the social and behavioral sciences. Many researchers have used change scores to measure gain in ability or shift in attitude over time, or difference scores between two variables to measure a construct (e.g., self-concept vs. ideal self). This entry introduces the estimation of change scores, its assumptions and applications, and at the end offers a recommendation on the use of change scores.

Let Y and X stand for the measures obtained by applying the same test to the subjects on two occasions. The observed change or difference score is D = Y − X. The true change is DT = YT − XT, where YT and XT represent the subject's true status at these times.
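The definition D = Y − X can be sketched directly (an illustrative sketch; the pretest and posttest scores below are hypothetical, chosen only to be on the same numerical scale):

```python
# Observed change scores D = Y - X for each subject, where X is the pretest
# measure and Y the posttest measure on the same numerical scale.
pretest = [52, 61, 47, 70, 58]   # X (hypothetical scores)
posttest = [58, 65, 50, 69, 66]  # Y (hypothetical scores)

change = [y - x for x, y in zip(pretest, posttest)]
print(change)  # [6, 4, 3, -1, 8]
```

Each entry of `change` is one subject's observed change score; as the entry notes below, these raw differences carry the measurement error of both observed scores.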
The development of measuring the true change DT follows two paths, one using the change score and the other using the residual change score.

The only assumption needed to calculate a change score is that Y (e.g., posttest scores) and X (e.g., pretest scores) should be on the same numerical scale; that is, the scores on the posttest are comparable to the scores on the pretest. This sole requirement does not suggest that pretest and posttest measure the same construct. Thus, such change scores can be extended to any kind of difference score between two measures (measuring the same construct or not) that are on the same numerical scale. The two measures are linked when, for example, the two scores are obtained from a single test or two observations are made by the same observer. The correlation between linked observations (e.g., two observations made by the same observer) will be higher than that between independent observations (e.g., two observations made by different observers). Such linkage must be considered in defining the reliability coefficient for difference scores. The reliability of change or difference scores is defined as the correlation of the scores with independently observed difference scores. The reliability for change scores produced by comparing two independent measures will most likely be smaller than that for the linked case.

Raw change or difference scores are computed from two observed measures (D = Y − X). Observed scores are systematically related to random error of measurement and thus unreliable. Conclusions based on these scores tend to be fallacious.

The true change score is measured as the difference between the person's true status at posttest and pretest times, DT = YT − XT. The key is to remove the measurement error from the two observed measures. There are different ways to correct the errors in the two raw measures used to obtain raw gain scores. The first way is to correct the error in pretest scores using the reliability coefficient of X and simple regression. The second way is to correct errors in both pretest and posttest scores using the reliability coefficients of both X and Y and simple regression. The third way is the Lord procedure. With this procedure, the estimates of YT and XT are obtained by the use of a multiple regression procedure that incorporates the reliability of a measure (e.g., X) and information that can be borrowed from the other measure (e.g., Y). The estimator obtained with the Lord procedure is better than those from the previous two ways and the raw change score in that it yields a smaller mean square of deviation between the estimate and the true change (D̂T − DT).

Lee Cronbach and Lita Furby, skeptical of change scores, proposed a better estimate of true change by incorporating into the regression two additional categories of variables besides the pretest and posttest measures used in Lord's procedure. The two categories of variables are Time 1 measures W (e.g., experience prior to the treatment, demographic variables, different treatment group membership) and Time 2 measures Z (e.g., a follow-up test a month after the treatment). W and X need not be simultaneous, nor do Z and Y. Note that W and Z can be multivariate. Additional information introduced by W and Z improves the estimation of true change, with smaller mean squares of (D̂T − DT), if the sample size is large enough for the weights of the information related to W and Z to be accurately estimated.

The other approach to estimating the true change DT is to use the residual change score. The development of the residual change score estimate is similar to that of the change score estimate. The raw residual-gain score is obtained by regressing the observed posttest measure Y on the observed pretest measure X. In this way, the portion of the posttest score that can be predicted linearly from pretest scores is removed from the posttest score. Compared with the change score, residualized change is not a more correct measure, in that it might remove some important and genuine change in the subject. Rather, the residualized score helps identify individuals who changed more (or less) than expected. The true residual-gain score is defined as the expected value of the raw residual-gain score over many observations on the same person. It is the residual obtained for a subject in the population linear regression of true final status on true initial status. The true residual gain is obtained by extending Lord's multiple-regression approach and can be further improved by including W or Z information.
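The raw residual-gain score described above (the residual from regressing observed posttest on observed pretest) can be sketched with ordinary least squares; the data here are hypothetical:

```python
def residual_gain(x, y):
    """Raw residual-gain scores: residuals from the least squares
    regression of posttest y on pretest x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Ordinary least squares slope and intercept for y = a + b * x.
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    # Residual: the part of the posttest not linearly predictable from the pretest.
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

pretest = [52, 61, 47, 70, 58]   # hypothetical X
posttest = [58, 65, 50, 69, 66]  # hypothetical Y
residuals = residual_gain(pretest, posttest)
print([round(r, 2) for r in residuals])  # positive = changed more than expected
```

By construction the residuals sum to zero, so a positive residual flags a subject who gained more than the pretest would predict, which is exactly the use the entry assigns to residualized scores.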
In spite of all the developments in the estimation of change or residual change scores, researchers are often warned not to use these change scores. Cronbach and Furby summarized four diverse research issues in the measurement-of-change literature that may appear to require change scores: (1) providing a measure of individual change, (2) investigating correlates or predictors of change (or change rate), (3) identifying individuals with the slowest change rate for further special treatment, and (4) providing an indicator of a construct that can serve as an independent, dependent, or covariate variable. However, for most of these questions, estimation of change scores is unnecessary and is at best inferior to other methods of analysis. One example is one-group studies of the pretest–treatment–posttest form. The least squares estimate of the mean true gain is simply the difference between mean observed pretest and posttest scores. Hypothesis testing and estimation related to treatment effect should be addressed directly to observed sample means. This suggests that the analysis of change does not need estimates of true change scores. It is more appropriate to directly use raw score mean vectors, covariance matrices, and estimated reliabilities. Cronbach and Furby recommended that, when investigators ask questions in which change scores appear to be the natural measure to be obtained, the researchers should frame their questions in other ways.

Feifei Ye

See also Gain Scores, Analysis of; Longitudinal Design; Pretest–Posttest Design

Further Readings

Chan, D. (2003). Data analysis and modeling longitudinal processes. Group & Organization Management, 28, 341–365.

Cronbach, L. J., & Furby, L. (1970). How should we measure "change"—or should we? Psychological Bulletin, 74(1), 68–80.

Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 47(1), 121–150.

CHI-SQUARE TEST

The chi-square test is a nonparametric test of the statistical significance of a relation between two nominal or ordinal variables. Because a chi-square test analyzes grosser data than do parametric tests such as t tests and analyses of variance (ANOVAs), the chi-square test can report only whether groups in a sample are significantly different in some measured attribute or behavior; it does not allow one to generalize from the sample to the population from which it was drawn. Nonetheless, because chi-square is less "demanding" about the data it will accept, it can be used in a wide variety of research contexts. This entry focuses on the application, requirements, computation, and interpretation of the chi-square test, along with its role in determining associations among variables.

Bivariate Tabular Analysis

Though one can apply the chi-square test to a single variable and judge whether the frequencies for each category are equal (or as expected), a chi-square is applied most commonly to frequency results reported in bivariate tables, and interpreting bivariate tables is crucial to interpreting the results of a chi-square test. Bivariate tabular analysis (sometimes called crossbreak analysis) is used to understand the relationship (if any) between two variables. For example, if a researcher wanted to know whether there is a relationship between the gender of U.S. undergraduates at a particular university and their footwear preferences, he or she might ask male and female students (selected as randomly as possible), "On average, do you prefer to wear sandals, sneakers, leather shoes, boots, or something else?" In this example, the independent variable is gender and the dependent variable is footwear preference. The independent variable is the quality or characteristic that the researcher hypothesizes helps to predict or explain some other characteristic or behavior (the dependent variable). Researchers control the independent variable (in this example, by sampling males and females) and elicit and measure the dependent variable to test their hypothesis that there is some relationship between the two variables.

To see whether there is a systematic relationship between the gender of undergraduates at University X and reported footwear preferences, the results could be summarized in a table as shown in Table 1. Each cell in a bivariate table represents the intersection of a value on the independent variable and a value on the dependent variable by showing how many times that combination of values was observed in the sample being analyzed.
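Tallying raw observations into such a bivariate (crossbreak) table can be sketched as follows; the individual responses are hypothetical, constructed only to match the cell frequencies reported in Table 1:

```python
from collections import Counter

# Hypothetical raw data: one (gender, footwear) pair per respondent,
# built to reproduce the counts in Table 1.
table1 = {
    ("Male", "Sandals"): 6, ("Male", "Sneakers"): 17,
    ("Male", "Leather shoes"): 13, ("Male", "Boots"): 9, ("Male", "Other"): 5,
    ("Female", "Sandals"): 13, ("Female", "Sneakers"): 5,
    ("Female", "Leather shoes"): 7, ("Female", "Boots"): 16, ("Female", "Other"): 9,
}
observations = [cell for cell, count in table1.items() for _ in range(count)]

# Cross-tabulate: each cell counts one combination of values on the
# independent variable (gender) and the dependent variable (footwear).
crosstab = Counter(observations)
print(crosstab[("Male", "Sneakers")])  # 17
print(sum(crosstab.values()))          # 100 respondents in total
```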
Table 1  Male and Female Undergraduate Footwear Preferences at University X (Raw Frequencies)

Group    Sandals  Sneakers  Leather Shoes  Boots  Other
Male        6        17          13          9      5
Female     13         5           7         16      9

Typically, in constructing bivariate tables, values on the independent variable are arrayed on the vertical axis, while values on the dependent variable are arrayed on the horizontal axis. This allows one to read "across," from values on the independent variable to values on the dependent variable. (Remember, an observed relationship between two variables is not necessarily causal.)

Reporting and interpreting bivariate tables is most easily done by converting raw frequencies (in each cell) into percentages of each cell within the categories of the independent variable. Percentages basically standardize cell frequencies as if there were 100 subjects or observations in each category of the independent variable. This is useful for comparing across values on the independent variable if the raw row totals are close to or more than 100, but increasingly dangerous as raw row totals become smaller. (When reporting percentages, one should indicate the total N at the end of each row or independent variable category.)

Table 2  Male and Female Undergraduate Footwear Preferences at University X (Percentages)

Group    Sandals  Sneakers  Leather Shoes  Boots  Other    N
Male       12        34          26         18     10     50
Female     26        10          14         32     18     50

Table 2 shows that in this sample roughly twice as many women as men preferred sandals and boots, about 3 times more men than women preferred sneakers, and twice as many men as women preferred leather shoes. One might also infer from the "Other" category that female students in this sample had a broader range of footwear preferences than did male students.

Converting raw observed values or frequencies into percentages allows one to see patterns in the data more easily, but how confident can one be that those apparent patterns actually reflect a systematic relationship in the sample between the variables (between gender and footwear preference, in this example) and not just a random distribution?

Chi-Square Requirements

The chi-square test of statistical significance is a series of mathematical formulas that compare the actual observed frequencies of the two variables measured in a sample with the frequencies one would expect if there were no relationship at all between those variables. That is, chi-square assesses whether the actual results are different enough from the null hypothesis to overcome a certain probability that they are due to sampling error, randomness, or a combination of the two.

Because chi-square is a nonparametric test, it does not require the sample data to be at an interval level of measurement and more or less normally distributed (as parametric tests such as t tests do), although it does rely on a weak assumption that each variable's values are normally distributed in the population from which the sample is drawn. But chi-square, while forgiving, does have some requirements:

1. Chi-square is most appropriate for analyzing relationships among nominal and ordinal variables. A nominal variable (sometimes called a categorical variable) describes an attribute in terms of mutually exclusive, nonhierarchically related categories, such as gender and footwear preference. Ordinal variables measure an attribute (such as military rank) that subjects may have more or less of but that cannot be measured in equal increments on a scale. (Results from interval variables, such as scores on a test, would have to first be grouped before they could "fit" into a bivariate table and be analyzed with chi-square; this grouping loses much of the incremental information of the original scores, so interval data are usually analyzed using parametric tests such as ANOVAs and t tests. The relationship between two ordinal variables is usually best analyzed with a Spearman rank order correlation.)

2. The sample must be randomly drawn from the population.
3. Data must be reported in raw frequencies—not, for example, in percentages.

4. Measured variables must be independent of each other. Any observation must fall into only one category or value on each variable, and no category can be inherently dependent on or influenced by another.

5. Values and categories on independent and dependent variables must be mutually exclusive and exhaustive. In the footwear data, each subject is counted only once, as either male or female and as preferring sandals, sneakers, leather shoes, boots, or other kinds of footwear. For some variables, no "other" category may be needed, but often "other" ensures that the variable has been exhaustively categorized. (Some kinds of analysis may require an "uncodable" category.) In any case, the results for the whole sample must be included.

6. Expected (and observed) frequencies cannot be too small. Chi-square is based on the expectation that within any category, sample frequencies are normally distributed about the expected population value. Since frequencies of occurrence cannot be negative, the distribution cannot be normal when expected population values are close to zero (because the sample frequencies cannot be much below the expected frequency while they can be much above it). When expected frequencies are large, there is no problem with the assumption of normal distribution, but the smaller the expected frequencies, the less valid the results of the chi-square test. In addition, because some of the mathematical formulas in chi-square use division, no cell in a table can have an observed raw frequency of zero.

The following minimums should be obeyed:

For a 1 × 2 or 2 × 2 table, expected frequencies in each cell should be at least 5.

For larger tables (2 × 4 or 3 × 3 or larger), if all expected frequencies but one are at least 5 and if the one small cell is at least 1, chi-square is appropriate. In general, the greater the degrees of freedom (i.e., the more values or categories on the independent and dependent variables), the more lenient the minimum expected frequencies threshold.

Sometimes, when a researcher finds low expected frequencies in one or more cells, a possible solution is to combine, or collapse, two cells together. (Categories on a variable cannot be excluded from a chi-square analysis; a researcher cannot arbitrarily exclude some subset of the data from analysis.) But a decision to collapse categories should be carefully motivated, preserving the integrity of the data as it was originally collected.

Computing the Chi-Square Value

The process by which a chi-square value is computed has four steps:

1. Setting the p value. This sets the threshold of tolerance for error. That is, what odds is the researcher willing to accept that apparent patterns in the data may be due to randomness or sampling error rather than some systematic relationship between the measured variables? The answer depends largely on the research question and the consequences of being wrong. If people's lives depend on the interpretation of the results, the researcher might want to take only one chance in 100,000 (or 1,000,000) of making an erroneous claim. But if the stakes are smaller, he or she might accept a greater risk—1 in 100 or 1 in 20. To minimize any temptation for post hoc compromise of scientific standards, researchers should explicitly motivate their threshold before they perform any test of statistical significance. For the footwear study, we will set a probability of error threshold of 1 in 20, or p < .05.

2. Totaling all rows and columns. See Table 3.

Table 3  Male and Female Undergraduate Footwear Preferences at University X: Observed Frequencies With Row and Column Totals

Group    Sandals  Sneakers  Leather Shoes  Boots  Other  Total
Male        6        17          13          9      5      50
Female     13         5           7         16      9      50
Total      19        22          20         25     14     100

3. Deriving the expected frequency of each cell. Chi-square operates by comparing the observed frequencies in each cell in the table to the frequencies one would expect if there were no relationship at all between the two variables in the populations from which the sample is drawn (the null hypothesis).
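Step 3 (row total × column total, divided by the grand total) can be sketched for the footwear table:

```python
# Expected frequency of each cell under the null hypothesis:
# (row total * column total) / N, using the observed counts from Table 3.
observed = {
    "Male":   {"Sandals": 6, "Sneakers": 17, "Leather shoes": 13, "Boots": 9, "Other": 5},
    "Female": {"Sandals": 13, "Sneakers": 5, "Leather shoes": 7, "Boots": 16, "Other": 9},
}
row_totals = {group: sum(row.values()) for group, row in observed.items()}
col_totals = {col: sum(observed[g][col] for g in observed) for col in observed["Male"]}
n = sum(row_totals.values())

expected = {
    group: {col: row_totals[group] * col_totals[col] / n for col in col_totals}
    for group in observed
}
print(expected["Male"]["Sandals"])  # (50 * 19) / 100 = 9.5
```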
The null hypothesis—the "all other things being equal" scenario—is derived from the observed frequencies as follows: The expected frequency in each cell is the product of that cell's row total multiplied by that cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the "Males who prefer Sandals" cell, multiply the top row total (50) by the first column total (19) and divide that product by the sum total (100): (50 × 19)/100 = 9.5.

The logic of this is that we are deriving the expected frequency of each cell from the joint share of the total frequencies of the relevant values on each variable (in this case, Male and Sandals), as a proportion of all observed frequencies in the sample. This calculation produces the expected frequency of each cell, as shown in Table 4.

Table 4  Male and Female Undergraduate Footwear Preferences at University X: Observed and Expected Frequencies

Group              Sandals  Sneakers  Leather Shoes  Boots  Other  Total
Male: observed        6        17          13          9      5      50
      expected        9.5      11          10         12.5    7
Female: observed     13         5           7         16      9      50
      expected        9.5      11          10         12.5    7
Total                19        22          20         25     14     100

Now a comparison of the observed results with the results one would expect if the null hypothesis were true is possible. (Because the sample includes the same number of male and female subjects, the male and female expected scores are the same. This will not be the case with unbalanced samples.) This table can be informally analyzed, comparing observed and expected frequencies in each cell (e.g., males prefer sandals less than expected), across values on the independent variable (e.g., males prefer sneakers more than expected, females less than expected), or across values on the dependent variable (e.g., females prefer sandals and boots more than expected, but sneakers and shoes less than expected). But some way to measure how different the observed results are from the null hypothesis is needed.

4. Measuring the size of the difference between the observed and expected frequencies in each cell. To do this, calculate the difference between the observed and expected frequency in each cell, square that difference, and then divide the squared difference by the cell's expected frequency. The formula can be expressed as (O − E)²/E.

Squaring the difference ensures a positive number, so that one ends up with an absolute value of differences. (If one did not work with absolute values, the positive and negative differences across the entire table would always add up to 0.) Dividing the squared difference by the expected frequency scales each difference to the size of its expected frequency, so that the resulting measures of observed versus expected difference are comparable across all cells.

So, for example, the difference between observed and expected frequencies for the Male–Sandals preference is calculated as follows:

1. Observed (6) − Expected (9.5) = Difference (−3.5)
2. Difference (−3.5) squared = 12.25
3. Difference squared (12.25) / Expected (9.5) = 1.289

The sum of the products of this calculation for every cell is the total chi-square value for the table: 14.026.

Interpreting the Chi-Square Value

Now the researcher needs some criterion against which to measure the table's chi-square value in order to tell whether it is significant (relative to the p value that has been motivated). The researcher needs to know the probability of getting a chi-square value of a given minimum size even if the variables are not related at all in the sample. That is, the researcher needs to know how much larger than zero (the chi-square value of the null hypothesis) the table's chi-square value must be before the null hypothesis can be rejected with confidence. The probability depends in part on the complexity and sensitivity of the variables, which are reflected in the degrees of freedom of the table from which the chi-square value is derived.

A table's degrees of freedom (df) can be expressed by this formula: df = (r − 1)(c − 1).
148 Chi-Square Test
That is, a table's degrees of freedom equals the number of rows in the table minus 1, multiplied by the number of columns in the table minus 1. (For 1 × 2 tables, df = k − 1, where k = number of values or categories on the variable.) Different chi-square thresholds are set for relatively gross comparisons (1 × 2 or 2 × 2) versus finer comparisons. (For Table 1, df = (2 − 1)(5 − 1) = 4.)

In a statistics book, the sampling distribution of chi-square (also known as critical values of chi-square) is typically listed in an appendix. The researcher can read down the column representing his or her previously chosen probability of error threshold (e.g., p < .05) and across the row representing the degrees of freedom in his or her table. If the researcher's chi-square value is larger than the critical value in that cell, his or her data represent a statistically significant relationship between the variables in the table. (In statistics software programs, all the computations are done for the researcher, and he or she is given the exact p value of the results.)

Table 1's chi-square value of 14.026, with df = 4, is greater than the related critical value of 9.49 (at p = .05), so the null hypothesis can be rejected, and the claim that the male and female undergraduates at University X in this sample differ in their (self-reported) footwear preferences and that there is some relationship between gender and footwear preferences (in this sample) can be affirmed.

A statistically significant chi-square value indicates the degree of confidence a researcher may hold that the relationship between variables in the sample is systematic and not attributable to random error. It does not help the researcher interpret the nature or explanation of that relationship; that must be done by other means (including bivariate tabular analysis and qualitative analysis of the data).

Measures of Association

By itself, statistical significance does not ensure that the relationship is theoretically or practically important or even very large. A large enough sample may demonstrate a statistically significant relationship between two variables, but that relationship may be a trivially weak one. The relative strength of association of a statistically significant relationship between the variables in the data can be derived from a table's chi-square value.

For tables larger than 2 × 2 (such as Table 1), a measure called Cramer's V is derived by the following formula (where N = the total number of observations, and k = the smaller of the number of rows or columns):

Cramer's V = √(chi-square / (N(k − 1)))

So, for (2 × 5) Table 1, Cramer's V can be computed as follows:

1. N(k − 1) = 100(2 − 1) = 100
2. Chi-square/100 = 14.026/100 = 0.14
3. Square root of 0.14 = 0.37

The result is interpreted as a Pearson correlation coefficient (r). (For 2 × 2 tables, a measure called phi is derived by dividing the table's chi-square value by N (the total number of observations) and then taking the square root of the quotient. Phi is also interpreted as a Pearson r.)

Also, r² is a measure called shared variance. Shared variance is the portion of the total representation of the variables measured in the sample data that is accounted for by the relationship measured by the chi-square. For Table 1, r² = .137, so approximately 14% of the total footwear preference story is accounted for by gender.

A measure of association like Cramer's V is an important benchmark of just "how much" of the phenomenon under investigation has been accounted for. For example, Table 1's Cramer's V of 0.37 (r² = .137) means that there are one or more variables remaining that, cumulatively, account for at least 86% of footwear preferences. This measure, of course, does not begin to address the nature of the relation(s) among these variables, which is a crucial part of any adequate explanation or theory.

Jeff Connor-Linton

See also Correlation; Critical Value; Degrees of Freedom; Dependent Variable; Expected Value; Frequency Table; Hypothesis; Independent Variable; Nominal Scale; Nonparametric Statistics; Ordinal Scale; p Value; R2; Random Error; Random Sampling; Sampling Error; Variable; Yates's Correction
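The critical-value decision and the measures of association described in this entry can be sketched in Python. This is an illustrative sketch, not part of the original entry; the chi-square value of 14.026, N = 100, the 2 × 5 table shape, and the df = 4 critical value of 9.49 are taken from the entry's Table 1 example.

```python
import math

def cramers_v(chi_square, n, k):
    """Cramer's V: sqrt(chi-square / (N * (k - 1))),
    where k is the smaller of the number of rows or columns."""
    return math.sqrt(chi_square / (n * (k - 1)))

def phi(chi_square, n):
    """Phi for 2 x 2 tables: sqrt(chi-square / N)."""
    return math.sqrt(chi_square / n)

# Table 1 example from the entry: chi-square = 14.026, N = 100,
# rows = 2, columns = 5, so k = 2 and df = (2 - 1)(5 - 1) = 4.
chi_square, n, k = 14.026, 100, 2
critical_value = 9.49  # critical chi-square at p = .05 for df = 4

significant = chi_square > critical_value  # True, so reject the null
v = cramers_v(chi_square, n, k)            # about 0.37
shared_variance = v ** 2                   # about .14, as in the entry
```

Note that with k = 2, Cramer's V reduces to the same quantity as phi, which is why both are interpreted as a Pearson r.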
Further Readings

Berman, E. M. (2006). Essential statistics for public managers and policy analysts (2nd ed.). Washington, DC: CQ Press.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Rowley, MA: Newbury House.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Sokal, R. R., & Rohlf, F. J. (1995). Biometry: The principles and practice of statistics in biological research (3rd ed.). New York: W. H. Freeman.
Welkowitz, J., Ewen, R. B., & Cohen, J. (1982). Introductory statistics for the behavioral sciences (3rd ed.). New York: Academic Press.

CLASSICAL TEST THEORY

Measurement is the area of quantitative social science that is concerned with ascribing numbers to individuals in a meaningful way. Measurement is distinct from statistics, though measurement theories are grounded in applications of statistics. Within measurement there are several theories that allow us to talk about the quality of measurements taken. Classical test theory (CTT) can arguably be described as the first formalized theory of measurement and is still the most commonly used method of describing the characteristics of assessments.

With test theories, the term test or assessment is applied widely. Surveys, achievement tests, intelligence tests, psychological assessments, writing samples graded with rubrics, and innumerable other situations in which numbers are assigned to individuals can all be considered tests. The terms test, assessment, instrument, and measure are used interchangeably in this discussion of CTT. After a brief discussion of the early history of CTT, this entry provides a formal definition and discusses CTT's role in reliability and validity.

Early History

Most of the central concepts and techniques associated with CTT (though it was not called that at the time) were presented in papers by Charles Spearman in the early 1900s. One of the first texts to codify the emerging discipline of measurement and CTT was Harold Gulliksen's Theory of Mental Tests. Much of what Gulliksen presented in that text is used unchanged today. Other theories of measurement (generalizability theory, item response theory) have emerged that address some known weaknesses of CTT (e.g., homoscedasticity of error along the test score distribution). However, the comparative simplicity of CTT and its continued utility in the development and description of assessments have resulted in CTT's continued use. Even when other test theories are used, CTT often remains an essential part of the development process.

Formal Definition

CTT relies on a small set of assumptions. The implications of these assumptions build into the useful CTT paradigm. The fundamental assumption of CTT is found in the equation

X = T + E,  (1)

where X represents an observed score, T represents true score, and E represents error of measurement.

The concept of the true score, T, is often misunderstood. A true score, as defined in CTT, does not have any direct connection to the construct that the test is intended to measure. Instead, the true score represents the number that is the expected value for an individual based on this specific test. Imagine that a test taker took a 100-item test on world history. This test taker would get a score, perhaps an 85. If the test was a multiple-choice test, probably some of those 85 points were obtained through guessing. If given the test again, the test taker might guess better (perhaps obtaining an 87) or worse (perhaps obtaining an 83). The causes of differences in observed scores are not limited to guessing. They can include anything that might affect performance: a test taker's state of being (e.g., being sick), a distraction in the testing environment (e.g., a humming air conditioner), or careless mistakes (e.g., misreading an essay prompt).

Note that the true score is theoretical. It can never be observed directly. Formally, the true score is assumed to be the expected value (i.e., average) of X, the observed score, over an infinite number of independent administrations.
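A small simulation can make the true-score idea concrete. This is an illustrative sketch, not part of the original entry: the 100-item test and the true score of 85 echo the entry's example, but the normal error distribution and its spread are arbitrary assumptions made for the demonstration.

```python
import random
import statistics

random.seed(42)  # fixed seed for a reproducible illustration

TRUE_SCORE = 85  # T: the expected score for this test taker on this test

def administer_once():
    """One hypothetical independent administration: the observed score X
    equals the true score T plus random error E (luck in guessing,
    distractions, careless mistakes).  The error model is assumed."""
    error = random.gauss(0, 2)
    return TRUE_SCORE + error

# Averaged over many (hypothetically independent) administrations,
# the observed scores converge on the true score.
observed = [administer_once() for _ in range(100_000)]
print(round(statistics.mean(observed), 1))  # close to 85
```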
Even if an examinee could be given the same test numerous times, the administrations would not be independent. There would be practice effects, or the test taker might learn more between administrations.

Once the true score is understood to be an average of observed scores, the rest of Equation 1 is straightforward. A scored performance, X, can be thought of as deviating from the true score, T, by some amount of error, E. There are additional assumptions related to CTT: T and E are uncorrelated (ρ_TE = 0), error on one test is uncorrelated with the error on another test (ρ_E1E2 = 0), and error on one test is uncorrelated with the true score on another test (ρ_E1T2 = 0). These assumptions will not be elaborated on here, but they are used in the derivation of concepts to be discussed.

Reliability

If E tends to be large in relation to T, then a test result is inconsistent. If E tends to be small in relation to T, then a test result is consistent. The major contribution of CTT is the formalization of this concept of consistency of test scores. In CTT, test score consistency is called reliability. Reliability provides a framework for thinking about and quantifying the consistency of a test. Even though reliability does not directly address constructs, it is still fundamental to measurement. The often-used bathroom scale example makes the point. Does a bathroom scale accurately reflect weight? Before establishing the scale's accuracy, its consistency can be checked. If one were to step on the scale and it said 190 pounds, then step on it again and it said 160 pounds, the scale's lack of utility could be determined based strictly on its inconsistency.

Now think about a formal assessment. Imagine an adolescent was given a graduation exit exam and she failed, but she was given the same exam again a day later and she passed. She did not learn in one night everything she needed in order to graduate. The consequences would have been severe if she had been given only the first opportunity (when she failed). If this inconsistency occurred for most test takers, the assessment results would have been shown to have insufficient consistency (i.e., reliability) to be useful. Reliability coefficients allow for the quantification of consistency so that decisions about utility can be made.

T and E cannot be observed directly for individuals, but the relative contribution of these components to X is what defines reliability. Instead of direct observations of these quantities, group-level estimates of their variability are used to arrive at an estimate of reliability. Reliability is usually defined as

ρ_XX′ = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E),  (2)

where X, T, and E are defined as before. From this equation one can see that reliability is the ratio of true score variance to observed score variance within a sample. Observed score variance is the only part of this equation that is observable. There are essentially three ways to estimate the unobserved portion of this equation. These three approaches correspond to three distinct types of reliability: stability reliability, alternate-form reliability, and internal consistency.

All three types of reliability are based on correlations that exploit the notion of parallel tests. Parallel tests are defined as tests having equal true scores (T = T′) and equal error variance (σ²_E = σ²_E′) and meeting all other assumptions of CTT. Based on these assumptions, it can be derived (though it will not be shown here) that the correlation between parallel tests provides an estimate of reliability (i.e., the ratio of true score variance to observed score variance). The correlation between these two scores is the observed reliability coefficient (r_XX′). The correlation can be interpreted as the proportion of variance in observed scores that is attributable to true scores; thus r_XX′ = r²_XT. The type of reliability (stability reliability, alternate-form reliability, internal consistency) being estimated is determined by how the notion of a parallel test is established.

The most straightforward type of reliability is stability reliability. Stability reliability (often called test–retest reliability) is estimated by having a representative sample of the intended testing population take the same instrument twice. Because the same test is being used at both measurement opportunities, the notion of parallel tests is easy to support. The same test is used on both occasions, so true scores and error variance should be the same.
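Equation 2 can be checked with a simulation. This is an illustrative sketch, not part of the original entry; the variance values are arbitrary assumptions. Two parallel forms are simulated as the same true scores plus independent, equal-variance errors, and their correlation recovers the variance ratio σ²_T / (σ²_T + σ²_E).

```python
import math
import random

random.seed(1)  # fixed seed for a reproducible illustration

VAR_T, VAR_E = 9.0, 3.0                  # assumed variance components
theoretical = VAR_T / (VAR_T + VAR_E)    # Equation 2: 9 / 12 = 0.75

# Two parallel administrations: equal true scores, independent errors
# with equal variance, as the parallel-tests definition requires.
true_scores = [random.gauss(50, math.sqrt(VAR_T)) for _ in range(20_000)]
form1 = [t + random.gauss(0, math.sqrt(VAR_E)) for t in true_scores]
form2 = [t + random.gauss(0, math.sqrt(VAR_E)) for t in true_scores]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

estimate = pearson(form1, form2)  # close to the theoretical 0.75
```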
Ideally, the two measurement opportunities will be close enough that examinees have not changed (learned or developed on the relevant construct). You would not, for instance, want to base your estimates of stability reliability on pretest–intervention–posttest data. The tests should also not be given too close together in time; otherwise, practice or fatigue effects might influence results. The appropriate period to wait between testing will depend on the construct and purpose of the test and may be anywhere from minutes to weeks.

Alternate-form reliability requires that each member of a representative sample respond on two alternate assessments. These alternate forms should have been built to be purposefully parallel in content and scores produced. The tests should be administered as close together as is practical, while avoiding fatigue effects. The correlation between these forms represents the alternate-form reliability. Higher correlations provide more confidence that the tests can be used interchangeably (comparisons of means and standard deviations will also influence this decision).

Having two administrations of assessments is often impractical. Internal consistency reliability methods require only a single administration. There are two major approaches to estimating internal consistency: split half and coefficient alpha. Split-half estimates are easily understood but have mostly been supplanted by the use of coefficient alpha. For a split-half approach, a test is split into two halves. This split is often created by convenience (e.g., all odd items in one half, all even items in the other half). The split can also be made more methodically (e.g., balancing test content and item types). Once the splits are obtained, the scores from the two halves are correlated. Conceptually, a single test is being used to create an estimate of alternate-form reliability. For reasons not discussed in this entry, the correlation from a split-half method will underestimate reliability unless it is corrected.

The Spearman–Brown prophecy formula for predicting the correlation that would have been obtained if each half had been as long as the full-length test is given by

ρ_XX′ = 2ρ_AB / (1 + ρ_AB),  (3)

where ρ_AB is the original correlation between the test halves and ρ_XX′ is the corrected reliability. If, for example, two test halves were found to have a correlation of .60, the actual reliability would be

ρ_XX′ = 2(.60) / (1 + .60) = 1.20 / 1.60 = .75.

As with all the estimates of reliability, the extent to which the two measures violate the assumption of parallel tests determines the accuracy of the result.

One of the drawbacks of using the split-half approach is that it does not produce a unique result. Other splits of the test will produce (often strikingly) different estimates of the internal consistency of the test. This problem is ameliorated by the use of single internal consistency coefficients that provide information similar to split-half reliability estimates. Coefficient alpha (often called Cronbach's alpha) is the most general (and most commonly used) estimate of internal consistency. The Kuder–Richardson 20 (KR-20) and 21 (KR-21) are sometimes reported, but these coefficients are special cases of coefficient alpha and do not require separate discussion. The formula for coefficient alpha is

ρ_XX′ ≥ α = [k / (k − 1)] [1 − (Σσ²_i) / σ²_X],  (4)

where k is the number of items, σ²_i is the variance of item i, and σ²_X is the variance of test X. Coefficient alpha is equal to the average of all possible split-half coefficients computed using Phillip Rulon's method (which uses information on the variances of total scores and the differences between the split-half scores). Coefficient alpha is considered a conservative estimate of reliability that can be interpreted as a lower bound to reliability. Alpha is one of the most commonly reported measures of reliability because it requires only a single test administration.

Although all three types of reliability are based on the notion of parallel tests, they are influenced by distinct types of error and, therefore, are not interchangeable. Stability reliability measures the extent to which occasions influence results. Errors associated with this type of reliability address how small changes in examinees or the testing environment impact results. Alternate-form reliability addresses how small differences in different versions of tests may impact results. Internal consistency addresses the way in which heterogeneity of items limits the information provided by an assessment.
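The two internal consistency computations just described can be sketched in Python. This is an illustrative sketch, not part of the original entry: the .60 split-half correlation is the entry's own example, while the small right/wrong response matrix is an arbitrary made-up illustration.

```python
def spearman_brown(r_half):
    """Equation 3: correct a split-half correlation to full test length."""
    return 2 * r_half / (1 + r_half)

def coefficient_alpha(scores):
    """Equation 4 applied to a persons-by-items score matrix,
    using population variances throughout."""
    k = len(scores[0])  # number of items

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# The entry's split-half example: .60 between halves corrects to .75.
corrected = spearman_brown(0.60)

# Arbitrary 4-person, 3-item right/wrong data, for illustration only.
data = [[1, 1, 1],
        [1, 0, 1],
        [0, 1, 0],
        [0, 0, 0]]
alpha = coefficient_alpha(data)
```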
In some sense, any one coefficient may be an overestimate of reliability (even alpha, which is considered to be conservative). Data collected on each type of reliability would yield three different coefficients that may be quite different from one another. Researchers should be aware that a reliability coefficient that included all three sources of error at once would likely be lower than a coefficient based on any one source of error. Generalizability theory is an extension of CTT that provides a framework for considering multiple sources of error at once. Generalizability coefficients will tend to be lower than CTT-based reliability coefficients but may more accurately reflect the amount of error in measurements. Barring the use of generalizability theory, a practitioner must decide what types of reliability are relevant to his or her research and make sure that there is evidence of that type of consistency (i.e., through consultation of appropriate reliability coefficients) in the test results.

Validity Investigations and Research With Classical Test Theory

CTT is mostly a framework for investigating reliability. Most treatments of CTT also include extensive descriptions of validity; however, similar techniques are used in the investigation of validity whether or not CTT is the test theory being employed. In shorthand, validity addresses the question of whether test results provide the intended information. As such, validity evidence is primary to any claim of utility. A measure can be perfectly reliable, but it is useless if the intended construct is not being measured. Returning to the bathroom scale example, if a scale always describes one adult of average build as 25 pounds and a second adult of average build as 32 pounds, the scale is reliable. However, the scale is not accurately reflecting the construct weight.

Many sources provide guidance about the importance of validity and frameworks for the types of data that constitute validity evidence. More complete treatments can be found in Samuel Messick's many writings on the topic. A full treatment is beyond the scope of this discussion. Readers unfamiliar with the current unified understanding of validity should consult a more complete reference; the topics addressed here might convey an overly simplified (and positivist) notion of validity.

Construct validity is the overarching principle in validity. It asks, Is the correct construct being measured? One of the principal ways that construct validity is established is by a demonstration that tests are associated with criteria or other tests that purport to measure the same (or related) constructs. The presence of strong associations (e.g., correlations) provides evidence of construct validity. Additionally, research agendas may be established that investigate whether test performance is affected by things (e.g., interventions) that should influence the underlying construct. Clearly, the collection of construct validity evidence is related to general research agendas. When one conducts basic or applied research using a quantitative instrument, two things are generally confounded in the results: (1) the validity of the instrument for the purpose being employed in the research and (2) the correctness of the research hypothesis. If a researcher fails to support his or her research hypothesis, there is often difficulty determining whether the result is due to insufficient validity of the assessment results or flaws in the research hypothesis.

One of CTT's largest contributions to our understanding of both validity investigations and research in general is the notion of reliability as an upper bound to a test's association with another measure. From Equation 1, observed score variance is understood to comprise a true score, T, and an error term, E. The error term is understood to be random error. Because it is random, it cannot correlate with anything else. Therefore, the true score component is the only component that may have a nonzero correlation with another variable. The notion of reliability as the ratio of true score variance to observed score variance (Equation 2) makes the idea of reliability as an upper bound explicit. Reliability is the proportion of systematic variance in observed scores. So even if there were a perfect correlation between a test's true score and a perfectly measured criterion, the observed correlation could be no larger than the square root of the reliability of the test. If both quantities involved in the correlation have less than perfect reliability, the observed correlation is attenuated even further.
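The upper-bound argument can be checked numerically. This is an illustrative sketch, not part of the original entry, and the reliability values are arbitrary assumptions; the ceiling for two imperfectly reliable measures follows the standard attenuation result that the observed correlation cannot exceed the product of the square roots of the two reliabilities.

```python
import math

def max_observed_correlation(rel_x, rel_y=1.0):
    """Ceiling on the observed correlation between two measures with
    reliabilities rel_x and rel_y: even a perfect true-score
    correlation is attenuated to sqrt(rel_x) * sqrt(rel_y)."""
    return math.sqrt(rel_x) * math.sqrt(rel_y)

# A test with reliability .81 against a perfectly measured criterion:
# the observed correlation can be at most sqrt(.81) = .90.
ceiling = max_observed_correlation(0.81)

# If the criterion itself has reliability .64, the ceiling drops
# further, to 0.9 * 0.8 = 0.72.
lower_ceiling = max_observed_correlation(0.81, 0.64)
```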
Group-Dependent Estimates

Estimates in CTT are highly group dependent. Any evidence of reliability and validity is useful only if that evidence was collected on a group similar to the current target group. A test that is quite reliable and valid with one population may not be with another. To make claims about the utility of instruments, testing materials (e.g., test manuals or technical reports) must demonstrate the appropriateness and representativeness of the samples used.

CLINICAL SIGNIFICANCE

In treatment outcome research, statistically significant changes in symptom severity or end-state functioning have traditionally been used to demonstrate treatment efficacy. In more recent studies, the effect size, or magnitude of change associated with the experimental intervention, has also been an important consideration in data analysis and interpretation. To truly understand the impact of a research intervention, it is essential for the investigator to adjust the lens, or "zoom out," to also examine other signifiers of change.
Clinical significance is one such marker and refers to the meaningfulness, or impact, of an intervention on clients and others in their social environment. An intervention that is clinically significant must demonstrate substantial or, at least, reasonable benefit to the client or others, such as family members, friends, or coworkers. Benefits gained, actual or perceived, must be weighed against the costs of the intervention. These costs may include financial, time, or family burden. Some researchers use the term practical significance as a synonym for clinical significance because both terms consider the import of a research finding in everyday life. However, there are differences in the usage of the two terms. Clinical significance is typically constrained to treatment outcome or prevention studies, whereas practical significance is used broadly across many types of psychological research, including cognitive neuroscience, developmental psychology, environmental psychology, and social psychology. This entry discusses the difference between statistical and clinical significance and describes methods for measuring clinical significance.

Statistical Significance Versus Clinical Significance

Statistically significant findings do not always correspond to the client's phenomenological experience or overall evaluation of beneficial impact. First, a research study may have a small effect size yet reveal statistically significant findings due to high power. This typically occurs when a large sample size has been used. Nevertheless, the clinical significance of the research findings may be trivial from the research participants' perspective. Second, in other situations, a moderate effect size may yield statistically significant results, yet the pragmatic benefit to the client in his or her everyday life is questionable. For example, children diagnosed with attention-deficit/hyperactivity disorder may participate in an intervention study designed to increase concentration and reduce disruptive behavior. The investigators conclude that the active intervention was beneficial on the basis of significant improvement on Conners Rating Scales scores as well as significantly improved performance on a computerized test measuring sustained attention. The data interpretation is correct from a statistical point of view. However, the majority of parents view the intervention as inconsequential because their children continue to evidence a behavioral problem that disrupts home life. Moreover, most of the treated children see the treatment as "a waste of time" because they are still being teased or ostracized by peers. Despite significant sustained attention performance improvements, classroom teachers also rate the experimental intervention as ineffective because they did not observe meaningful changes in academic performance.

A third scenario is the case of null research findings in which there is also an inconsistency between statistical interpretation and the client or family perspective. For instance, an experimental treatment outcome study is conducted with adult trauma survivors compared with treatment as usual. Overall, the treatment-as-usual group performed superiorly to the new intervention. In fact, on the majority of posttraumatic stress disorder (PTSD) measures, the experimental group evidenced no statistical change from pre- to posttest. Given the additional costs of the experimental intervention, the investigators may decide that it is not worth further investigation. However, qualitative interviews are conducted with the participants. The investigators are surprised to learn that most participants receiving the intervention are highly satisfied although they continue to meet PTSD diagnostic criteria. The interviews demonstrate clinical significance among the participants, who perceive a noticeable reduction in the intensity of daily dissociative symptoms. These participants see the experimental intervention as quite beneficial in terms of facilitating tasks of daily living and improving their quality of life.

When planning a research study, the investigator should consider who will evaluate clinical significance (client, family member, investigator, original treating therapist) and what factor(s) are important to the evaluator (changes in symptom severity, functioning, personal distress, emotional regulation, coping strategies, social support resources, quality of life, etc.). An investigator should also consider the cultural context and the rater's cultural expertise in the area under examination.
Otherwise, it may be challenging to account for unexpected results. For instance, a 12-week mindfulness-based cognitive intervention for depressed adults is initially interpreted as successful in improving mindfulness skills, on the basis of statistically significant improvements on two dependent measures: scores on a well-validated self-report mindfulness measure and attention focus ratings by the Western-trained research therapists. The investigators are puzzled that most participants do not perceive mindfulness skill training as beneficial; that is, the training has not translated into noticeable improvements in depression or daily life functioning. The investigators request a second evaluation by two mindfulness experts—a non-Western meditation practitioner versed in traditional mindfulness practices and a highly experienced yoga practitioner. The mindfulness experts independently evaluate the research participants and conclude that culturally relevant mindfulness performance markers (postural alignment, breath control, and attentional focus) are very weak among participants who received the mindfulness intervention.

Clinical significance may be measured several ways, including subjective evaluation, absolute change (did the client evidence a complete return to premorbid functioning, or how much has an individual client changed across the course of treatment without comparison with a reference group), comparison method, or societal impact indices. In most studies, the comparison method is the most typically employed strategy for measuring clinical significance. It may be used for examining whether the client group returns to a normative level of symptoms or functioning at the conclusion of treatment. Alternatively, the comparison method may be used to determine whether the experimental group statistically differs from an impaired group at the conclusion of treatment even if the experimental group has not returned to premorbid functioning. For instance, a treatment outcome study is conducted with adolescents diagnosed with bulimia nervosa. After completing the intervention, the investigators may consider whether the level of body image dissatisfaction is comparable to that of a normal comparison group with no history of eating disorders, or the study may examine whether binge–purge end-state functioning statistically differs from a group of individuals currently meeting diagnostic criteria for bulimia nervosa.

The Reliable Change Index, developed by Neil Jacobson and colleagues, and equivalence testing are the most commonly used comparison method strategies. These comparative approaches have several limitations, however, and the reader is directed to the Further Readings for more information on the conceptual and methodological issues.

The tension between researchers and clinical practitioners will undoubtedly lessen as clinical significance is foregrounded in future treatment outcome studies. Quantitative and qualitative measurement of clinical significance will be invaluable in deepening our understanding of factors and processes that contribute to client transformation.

Carolyn Brodbeck

See also Effect Size, Measures of; Power; Significance, Statistical

Further Readings

Beutler, L. E., & Moleiro, C. (2001). Clinical versus reliable and significant change. Clinical Psychology: Science and Practice, 8, 441–445.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 335–352.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Jensen, P. S. (2001). Clinical equivalence: A step, misstep, or just a misnomer? Clinical Psychology: Science & Practice, 8, 436–440.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting & Clinical Psychology, 67, 285–299.

CLINICAL TRIAL

A clinical trial is a prospective study that involves human subjects in which an intervention is to be evaluated. In a clinical trial, subjects are followed from a well-defined starting point or baseline.
evaluated. In a clinical trial, subjects are followed from a well-defined starting point or baseline. The goal of a clinical trial is to determine whether a cause-and-effect relationship exists between the intervention and response. Examples of interventions used in clinical trials include drugs, surgery, medical devices, and education and subject management strategies. In each of these cases, clinical trials are conducted to evaluate both the beneficial and harmful effects of the new intervention on human subjects before it is made available to the population of interest. Special considerations for conducting clinical trials include subject safety and informed consent, subject compliance, and intervention strategies to avoid bias. This entry describes the different types of clinical trials and discusses ethics in relation to clinical trials.

Drug Development Trials

Clinical trials in drug development follow from laboratory experiments, usually involving in vitro experiments or animal studies. The traditional goal of a preclinical study is to obtain preliminary information on pharmacology and toxicology. Before a new drug may be used in human subjects, several regulatory bodies, such as the internal review board (IRB), Food and Drug Administration (FDA), and data safety monitoring board, must formally approve the study. Clinical trials in drug development are conducted in a sequential fashion and categorized as Phase I, Phase II, Phase III, and Phase IV trial designs. The details of each phase of a clinical trial investigation are well defined within a document termed a clinical trial protocol. The FDA provides recommendations for the structure of Phase I through III trials in several disease areas.

Phase I trials consist primarily of healthy volunteer and participant studies. The primary objective of a Phase I trial is to determine the maximum tolerated dose. Other objectives include determining drug metabolism and bioavailability (how much drug reaches the circulation system). Phase I studies generally are short-term studies that involve monitoring toxicities in small cohorts of participants treated at consecutively higher dose levels of the new drug in order to estimate the maximum tolerated dose.

Phase II trials build on the Phase I results in terms of which dose level or levels warrant further investigation. Phase II trials are usually fairly small-scale trials. In cancer studies, Phase II trials traditionally involve a single dose with a surrogate end point for mortality, such as change in tumor volume. The primary comparison of interest is the effect of the new regimen versus established response rates. In other diseases, such as cardiology, Phase II trials may involve multiple dose levels and randomization. The primary goals of a Phase II trial are to determine the optimal method of administration and examine the potential efficacy of a new regimen. Phase II trials generally have longer follow-up times than do Phase I trials. Within Phase II trials, participants are closely monitored for safety. In addition, pharmacokinetic, pharmacodynamic, or pharmacogenomic studies, or a combination, are often incorporated as part of the Phase II trial design. In many settings, two or more Phase II trials are undertaken prior to a Phase III trial.

Phase III trials are undertaken if the Phase II trial or trials demonstrate that the drug may be reasonably safe and potentially effective. The primary goal of a Phase III trial is to compare the effectiveness of the new treatment with that of either a placebo condition or standard of care. Phase III trials may involve long-term follow-up and many participants. The sample size is determined using precise statistical methods based on the end point of interest, the clinically relevant difference between treatment arms, and control of Type I and Type II error rates. The FDA may require two Phase III trials for approval of new drugs. The process from drug synthesis to the completion of a Phase III trial may take several years. Many investigations may never reach the Phase II or Phase III trial stage.

The gold-standard design in Phase III trials employs randomization and is double blind. That is, participants who are enrolled in the study agree to have their treatment selected by a random or pseudorandom process, and neither the evaluator nor the participants have knowledge of the true treatment assignment. Appropriate randomization procedures ensure that the assigned treatment is independent of any known or unknown prognostic
benefits of clinical research be distributed fairly. An injustice occurs when one group in society bears a disproportionate burden of research while another reaps a disproportionate share of its benefit.

Alan Hutson and Mark Brady

See also Adaptive Designs in Clinical Trials; Ethics in the Research Process; Group-Sequential Designs in Clinical Trials

Further Readings

Friedman, L. M., Furberg, C. D., & Demets, D. L. (1998). Fundamentals of clinical trials (3rd ed.). New York: Springer Science.
Machin, D., Day, S., & Green, S. (2006). Textbook of clinical trials (2nd ed.). Hoboken, NJ: Wiley.
Piantadosi, S. (2005). Clinical trials: A methodologic perspective (2nd ed.). Hoboken, NJ: Wiley.
Senn, S. (2008). Statistical issues in drug development (2nd ed.). Hoboken, NJ: Wiley.

CLUSTER SAMPLING

A variety of sampling strategies are available in cases when setting or context creates restrictions. For example, stratified sampling is used when the population's characteristics such as ethnicity or gender are related to the outcome or dependent variables being studied. Simple random sampling, in contrast, is used when there is no regard for strata or defining characteristics of the population from which the sample is drawn. The assumption is that the differences in these characteristics are normally distributed across all potential participants.

Cluster sampling is the selection of units of natural groupings rather than individuals. For example, in marketing research, the question at hand might be how adolescents react to a particular brand of chewing gum. The researcher may access such a population through traditional channels such as the high school but may also visit places where these potential participants tend to spend time together, such as shopping malls and movie theaters. Rather than counting each one of the adolescents' responses to a survey as one data point, the researcher would count the entire group's average as the data point. The assumption is that the group data point is a small representative of the population of all adolescents. The fact that the collection of individuals in the unit serves as the data point, rather than each individual serving as a data point, differentiates this sampling technique from most others.

An advantage of cluster sampling is that it is a great time saver and relatively efficient in that travel time and other expenses are saved. The primary disadvantage is that one can lose the heterogeneity that exists within groups by taking all in the group as a single unit; in other words, the strategy may introduce sampling error. Cluster sampling is also known as geographical sampling because areas such as neighborhoods become the unit of analysis.

An example of cluster sampling can be seen in a study by Michael Burton from the University of California and his colleagues, who used both stratified and cluster sampling to draw a sample from the United States Census Archives for California in 1880. These researchers emphasized that with little effort and expended resources, they obtained very useful knowledge about California in 1880 pertaining to marriage patterns, migration patterns, occupational status, and categories of ethnicity. Another example is a study by Lawrence T. Lam and L. Yang of duration of sleep and attention deficit/hyperactivity disorder among adolescents in China. The researchers used a variant of simple cluster sampling, a two-stage random cluster sampling design, to assess duration of sleep.

While cluster sampling may not be the first choice given all options, it can be a highly targeted sampling method when resources are limited and sampling error is not a significant concern.

Neil J. Salkind

See also Convenience Sampling; Experience Sampling Method; Nonprobability Sampling; Probability Sampling; Proportional Sampling; Quota Sampling; Random Sampling; Sampling Error; Stratified Sampling; Systematic Sampling

Further Readings

Burton, M., Della Croce, M., Masri, S. A., Bartholomew, M., & Yefremain, A. (2005). Sampling from the
United States Census Archives. Field Methods, 17(1), 102–118.
Ferguson, D. A. (2009). Name-based cluster sampling. Sociological Methods & Research, 37(4), 590–598.
Lam, L. T., & Yang, L. (2008). Duration of sleep and ADHD tendency among adolescents in China. Journal of Attention Disorders, 11, 437–444.
Pedulla, J. J., & Airasian, P. W. (1980). Sampling from samples: A comparison of strategies in longitudinal research. Educational & Psychological Measurement, 40, 807–813.

COEFFICIENT ALPHA

Coefficient alpha, or Cronbach's alpha, is one way to quantify reliability and represents the proportion of observed score variance that is true score variance. Reliability is a property of a test that is derived from true scores, observed scores, and measurement error. Scores or values that are obtained from the measurement of some attribute or characteristic of a person (e.g., level of intelligence, preference for types of foods, spelling achievement, body length) are referred to as observed scores. In contrast, true scores are the scores one would obtain if these characteristics were measured without any random error. For example, every time you go to the doctor, the nurse measures your height. That is the observed height or observed "score" by that particular nurse. You return for another visit 6 months later, and another nurse measures your height. Again, that is an observed score. If you are an adult, it is expected that your true height has not changed in the 6 months since you last went to the doctor, but the two values might be different by .5 inch. When measuring the quantity of anything, whether it is a physical characteristic such as height or a psychological characteristic such as food preferences, spelling achievement, or level of intelligence, it is expected that the measurement will always be unreliable to some extent. That is, there is no perfectly reliable measure. Therefore, the observed score is the true score plus some amount of error, or an error score.

Measurement error can come from many different sources. For example, one nurse may have taken a more careful measurement of your height than the other nurse. Or you may have stood up straighter the first time you were measured. Measurement error for psychological attributes such as preferences, values, attitudes, achievement, and intelligence can also influence observed scores. For example, on the day of a spelling test, a child could have a cold that may negatively influence how well she would perform on the test. She may get a 70% on a test when she actually knew 80% of the material. That is, her observed score may be lower than her true score in spelling achievement. Thus, temporary factors such as physical health, emotional state of mind, guessing, outside distractions, misreading answers, or misrecording answers would artificially inflate or deflate the true scores for a characteristic. Characteristics of the test or the test administration can also create measurement error.

Ideally, test users would like to interpret an individual's observed scores on a measure to reflect the person's true characteristic, whether it is physical (e.g., blood pressure, weight) or psychological (e.g., knowledge of world history, level of self-esteem). In order to evaluate the reliability of scores for any measure, one must estimate the extent to which individual differences are a function of the real or true score differences among respondents versus the extent to which they are a function of measurement error. A test that is considered reliable minimizes the measurement error so that error is not highly correlated with true score. That is, the relationship between the true score and observed score should be strong.

Given the assumptions of classical test theory, observed or empirical test scores can be used to estimate measurement error. There are several different ways to calculate empirical estimates of reliability. These include test–retest reliability, alternate- or parallel-forms reliability, and internal consistency reliability. To calculate test–retest reliability, respondents must take the same test or measurement twice (e.g., the same nurse measuring your height at two different points in time). Alternate-forms reliability requires the construction of two tests that measure the same set of true scores and have the same amount of error variance (this is theoretically possible, but difficult in a practical sense). Thus, the respondent completes both forms of the test in order to determine reliability of the measures. Internal consistency reliability is a practical alternative to
the test–retest and parallel-forms procedures does the split-half method of determining inter-
because the respondents have to complete only nal consistency reliability. Thus, internal consis-
one test at any one time. One form of internal tency reliability is an estimate of how well the
consistency reliability is split-half reliability, in sum score on the items captures the true score
which the items for a measure are divided into on the entire domain from which the items are
two parts or subtests (e.g., odd- and even-num- derived.
bered items), composite scores are computed for Internal consistency, such as measured by Cron-
each subtest, and the two composite scores are bach’s alpha, is a measure of the homogeneity of
correlated to provide an estimate of total test the items. When the various items of an instrument
reliability. (This value is then adjusted by means are measuring the same construct (e.g., depression,
of the Spearman–Brown prophecy formula.) knowledge of subtraction), then scores on the
Split-half reliability is not used very often, items will tend to covary. That is, people will tend
because it is difficult for the two halves of the to score the same way across many items. The
test to meet the criteria of being ‘‘parallel.’’ That items on a test that has adequate or better internal
is, how the test is divided is likely to lead to sub- consistency will be highly intecorrelated. Internal
stantially different estimates of reliability. consistency reliability is the most appropriate type
The most widely used method of estimating of reliability for assessing dynamic traits or traits
reliability is coefficient alpha (α), which is an that change over time, such as test anxiety or
estimate of internal consistency reliability. Lee elated mood.
Cronbach’s often-cited article entitled ‘‘Coeffi-
cient Alpha and the Internal Structure of Tests’’
Calculating Coefficient Alpha
was published in 1951. This coefficient proved
very useful for several reasons. First only one test The split-half method of determining internal con-
administration was required rather than more sistency is based on the assumption that the two
than one, as in test–retest or parallel-forms esti- halves represent parallel subtests and that the cor-
mates of reliability. Second, this formula could relation between the two halves produces the reli-
be applied to dichotomously scored items or ability estimate. For the ‘‘item-level’’ approach,
polytomous items. Finally, it was easy to calcu- such as the coefficient alpha, the logic of the split-
late, at a time before most people had access to half approach is taken further in that each item is
computers, from the statistics learned in a basic viewed as a subtest. Thus, the association between
statistics course. items can be used to represent the reliability of the
Coefficient alpha is also know as the ‘‘raw’’ entire test. The item-level approach is a two-step
coefficient alpha. This method and other meth- process. In the first step, item-level statistics are
ods of determining internal consistency reliabil- calculated (item variances, interitem covariances,
ity (e.g., the generalized Spearman–Brown or interitem correlations). In the second step, the
formula or the standardized alpha estimate) have item-level information is entered into specialized
at least two advantages over the lesser-used split- equations to estimate the reliability of the com-
half method. First, they use more information plete test.
about the test than the split-half method does. Below is the specialized formula for the calcula-
Imagine if a split-half reliability was computed, tion of coefficient alpha. Note that the first step is
and then we randomly divided the items from to determine the variance of scores on the com-
the same sample into another set of split halves plete test.
and recomputed, and kept doing this with all
n SD2 P SD2
possible combinations of split-half estimates of
α ¼ × X i
;
reliability. Cronbach’s alpha is mathematically n1 SD2X
equivalent to all possible split-half estimates,
although it is not computed that way. Second, where n equals the number of components
methods of calculating internal consistency esti- (items or subtests), SD2X is the variance of the
mates require fewer assumptions about the sta- observed total test scores, and SD2i is the variance
tistical properties of the individual items than of component i.
Coefficient Alpha 161
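The raw-score formula can be seen at work in a few lines of Python. This is an illustrative sketch, not part of the entry; the `coefficient_alpha` helper and its toy item scores are invented for demonstration.

```python
def coefficient_alpha(items):
    """Raw coefficient alpha from a list of item-score lists.

    items[i][p] is person p's score on item i. Population variances are
    used throughout; any consistent variance estimator gives the same
    ratio. Implements alpha = n/(n-1) * (SD2_X - sum of SD2_i) / SD2_X.
    """
    def pvar(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    n = len(items)  # number of components (items or subtests)
    totals = [sum(scores) for scores in zip(*items)]  # total score per person
    sd2_x = pvar(totals)                  # variance of observed total scores
    sum_sd2_i = sum(pvar(i) for i in items)  # sum of the component variances
    return (n / (n - 1)) * (sd2_x - sum_sd2_i) / sd2_x

# Three perfectly correlated items: all observed variance is shared
# variance, so alpha reaches its maximum of 1.0.
identical = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(round(coefficient_alpha(identical), 6))  # 1.0
```

When the items are identical, the total variance is entirely attributable to shared (true score) variance and the formula returns 1; as items become less correlated, the summed item variances approach the total variance and alpha falls toward 0.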
An alternate way to calculate coefficient alpha is

α = Nr / (v + (N − 1)r),

where N equals the number of components, v is the average variance, and r is the average of all Pearson correlation coefficients between components.

The methods of calculating coefficient alpha as just described use raw scores (i.e., no transformation has been made to the item scores). There is also the standardized coefficient alpha, described in some statistical packages as "Cronbach's Alpha Based on Standardized Items" (SPSS [an IBM company, formerly called PASW® Statistics] Reliability Analysis procedure) or "Cronbach Coefficient Alpha for Standardized Variables" (SAS). This standardized alpha estimate of reliability provides an estimate of reliability for an instrument in which scores on all items have been standardized to have equal means and standard deviations. The use of the raw score formula or the standardized score formula often produces roughly similar results.

Optimizing Coefficient Alpha

There are at least four basic ways to influence the magnitude of Cronbach's coefficient alpha. The first has to do with the characteristics of the sample (e.g., homogeneity vs. heterogeneity). Second and third, respectively, are the characteristics of the sample (e.g., size) and the number of items in the instrument. The final basic way occurs during the construction of the instrument.

Reliability is sample specific. A homogeneous sample with reduced true score variability may reduce the alpha coefficient. Therefore, if you administered an instrument measuring depression to only those patients recently hospitalized for depression, the reliability may be somewhat low because all participants in the sample have already been diagnosed with severe depression, and so there is not likely to be much variability in the scores on the measure. However, if one were to administer the same instrument to a general population, the larger variance in depression scores would be likely to produce higher reliability estimates. Therefore, to optimize Cronbach's coefficient alpha, it would be important to use a heterogeneous sample, which would have maximum true score variability. The sample size is also likely to influence the magnitude of coefficient alpha. This is because a larger sample is more likely to produce a larger variance in the true scores. Typically, large numbers of subjects (in excess of 200) are needed to obtain generalizable reliability estimates.

The easiest way to increase coefficient alpha is by increasing the number of items. Therefore, if, for example, a researcher is developing a new measure of depression, he or she may want to begin with a large number of items to assess various aspects of depression (e.g., depressed mood, feeling fatigued, loss of interest in favorite activities). That is, the total variance becomes larger, relative to the sum of the variances of the items, as the number of items is increased. It can also be shown that when the interitem correlations are about the same, alpha approaches one as the number of items approaches infinity. However, it is also true that reliability is expected to be high even when the number of items is relatively small if the correlations among them are high. For example, a measure with 3 items whose average intercorrelation is .50 is expected to have a Cronbach's alpha coefficient of .75. This same alpha coefficient of .75 can be calculated from a measure composed of 9 items with an average intercorrelation among the 9 items of .25 and of 27 items when the average intercorrelation among them is .10.

Selecting "good" items during the construction of an instrument is another way to optimize the alpha coefficient. That is, scale developers typically want items that correlate highly with each other. Most statistical packages provide an item-total correlation as well as a calculation of the internal consistency reliability if any single item were removed. So one can choose to remove any items that reduce the internal consistency reliability coefficient.
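The trade-off between test length and average intercorrelation can be checked directly. With standardized items the average variance is 1, so the correlation-based formula reduces to α = Nr / (1 + (N − 1)r). The short Python sketch below is illustrative only; the helper name is invented, and the (N, r) pairs are the ones given in the text.

```python
def alpha_from_avg_correlation(n_items, r_bar):
    """Standardized coefficient alpha from the number of items and their
    average intercorrelation, taking the average item variance as 1."""
    return (n_items * r_bar) / (1 + (n_items - 1) * r_bar)

# The entry's example: three very different designs, same alpha of .75.
for n, r in [(3, 0.50), (9, 0.25), (27, 0.10)]:
    print(n, r, round(alpha_from_avg_correlation(n, r), 2))  # each prints 0.75
```

Tripling the number of items while halving (and then more than halving) the average intercorrelation leaves alpha unchanged, which is why a high alpha by itself says nothing about how strongly the individual items relate to one another.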
Interpreting Cronbach's Alpha Coefficient

A reliable test minimizes random measurement error so that error is not highly correlated with the true scores. The relationship between the true score and the observed scores should be strong. A reliability coefficient is the proportion of the observed score variance that is true score variance. Thus, a coefficient alpha of .70 for a test means that 30% of the variance in scores is random and not meaningful. Rules of thumb exist for interpreting the size of coefficient alphas. Typically, a "high" reliability coefficient is considered to be .90 or above, "very good" is .80 to .89, and "good" or "adequate" is .70 to .79. Cronbach's alpha is a lower-bound estimate. That is, the actual reliability may be slightly higher. It is also considered to be the most accurate type of reliability estimate within the classical theory approach, along with the Kuder–Richardson 20, which is used only for dichotomous variables.

The interpretation of coefficient alpha and other types of reliability depends to some extent on just what is being measured. When tests are used to make important decisions about people, it would be essential to have high reliability (e.g., .90 or above). For example, individualized intelligence tests have high internal consistency. Often an intelligence test is used to make important final decisions. In contrast, lower reliability (e.g., .60 to .80) may be acceptable for looking at group differences in such personality characteristics as the level of extroversion.

It should be noted that the internal consistency approach applied through coefficient alpha assumes that the items, or subparts, of an instrument measure the same construct. Broadly speaking, that means the items are homogeneous. However, there is no general agreement about what the term homogeneity means and how it might be measured. Some authors interpret it to mean unidimensionality, or having only one factor. However, Cronbach did not limit alpha to an instrument with only one factor. In fact, he said in his 1951 article, "Alpha estimates the proportion of the test variance due to all common factors among items. That is, it reports how much the test score depends upon general and group rather than item specific factors" (p. 320). Cronbach's "general" factor is the first or most important factor, and alpha can be high even if there is no general factor underlying the relations among the items. This will happen if two or more "group" or common factors underlie the relations among the items. That is, an instrument with distinct sets of items or factors can still have an average intercorrelation among the items that is relatively large and would then result in a high alpha coefficient. Also, as discussed previously, even if the average intercorrelation among items is relatively small, the alpha can be high if the number of items is relatively large. Therefore, a high internal consistency estimate cannot serve as evidence of the homogeneity of a measure. However, a low internal consistency reliability coefficient does mean that the measure is not homogeneous because the items do not correlate well together.

It is important to keep several points in mind about reliability when designing or choosing an instrument. First, although reliability reflects variance due to true scores, it does not indicate what the true scores are measuring. If one uses a measure with a high internal consistency (.90), it may be that the instrument is measuring something different from what is postulated. For example, a personality instrument may be assessing social desirability and not the stated construct. Often the name of the instrument indicates the construct being tapped (e.g., the XYZ Scale of Altruism), but adequate reliability does not mean that the measure is assessing what it purports to measure (in this example, altruism). That is a validity argument.

Karen D. Multon and Jill S. M. Coleman

See also Classical Test Theory; "Coefficient Alpha and the Internal Structure of Tests"; Correlation; Instrumentation; Internal Consistency Reliability; Pearson Product-Moment Correlation Coefficient; Reliability; Spearman–Brown Prophecy Formula; "Validity"

Further Readings

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and application. Journal of Applied Psychology, 78, 98–104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Osburn, H. G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5, 343–355.

"COEFFICIENT ALPHA AND THE INTERNAL STRUCTURE OF TESTS"

Lee Cronbach's 1951 Psychometrika article "Coefficient Alpha and the Internal Structure of Tests" established coefficient alpha as the preeminent estimate of internal consistency reliability. Cronbach demonstrated that coefficient alpha is the mean of all split-half reliability coefficients and discussed the manner in which coefficient alpha should be interpreted. Specifically, alpha estimates the correlation between two randomly parallel tests administered at the same time and drawn from a universe of items like those in the original test. Further, Cronbach showed that alpha does not require the assumption that items be unidimensional. In his reflections 50 years later, Cronbach described how coefficient alpha fits within generalizability theory, which may be employed to obtain more informative explanations of test score variance.

Concerns about the accuracy of test scores are commonly addressed by computing reliability coefficients. An internal consistency reliability coefficient, which may be obtained from a single test administration, estimates the consistency of scores on repeated test administrations taking place at the same time (i.e., no changes in examinees from one test to the next). Split-half reliability coefficients, which estimate internal consistency reliability, were established as a standard of practice for much of the early 20th century, but such coefficients are not unique because they depend on particular splits of items into half tests. Cronbach presented coefficient alpha as an alternative method for estimating internal consistency reliability. Alpha is computed as follows:

α = [k / (k − 1)] × (1 − Σ s²_i / s²_t),

where k is the number of items, s²_i is the variance of scores on item i, and s²_t is the variance of total test scores. As demonstrated by Cronbach, alpha is the mean of all possible split-half coefficients for a test. Alpha is generally applicable for studying measurement consistency whenever data include multiple observations of individuals (e.g., item scores, ratings from multiple judges, stability of performance over multiple trials). Cronbach showed that the well-known Kuder–Richardson formula 20 (KR-20), which preceded alpha, was a special case of alpha when items are scored dichotomously.

One sort of internal consistency reliability coefficient, the coefficient of precision, estimates the correlation between a test and a hypothetical replicated administration of the same test when no changes in the examinees have occurred. In contrast, Cronbach explained that alpha, which estimates the coefficient of equivalence, reflects the correlation between two different k-item tests randomly drawn (without replacement) from a universe of items like those in the test and administered simultaneously. Since the correlation of a test with itself would be higher than the correlation between different tests drawn randomly from a pool, alpha provides a lower bound for the coefficient of precision. Note that alpha (and other internal consistency reliability coefficients) provides no information about variation in test scores that could occur if repeated testings were separated in time. Thus, some have argued that such coefficients overstate reliability.

Cronbach dismissed the notion that alpha requires the assumption of item unidimensionality (i.e., all items measure the same aspect of individual differences). Instead, alpha provides an estimate (lower bound) of the proportion of variance in test scores attributable to all common factors accounting for item responses. Thus, alpha can reasonably be applied to tests typically administered in educational settings and that comprise items that call on several skills or aspects of understanding in different combinations across items. Coefficient alpha, then, climaxed 50 years of work on correlational conceptions of reliability begun by Charles Spearman.

In a 2004 article published posthumously, "My Current Thoughts on Coefficient Alpha and Successor Procedures," Cronbach expressed doubt that coefficient alpha was the best way of judging reliability.
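The equivalence noted above, alpha as the mean of all possible split-half coefficients, can be checked numerically. The sketch below is illustrative rather than from the entry: it applies the Flanagan split-half formula, 2(1 − (s²_A + s²_B) / s²_t), to every split of an invented four-item test into two halves and compares the average with alpha.

```python
from itertools import combinations

def pvar(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(items)
    totals = [sum(t) for t in zip(*items)]
    return (k / (k - 1)) * (1 - sum(pvar(i) for i in items) / pvar(totals))

def flanagan(items, half):
    """Flanagan split-half coefficient 2 * (1 - (var(A) + var(B)) / var(total))."""
    n_persons = len(items[0])
    other = [i for i in range(len(items)) if i not in half]
    a = [sum(items[i][p] for i in half) for p in range(n_persons)]
    b = [sum(items[i][p] for i in other) for p in range(n_persons)]
    total = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pvar(a) + pvar(b)) / pvar(total))

# Invented scores: items[i][p] is person p's score on item i.
items = [[1, 2, 1, 3, 4],
         [2, 3, 2, 4, 5],
         [1, 3, 2, 3, 5],
         [2, 2, 1, 4, 4]]

# All ways to choose a half of 2 items out of 4 (each split counted twice,
# once per half; flanagan() is symmetric, so the mean is unaffected).
splits = list(combinations(range(4), 2))
mean_split_half = sum(flanagan(items, s) for s in splits) / len(splits)
print(abs(alpha(items) - mean_split_half) < 1e-9)  # True
```

The agreement is exact (up to floating-point rounding) for any data set, because each pair of items lands in opposite halves with the same probability across splits, which is what Cronbach's proof exploits.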
It covered only a small part of the range of measurement uses, and consequently it should be viewed within a much larger system of reliability analysis, generalizability theory. Moreover, alpha focused attention on reliability coefficients when that attention should instead be cast on measurement error and the standard error of measurement.

For Cronbach, the extension of alpha (and classical test theory) came when Fisherian notions of experimental design and analysis of variance were put together with the idea that some "treatment" conditions could be considered random samples from a large universe, as alpha assumes about item sampling. Measurement data, then, could be collected in complex designs with multiple variables (e.g., items, occasions, and rater effects) and analyzed with random-effects analysis of variance models. The goal was not so much to estimate a reliability coefficient as to estimate the components of variance that arose from multiple variables and their interactions in order to account for observed score variance. This approach of partitioning effects into their variance components provides information as to the magnitude of each of the multiple sources of error and a standard error of measurement, as well as an "alpha-like" reliability coefficient for complex measurement designs.

Moreover, the variance-component approach can provide the value of "alpha" expected by increasing or decreasing the number of items (or raters or occasions) like those in the test. In addition, the proportion of observed score variance attributable to variance in item difficulty (or, for example, rater stringency) may also be computed, which is especially important to contemporary testing programs that seek to determine whether examinees have achieved an absolute, rather than relative, level of proficiency. Once these possibilities were envisioned, coefficient alpha morphed into generalizability theory, with sophisticated analyses involving crossed and nested designs with random and fixed variables (facets) producing variance components for multiple measurement facets such as raters and testing occasions so as to provide a complex standard error of measurement.

By all accounts, coefficient alpha—Cronbach's alpha—has been and will continue to be the most popular method for estimating behavioral measurement reliability. As of 2004, the 1951 coefficient alpha article had been cited in more than 5,000 publications.

Jeffrey T. Steedle and Richard J. Shavelson

See also Classical Test Theory; Generalizability Theory; Internal Consistency Reliability; KR-20; Reliability; Split-Half Reliability

Further Readings

Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational & Psychological Measurement, 64(3), 391–418.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (pp. 65–110). Westport, CT: Praeger.
Shavelson, R. J. (2004). Editor's preface to Lee J. Cronbach's "My Current Thoughts on Coefficient Alpha and Successor Procedures." Educational & Psychological Measurement, 64(3), 389–390.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

COEFFICIENT OF CONCORDANCE

Proposed by Maurice G. Kendall and Bernard Babington Smith, Kendall's coefficient of concordance (W) is a measure of the agreement among several (m) quantitative or semiquantitative variables that are assessing a set of n objects of interest. In the social sciences, the variables are often people, called judges, assessing different subjects or situations. In community ecology, they may be species whose abundances are used to assess habitat quality at study sites. In taxonomy, they may be characteristics measured over different species, biological populations, or individuals.

There is a close relationship between Milton Friedman's two-way analysis of variance without replication by ranks and Kendall's coefficient of concordance. They address hypotheses concerning the same data table, and they use the same χ² statistic for testing. They differ only in the formulation of their respective null hypothesis. Consider Table 1, which contains illustrative data. In Friedman's test, the null hypothesis is that there is no
Table 1 Illustrative Example: Ranked Relative Abundances of Four Soil Mite Species (Variables) at 10 Sites (Objects)
Ranks (column-wise) Sum of Ranks
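As a computational sketch of a table like Table 1 (the rankings below are hypothetical, not the mite data), Kendall's W can be obtained from the variance of the rank sums with the standard no-ties formula; Friedman's χ² (Equation 5) and the F statistic (Equation 6), defined later in this entry, then follow directly, and Equation 4 can be checked against the mean pairwise Spearman correlation:

```python
def kendall_w(rankings):
    """Kendall's coefficient of concordance for m untied rankings of n objects.

    Uses the standard no-ties formula W = 12*S / (m**2 * (n**3 - n)), where S
    is the sum of squared deviations of the n rank sums from their mean
    m * (n + 1) / 2.
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def spearman(a, b):
    """Spearman correlation of two untied rankings (via the d-squared formula)."""
    n = len(a)
    return 1 - 6 * sum((x - y) ** 2 for x, y in zip(a, b)) / (n ** 3 - n)

# Hypothetical example: m = 3 judges each rank the same n = 5 objects.
ranks = [[1, 2, 3, 4, 5],
         [2, 1, 4, 3, 5],
         [1, 3, 2, 5, 4]]
m, n = len(ranks), len(ranks[0])
w = kendall_w(ranks)                # about 0.756: strong agreement
chi2 = m * (n - 1) * w              # Friedman's chi-square (Equation 5)
f_stat = (m - 1) * w / (1 - w)      # F statistic (Equation 6)

# Equation 4: W = ((m - 1) * mean_rs + 1) / m, with mean_rs the mean of the
# pairwise Spearman correlations, reproduces w exactly when there are no ties.
pairs = [(0, 1), (0, 2), (1, 2)]
mean_rs = sum(spearman(ranks[i], ranks[j]) for i, j in pairs) / len(pairs)
```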
mean (r̄S) of the pairwise Spearman correlations using the following relationship:

W = ((m − 1)r̄S + 1) / m,   (4)

where m is the number of variables (judges) among which Spearman correlations are computed. Equation 4 is strictly true for untied observations only; for tied observations, ties are handled in a bivariate way in each Spearman rS coefficient, whereas in Kendall's W the correction for ties is computed in a single equation (Equation 3) for all variables. For two variables (judges) only, W is simply a linear transformation of rS: W = (rS + 1)/2. In that case, a permutation test of W for two variables is the exact equivalent of a permutation test of rS for the same variables.

The relationship described by Equation 4 clearly limits the domain of application of the coefficient of concordance to variables that are all meant to estimate the same general property of the objects: variables are considered concordant only if their Spearman correlations are positive. Two variables that give perfectly opposite ranks to a set of objects have a Spearman correlation of −1, hence W = 0 for these two variables (Equation 4); this is the lower bound of the coefficient of concordance. For two variables only, rS = 0 gives W = 0.5. So coefficient W applies well to rankings given by a panel of judges called in to assess overall performance in sports or quality of wines or food in restaurants, to rankings obtained from criteria used in quality tests of appliances or services by consumer organizations, and so forth. It does not apply, however, to variables used in multivariate analysis in which negative as well as positive relationships are informative. Jerrold H. Zar, for example, uses wing length, tail length, and bill length of birds to illustrate the use of the coefficient of concordance. These data are appropriate for W because they are all indirect measures of a common property, the size of the birds.

In ecological applications, one can use the abundances of various species as indicators of the good or bad environmental quality of the study sites. If a group of species is used to produce a global index of the overall quality (good or bad) of the environment at the study sites, only the species that are significantly associated and positively correlated to one another should be included in the index, because different groups of species may be associated to different environmental conditions.

Testing the Significance of W

Friedman's chi-square statistic is obtained from W by the formula

χ² = m(n − 1)W.   (5)

This quantity is asymptotically distributed like chi-square with ν = (n − 1) degrees of freedom; it can be used to test W for significance. According to Kendall and Babington Smith, this approach is satisfactory only for moderately large values of m and n. Sidney Siegel and N. John Castellan Jr. recommend the use of a table of critical values for W when n ≤ 7 and m ≤ 20; otherwise, they recommend testing the chi-square statistic (Equation 5) using the chi-square distribution. Their table of critical values of W for small n and m is derived from a table of critical values of S assembled by Friedman using the z test of Kendall and Babington Smith and reproduced in Kendall's classic monograph, Rank Correlation Methods.

Using numerical simulations, Pierre Legendre compared results of the classical chi-square test of the chi-square statistic (Equation 5) to the permutation test that Siegel and Castellan also recommend for small samples (small n). The simulation results showed that the classical chi-square test was too conservative for any sample size (n) when the number of variables m was smaller than 20; the test had rejection rates well below the significance level, so it remained valid. The classical chi-square test had a correct level of Type I error (rejecting a null hypothesis that is true) for 20 variables and more. The permutation test had a correct rate of Type I error for all values of m and n. The power of the permutation test was higher than that of the classical chi-square test because of the differences in rates of Type I error between the two tests. The differences in power disappeared asymptotically as the number of variables increased.

An alternative approach is to compute the following F statistic:

F = (m − 1)W / (1 − W),   (6)

which is asymptotically distributed like F with ν1 = (n − 1) − (2/m) and ν2 = ν1(m − 1) degrees
of freedom. Kendall and Babington Smith described this approach using a Fisher z transformation of the F statistic, z = 0.5 logₑ(F). They recommended it for testing W for moderate values of m and n. Numerical simulations show, however, that this F statistic has correct levels of Type I error for any value of n and m.

In permutation tests of Kendall's W, the objects are the permutable units under the null hypothesis (the objects are sites in Table 1). For the global test of significance, the rank values in all variables are permuted at random, independently from variable to variable because the null hypothesis is the independence of the rankings produced by all variables. The alternative hypothesis is that at least one of the variables is concordant with one, or with some, of the other variables. Actually, for permutation testing, the four statistics SSR (Equation 1), W (Equation 2), χ² (Equation 5), and F (Equation 6) are monotonic to one another since n and m, as well as T, are constant within a given permutation test; thus they are equivalent statistics for testing, producing the same permutational probabilities. The test is one-tailed because it recognizes only positive associations between vectors of ranks. This may be seen if one considers two vectors with exactly opposite rankings: They produce a Spearman statistic of −1, hence a value of zero for W (Equation 4).

Many of the problems subjected to Kendall's concordance analysis involve fewer than 20 variables. The chi-square test should be avoided in these cases. The F test (Equation 6), as well as the permutation test, can safely be used with all values of m and n.

Contributions of Individual Variables to Kendall's Concordance

The overall permutation test of W suggests a way of testing a posteriori the significance of the contributions of individual variables to the overall concordance to determine which of the individual variables are concordant with one or several other variables in the group. There is interest in several fields in identifying discordant variables or judges. This includes all fields that use panels of judges to assess the overall quality of the objects or subjects under study (sports, law, consumer protection, etc.). In other types of studies, scientists are interested in identifying variables that agree in their estimation of a common property of the objects. This is the case in environmental studies in which scientists are interested in identifying groups of concordant species that are indicators of some property of the environment and can be combined into indices of its quality, in particular in situations of pollution or contamination.

The contribution of individual variables to the W statistic can be assessed by a permutation test proposed by Legendre. The null hypothesis is the monotonic independence of the variable subjected to the test, with respect to all the other variables in the group under study. The alternative hypothesis is that this variable is concordant with other variables in the set under study, having similar rankings of values (one-tailed test). The statistic W can be used directly in a posteriori tests. Contrary to the global test, only the variable under test is permuted here. If that variable has values that are monotonically independent of the other variables, permuting its values at random should have little influence on the W statistic. If, on the contrary, it is concordant with one or several other variables, permuting its values at random should break the concordance and induce a noticeable decrease on W.

Two specific partial concordance statistics can also be used in a posteriori tests. The first one is the mean, r̄j, of the pairwise Spearman correlations between variable j under test and all the other variables. The second statistic, Wj, is obtained by applying Equation 4 to r̄j instead of r̄S, with m the number of variables in the group. These two statistics are shown in Table 2 for the example data; r̄j and Wj are monotonic to each other because m is constant in a given permutation test. Within a given a posteriori test, W is also monotonic to Wj because only the values related to variable j are permuted when testing variable j. These three statistics are thus equivalent for a posteriori permutation tests, producing the same permutational probabilities. Like r̄j, Wj can take negative values; this is not the case of W.

There are advantages to performing a single a posteriori test for variable j instead of (m − 1) tests of the Spearman correlation coefficients between variable j and all the other variables: The tests of the (m − 1) correlation coefficients would
Table 2 Results of (a) the Overall and (b) the A Posteriori Tests of Concordance Among the Four Species of Table 1; (c) Overall and (d) A Posteriori Tests of Concordance Among Three Species

(a) Overall test of W statistic, four species. H0: The four species are not concordant with one another.
Kendall's W = 0.44160 | Permutational p value = .0448*
F statistic = 2.37252 | F distribution p value = .0440*
Friedman's chi-square = 15.89771 | Chi-square distribution p value = .0690

(b) A posteriori tests, four species. H0: This species is not concordant with the other three.
Species | r̄j | Wj | p Value | Corrected p | Decision at α = 5%
Species 13 | 0.32657 | 0.49493 | .0766 | .1532 | Do not reject H0
Species 14 | 0.39655 | 0.54741 | .0240 | .0720 | Do not reject H0
Species 15 | 0.45704 | 0.59278 | .0051 | .0204* | Reject H0
Species 23 | −0.16813 | 0.12391 | .7070 | .7070 | Do not reject H0

(c) Overall test of W statistic, three species. H0: The three species are not concordant with one another.
Kendall's W = 0.78273 | Permutational p value = .0005*
F statistic = 7.20497 | F distribution p value = .0003*
Friedman's chi-square = 21.13360 | Chi-square distribution p value = .0121*

(d) A posteriori tests, three species. H0: This species is not concordant with the other two.
Species | r̄j | Wj | p Value | Corrected p | Decision at α = 5%
Species 13 | 0.69909 | 0.79939 | .0040 | .0120* | Reject H0
Species 14 | 0.59176 | 0.72784 | .0290 | .0290* | Reject H0
Species 15 | 0.73158 | 0.82105 | .0050 | .0120* | Reject H0

Source: (a) and (b): Adapted from Legendre, P. (2005). Species associations: The Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, and Environmental Statistics, 10, 233. Reprinted with permission from the Journal of Agricultural, Biological and Environmental Statistics. Copyright 2005 by the American Statistical Association. All rights reserved.

Notes: r̄j = mean of the Spearman correlations with the other species; Wj = partial concordance per species; p value = permutational probability (9,999 random permutations); corrected p = Holm-corrected p value. * = Reject H0 at α = .05.
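As a numeric check on Table 2: Equation 4 links each r̄j to its Wj, and Equations 5 and 6 link the overall W of Table 2a to the reported χ² and F (m = 4 species, n = 10 sites):

```python
def w_from_mean_spearman(r_bar, m):
    """Equation 4: W = ((m - 1) * r_bar + 1) / m."""
    return ((m - 1) * r_bar + 1) / m

# Partial statistics of Table 2b (m = 4): each W_j follows from its r_bar_j.
r_bar_j = [0.32657, 0.39655, 0.45704, -0.16813]   # Species 13, 14, 15, 23
w_j = [w_from_mean_spearman(r, 4) for r in r_bar_j]
# w_j reproduces the W_j column (0.49493, 0.54741, 0.59278, 0.12391) to rounding.

# Overall statistics of Table 2a: W = 0.44160 with m = 4 and n = 10.
W = 0.44160
chi2 = 4 * (10 - 1) * W           # ~15.898, Friedman's chi-square (Equation 5)
f_stat = (4 - 1) * W / (1 - W)    # ~2.3725, F statistic (Equation 6)
```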
have to be corrected for multiple testing, and they could provide discordant information; a single test of the contribution of variable j to the W statistic has greater power and provides a single, clearer answer. In order to preserve a correct or approximately correct experimentwise error rate, the probabilities of the a posteriori tests computed for all species in a group should be adjusted for multiple testing.

A posteriori tests are useful for identifying the variables that are not concordant with the others, as in the examples, but they do not tell us whether there are one or several groups of congruent variables among those for which the null hypothesis of independence is rejected. This information can be obtained by computing Spearman correlations among the variables and clustering them into groups of variables that are significantly and positively correlated.

The example data are analyzed in Table 2. The overall permutational test of the W statistic is significant at α = 5%, but marginally (Table 2a). The cause appears when examining the a posteriori tests in Table 2b: Species 23 has a negative mean correlation with the three other species in the group (r̄j = −.168). This indicates that Species 23 does not belong in that group. Were we analyzing a large group of variables, we could look at the next partition in an agglomerative clustering dendrogram, or the next K-means partition, and proceed to the overall and a posteriori tests for the members of these new groups. In the present illustrative example, Species 23 clearly differs from the other three species. We can now test Species 13, 14, and 15 as a group. Table 2c shows that this group has a highly significant concordance, and all individual species contribute significantly to the overall concordance of their group (Table 2d).
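The a posteriori procedure described above (permute only the variable under test, then recompute its partial statistic r̄j) can be sketched as follows; the rankings are hypothetical and assumed untied, not the data of Table 1:

```python
import random

def spearman(a, b):
    """Spearman correlation of two untied rankings of n objects."""
    n = len(a)
    return 1 - 6 * sum((x - y) ** 2 for x, y in zip(a, b)) / (n ** 3 - n)

def r_bar_j(rankings, j):
    """Mean pairwise Spearman correlation of variable j with the others."""
    m = len(rankings)
    return sum(spearman(rankings[j], rankings[k])
               for k in range(m) if k != j) / (m - 1)

def a_posteriori_test(rankings, j, n_perm=999, seed=42):
    """One-tailed permutation test of H0: variable j is independent of the others.

    Only variable j's ranks are permuted; the p value is the fraction of
    statistics at least as large as the observed one, counting the observed
    value itself, as is customary in permutation testing.
    """
    rng = random.Random(seed)
    obs = r_bar_j(rankings, j)
    perm = [r[:] for r in rankings]
    hits = 1                        # include the observed statistic
    for _ in range(n_perm):
        rng.shuffle(perm[j])
        if r_bar_j(perm, j) >= obs:
            hits += 1
    return obs, hits / (n_perm + 1)

# Hypothetical data: variables 0 and 1 agree; variable 2 ranks in near reverse.
ranks = [[1, 2, 3, 4, 5, 6],
         [2, 1, 3, 4, 6, 5],
         [6, 5, 4, 3, 1, 2]]
obs2, p2 = a_posteriori_test(ranks, j=2)   # obs2 < 0: a discordant variable
```

With a negative observed r̄j, virtually all permutations reach the observed value, so the one-tailed p value is large and H0 is not rejected for the discordant variable, matching the logic of Table 2b.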
In Table 2a and 2c, the F test results are concordant with the permutation test results, but due to small m and n, the chi-square test lacks power.

Discussion

The Kendall coefficient of concordance can be used to assess the degree to which a group of variables provides a common ranking for a set of objects. It should be used only to obtain a statement about variables that are all meant to measure the same general property of the objects. It should not be used to analyze sets of variables in which the negative and positive correlations have equal importance for interpretation. When the null hypothesis is rejected, one cannot conclude that all variables are concordant with one another, as shown in Table 2 (a) and (b); only that at least one variable is concordant with one or some of the others.

The partial concordance coefficients and a posteriori tests of significance are essential complements of the overall test of concordance. In several fields, there is interest in identifying discordant variables; this is the case in all fields that use panels of judges to assess the overall quality of the objects under study (e.g., sports, law, consumer protection). In other applications, one is interested in using the sum of ranks, or the sum of values, provided by several variables or judges, to create an overall indicator of the response of the objects under study. It is advisable to look for one or several groups of variables that rank the objects broadly in the same way, using clustering, and then carry out a posteriori tests on the putative members of each group. Only then can their values or ranks be pooled into an overall index.

Pierre Legendre

See also Friedman Test; Holm's Sequential Bonferroni Procedure; Spearman Rank Order Correlation

Further Readings

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675–701.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.
Kendall, M. G. (1948). Rank correlation methods (1st ed.). London: Charles Griffin.
Kendall, M. G., & Babington Smith, B. (1939). The problem of m rankings. Annals of Mathematical Statistics, 10, 275–287.
Legendre, P. (2005). Species associations: The Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, & Environmental Statistics, 10, 226–245.
Zar, J. H. (1999). Biostatistical analysis (4th ed.). Upper Saddle River, NJ: Prentice Hall.

COEFFICIENT OF VARIATION

The coefficient of variation measures the variability of a series of numbers independent of the unit of measurement used for these numbers. In order to do so, the coefficient of variation eliminates the unit of measurement of the standard deviation of a series of numbers by dividing the standard deviation by the mean of these numbers. The coefficient of variation can be used to compare distributions obtained with different units, such as the variability of the weights of newborns (measured in grams) with the size of adults (measured in centimeters). The coefficient of variation is meaningful only for measurements with a real zero (i.e., "ratio scales") because the mean is meaningful (i.e., unique) only for these scales. So, for example, it would be meaningless to compute the coefficient of variation of the temperature measured in degrees Fahrenheit, because changing the measurement to degrees Celsius will not change the temperature but will change the value of the coefficient of variation (because the value of zero for Celsius is 32 for Fahrenheit, and therefore the mean of the temperature will change from one scale to the other). In addition, the values of the measurement used to compute the coefficient of variation are assumed to be always positive or null. The coefficient of variation is primarily a descriptive statistic, but it is amenable to statistical inferences such as null hypothesis testing or confidence intervals. Standard procedures are often very dependent on the normality assumption,
and current work is exploring alternative procedures that are less dependent on this normality assumption.

Definition and Notation

The coefficient of variation, denoted Cν (or occasionally V), eliminates the unit of measurement from the standard deviation of a series of numbers by dividing it by the mean of this series of numbers. Formally, if, for a series of N numbers, the standard deviation and the mean are denoted respectively by S and M, the coefficient of variation is computed as

Cν = S / M.   (1)

Often the coefficient of variation is expressed as a percentage, which corresponds to the following formula for the coefficient of variation:

Cν = (S × 100) / M.   (2)

This last formula can be potentially misleading because, as shown later, the value of the coefficient of variation can exceed 1 and therefore would create percentages larger than 100. In that case, Formula 1, which expresses Cν as a ratio rather than a percentage, should be used.

Range

In a finite sample of N nonnegative numbers with a real zero, the coefficient of variation can take a value between 0 and √(N − 1) (the maximum value of Cν is reached when all values but one are equal to zero).

Estimation of a Population Coefficient of Variation

The coefficient of variation computed on a sample is a biased estimate of the population coefficient of variation, denoted γν. An unbiased estimate of the population coefficient of variation, denoted Ĉν, is computed as

Ĉν = (1 + 1/(4N)) Cν   (3)

(where N is the sample size).

Testing the Coefficient of Variation

When the coefficient of variation is computed on a sample drawn from a normal population, its standard error, denoted σCν, is known and is equal to

σCν = γν / √(2N).   (4)

When γν is not known (which is, in general, the case), σCν can be estimated by replacing γν by its estimation from the sample. Either Cν or Ĉν can be used for this purpose (Ĉν being preferable because it is a better estimate). So σCν can be estimated as

SCν = Cν / √(2N)  or  ŜCν = Ĉν / √(2N).   (5)

Therefore, under the assumption of normality, the statistic

tCν = (Cν − γν) / SCν   (6)

follows a Student distribution with ν = N − 1 degrees of freedom. It should be stressed that this test is very sensitive to the normality assumption. Work is still being done to minimize the effect of this assumption.

If Equation 6 is rewritten, confidence intervals can be computed as

Cν ± tα,ν SCν.   (7)
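These quantities can be sketched in code, checked against the figures of the Example section that follows (M = $200, S = $40, N = 10; the critical value 2.26 is the entry's Student's t for α = .05 and 9 degrees of freedom); the upper-bound check uses the population (divide-by-N) standard deviation:

```python
import math

def unbiased_cv(cv, n):
    """Equation 3: C-hat = (1 + 1/(4N)) * Cv."""
    return (1 + 1 / (4 * n)) * cv

def se_cv(cv, n):
    """Equation 5: S_Cv = Cv / sqrt(2N)."""
    return cv / math.sqrt(2 * n)

N = 10
cv = 40 / 200                  # Equation 1 -> 0.200
cv_hat = unbiased_cv(cv, N)    # Equation 3 -> 0.205
se = se_cv(cv, N)              # Equation 5 -> ~0.0447
t = (cv - 0) / se              # Equation 6 under H0: gamma = 0 -> ~4.47
ci_low = cv - 2.26 * se        # Equation 7 -> ~0.099
ci_high = cv + 2.26 * se       # Equation 7 -> ~0.301

# Range: with all values but one equal to zero, Cv attains sqrt(N - 1).
vals = [0.0] * (N - 1) + [5.0]
m_ = sum(vals) / N
s_ = math.sqrt(sum((x - m_) ** 2 for x in vals) / N)   # population SD
cv_max = s_ / m_               # sqrt(9) = 3 for N = 10
```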
(with tα,ν being the critical value of Student's t for the chosen α level and for ν = N − 1 degrees of freedom). Again, because Ĉν is a better estimation of γν than Cν is, it makes sense to use Ĉν rather than Cν.

Example

Table 1 lists the daily commission in dollars of 10 car salespersons. The mean commission is equal to $200, with a standard deviation of $40. This gives a value of the coefficient of variation of

Cν = S / M = 40 / 200 = 0.200,   (8)

which corresponds to a population estimate of

Ĉν = (1 + 1/(4N)) Cν = (1 + 1/(4 × 10)) × 0.200 = 0.205.   (9)

The standard error of the coefficient of variation is estimated as

SCν = Cν / √(2N) = 0.200 / √(2 × 10) = 0.0447   (10)

(the value of ŜCν is equal to 0.0458).

A t criterion testing the hypothesis that the population value of the coefficient of variation is equal to zero is equal to

tCν = (Cν − γν) / SCν = 0.2000 / 0.0447 = 4.47.   (11)

This value of tCν = 4.47 is larger than the critical value of tα,ν = 2.26 (which is the critical value of a Student's t distribution for α = .05 and ν = 9 degrees of freedom). Therefore, we can reject the null hypothesis and conclude that γν is larger than zero. A 95% corresponding confidence interval gives the values of

Cν ± tα,ν SCν = 0.2000 ± 2.26 × 0.0447 = 0.200 ± 0.1011,   (12)

and therefore we conclude that there is a probability of .95 that the value of γν lies in the interval [0.0989, 0.3011].

Hervé Abdi

See also Mean; Standard Deviation; Variability, Measure of; Variance

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Curto, J. D., & Pinto, J. C. (2009). The coefficient of variation asymptotic distribution in the case of non-iid random variables. Journal of Applied Statistics, 36, 21–32.
Martin, J. D., & Gray, L. N. (1971). Measurement of relative variation: Sociological examples. American Sociological Review, 36, 496–502.
Nairy, K. S., & Rao, K. N. (2003). Tests of coefficients of variation of normal populations. Communications in Statistics: Simulation and Computation, 32, 641–661.
Sokal, R. R., & Rohlf, F. J. (1995). Biometry (3rd ed.). New York: Freeman.

COEFFICIENTS OF CORRELATION, ALIENATION, AND DETERMINATION

The coefficient of correlation evaluates the similarity of two sets of measurements (i.e., two dependent variables) obtained on the same observations. The coefficient of correlation indicates the amount of information common to the two variables. This coefficient takes values between −1 and +1 (inclusive). A value of +1 shows that the two series of measurements are measuring the same thing. A value of −1 indicates that the two measurements are measuring the same thing, but one measurement varies inversely to the other. A value of 0 indicates that the two series of measurements have nothing in common. It is important to note that the coefficient of correlation measures only
the linear relationship between two variables and that its value is very sensitive to outliers.

The squared correlation gives the proportion of common variance between two variables and is also called the coefficient of determination. Subtracting the coefficient of determination from unity gives the proportion of variance not shared between two variables. This quantity is called the coefficient of alienation.

The significance of the coefficient of correlation can be tested with an F or a t test. This entry presents three different approaches that can be used to obtain p values: (1) the classical approach, which relies on Fisher's F distributions; (2) the Monte Carlo approach, which relies on computer simulations to derive empirical approximations of sampling distributions; and (3) the nonparametric permutation (also known as randomization) test, which evaluates the likelihood of the actual data against the set of all possible configurations of these data. In addition to p values, confidence intervals can be computed using Fisher's Z transform or the more modern, computationally based, and nonparametric Efron's bootstrap.

Note that the coefficient of correlation always overestimates the intensity of the correlation in the population and needs to be "corrected" in order to provide a better estimation. The corrected value is called shrunken or adjusted.

Notations and Definition

Suppose we have S observations, and for each observation s, we have two measurements, denoted Ws and Ys, with respective means denoted MW and MY. For each observation, we define the cross-product as the product of the deviations of each variable from its mean. The sum of these cross-products, denoted SCPWY, is computed as

SCPWY = Σs (Ws − MW)(Ys − MY).   (1)

The sum of the cross-products reflects the association between the variables. When the deviations have the same sign, they indicate a positive relationship, and when they have different signs, they indicate a negative relationship.

The average value of the SCPWY is called the covariance (just like the variance, the covariance can be computed by dividing by S or by S − 1):

covWY = SCP / Number of Observations = SCP / S.   (2)

The covariance reflects the association between the variables, but it is expressed in the original units of measurement. In order to eliminate the units, the covariance is normalized by division by the standard deviation of each variable. This defines the coefficient of correlation, denoted rW·Y, which is equal to

rW·Y = covWY / (σW σY).   (3)

Rewriting the previous formula gives a more practical formula:

rW·Y = SCPWY / √(SSW SSY),   (4)

where SCP is the sum of the cross-products and SSW and SSY are the sums of squares of W and Y, respectively.

Correlation Computation: An Example

The computation for the coefficient of correlation is illustrated with the following data, describing the values of W and Y for S = 6 subjects:

W1 = 1, W2 = 3, W3 = 4, W4 = 4, W5 = 5, W6 = 7
Y1 = 16, Y2 = 10, Y3 = 12, Y4 = 4, Y5 = 8, Y6 = 10.

Step 1. Compute the sum of the cross-products. First compute the means of W and Y:

MW = (1/S) Σs Ws = 24/6 = 4 and MY = (1/S) Σs Ys = 60/6 = 10.

The sum of the cross-products is then equal to
Coefficients of Correlation, Alienation, and Determination 173
P
X
S ðYs MY ÞðWs MW Þ
SCPWY ¼ ðYs MY ÞðWs MW Þ s SCPWY
rW:Y ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
s SSY × SSW SSW SSY
¼ ð16 10Þð1 4Þ 20 20 20
¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ pffiffiffiffiffiffiffiffiffiffiffi ¼
þ ð10 10Þð3 4Þ 80 × 20 1600 40
þ ð12 10Þð4 4Þ ¼ :5:
þ ð4 10Þð4 4Þ ð8Þ
þ ð8 10Þð5 4Þ ð5Þ
þ ð10 10Þð7 4Þ This value of r ¼ .5 can be interpreted as an
indication of a negative linear relationship between
¼ ð6 × 3Þð0 × 1Þ W and Y.
þ ð2 × 0Þð6 × 0Þ
þ ð2 × 1Þ þ ð0 × 3Þ
Properties of the Coefficient of Correlation
¼ 18 þ 0 þ 0 þ 0 2 þ 0
¼ 20: The coefficient of correlation is a number without
unit. This occurs because dividing the units of the
numerator by the same units in the denominator
Step 2. Compute the sums of squares. The sum
eliminates the units. Hence, the coefficient of cor-
of squares of Ws is obtained as
relation can be used to compare different studies
X
S performed using different variables.
SSW ¼ ðWs MW Þ2 The magnitude of the coefficient of correlation
s¼1 is always smaller than or equal to 1. This happens
2
¼ ð1 4Þ þ ð3 4Þ þ ð4 4Þ
2 2 because the numerator of the coefficient of correla-
tion (see Equation 4) is always smaller than or
þ ð4 4Þ2 þ ð5 4Þ2 þ ð7 4Þ2 equal to its denominator (this property follows
¼ ð3Þ2 þ ð1Þ2 þ 02 þ 02 ð6Þ from the Cauchy–Schwartz inequality). A coeffi-
cient of correlation that is equal to þ 1 or 1
þ 12 þ 32
indicates that the plot of the observations will have
¼ 9þ1þ0þ0þ1þ9 all observations positioned on a line.
¼ 18 þ 0 þ 0 þ 0 2 þ 0 The squared coefficient of correlation gives the
¼ 20: proportion of common variance between two
variables. It is also called the coefficient of deter-
The sum of squares of Ys is mination. In our example, the coefficient of deter-
mination is equal to r2WY ¼ :25: The proportion
X
S
of variance not shared between the variables is
SSY ¼ ðYs MY Þ2
s¼1
called the coefficient of alienation, and for our
example, it is equal to 1 r2WY ¼ :75:
¼ ð16 10Þ2 þ ð10 10Þ2
þ ð12 10Þ2 þ ð4 10Þ2 þ ð8 10Þ2
Interpreting Correlation
þ ð10 10Þ2
Linear and Nonlinear Relationship
¼ 62 þ 02 þ 22 þ ð6Þ2 þ ð2Þ2 þ 02
¼ 36 þ 0 þ 4 þ 36 þ 4 þ 0 The coefficient of correlation measures only lin-
ear relationships between two variables and will
¼ 80: miss nonlinear relationships. For example, Figure 1
ð7Þ displays a perfect nonlinear relationship between
two variables (i.e., the data show a U-shaped rela-
Step 3. Compute rW:Y . The coefficient of corre- tionship with Y being proportional to the square of
lation between W and Y is equal to W), but the coefficient of correlation is equal to 0.
Figure 1 A Perfect Nonlinear Relationship With a 0 Correlation (rW·Y = 0)

Figure 2 The Dangerous Effect of Outliers on the Value of the Coefficient of Correlation
Notes: The correlation of the set of points represented by the circles is equal to −.87. When the point represented by the diamond is added to the set, the correlation is now equal to +.61, which shows that an outlier can determine the value of the coefficient of correlation.

Effect of Outliers

Observations far from the center of the distribution contribute a lot to the sum of the cross-products. In fact, as illustrated in Figure 2, a single extremely deviant observation (often called an outlier) can dramatically influence the value of r.

Geometric Interpretation

Each set of observations can also be seen as a vector in an S-dimensional space (one dimension per observation). Within this framework, the correlation is equal to the cosine of the angle between the two vectors after they have been centered by subtracting their respective mean. For example, a coefficient of correlation of r = −.50 corresponds to a 120-degree angle (whose cosine is −.50). A coefficient of correlation of 0 corresponds to a right angle, and therefore two uncorrelated variables are called orthogonal (which is derived from the Greek word for right angle).

Correlation and Causation

The fact that two variables are correlated does not mean that one variable causes the other one: Correlation is not causation. For example, in France, the number of Catholic churches in a city, as well as the number of schools, is highly correlated with the number of cases of cirrhosis of the liver, the number of teenage pregnancies, and the number of violent deaths. Does this mean that churches and schools are sources of vice and that newborns are murderers? Here, in fact, the observed correlation is due to a third variable, namely the size of the cities: the larger a city, the larger the number of churches, schools, alcoholics, and so forth. In this example, the correlation between number of churches or schools and alcoholics is called a spurious correlation because it reflects only their mutual correlation with a third variable (i.e., size of the city).

Testing the Significance of r

A null hypothesis test for r can be performed using an F statistic obtained as

F = (r² / (1 − r²)) × (S − 2).   (9)
Figure 3 The Fisher Distribution for ν1 = 1 and ν2 = 4, Along With α = .05

Note: Critical value of F = 7.7086.

For our example, we find that

F = [.25 / (1 − .25)] × (6 − 2) = (.25 / .75) × 4 = (1/3) × 4 = 4/3 = 1.33.

In order to perform a statistical test, the next step is to evaluate the sampling distribution of F. This sampling distribution provides the probability of finding any given value of the F criterion (i.e., the p value) under the null hypothesis (i.e., when there is no correlation between the variables). If this p value is smaller than the chosen level (e.g., .05 or .01), then the null hypothesis can be rejected, and r is considered significant. The problem of finding the p value can be addressed in three ways: (1) the classical approach, which uses Fisher's F distributions; (2) the Monte Carlo approach, which generates empirical probability distributions; and (3) the (nonparametric) permutation test, which evaluates the likelihood of the actual configuration of results among all other possible configurations of results.

Classical Approach

In order to analytically derive the sampling distribution of F, several assumptions need to be made: (a) the error of measurement is added to the true measure; (b) the error is independent of the measure; and (c) the error is normally distributed, with a mean of zero and a variance of σ²e. When these assumptions hold and when the null hypothesis is true, the F statistic is distributed as a Fisher's F with ν1 = 1 and ν2 = S − 2 degrees of freedom. (Incidentally, an equivalent test can be performed using t = √F, which is distributed under H0 as a Student's t distribution with ν = S − 2 degrees of freedom.)

Figure 4 Histogram of Values of r² and F Computed From 1,000 Random Samples When the Null Hypothesis Is True

Notes: The histograms show the empirical distribution of F and r² under the null hypothesis.

For our example, the Fisher distribution shown in Figure 3 has ν1 = 1 and ν2 = S − 2 = 6 − 2 = 4 and gives the sampling distribution of F. The use of this distribution will show that the probability of finding a value of F = 1.33 under H0 is equal to p ≈ .313 (most statistical packages will routinely provide this value). Such a p value does not lead to rejecting H0 at the usual levels of α = .05 or α = .01. An equivalent way of performing a test uses critical values that correspond to values of F whose p value is equal to a given α level. For our example, the critical value (found in tables available in most standard textbooks) for α = .05 is equal to F(1, 4) = 7.7086. Any F with a value larger than the critical value leads to rejection of the null hypothesis at the chosen α level, whereas an F value smaller than the critical value leads one to fail to reject the null hypothesis. For our example, because F = 1.33 is smaller than the critical value of 7.7086, we cannot reject the null hypothesis.
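The computation in Equation 9 and the critical-value decision rule amount to a few lines of code (a sketch; the α = .05 critical value F(1, 4) = 7.7086 is taken from the text, since computing the exact p value of .313 would require an F-distribution routine not shown here):

```python
# F test for a correlation (Equation 9): F = r^2 / (1 - r^2) * (S - 2).
r2, S = 0.25, 6
F = r2 / (1 - r2) * (S - 2)
print(round(F, 2))       # -> 1.33
# Compare with the alpha = .05 critical value F(1, 4) = 7.7086 quoted above.
print(F > 7.7086)        # -> False: we fail to reject the null hypothesis
```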
Monte Carlo Approach

A modern alternative to the analytical derivation of the sampling distribution is to empirically obtain the sampling distribution of F when the null hypothesis is true. This approach is often called a Monte Carlo approach.

With the Monte Carlo approach, we generate a large number of random samples of observations (e.g., 1,000 or 10,000) and compute r and F for each sample. In order to generate these samples, we need to specify the shape of the population from which these samples are obtained. Let us use a normal distribution (this makes the assumptions of the Monte Carlo approach equivalent to the assumptions of the classical approach). The frequency distribution of these randomly generated samples provides an estimation of the sampling distribution of the statistic of interest (i.e., r or F). For our example, Figure 4 shows the histogram of the values of r² and F obtained for 1,000 random samples of 6 observations each. The horizontal axes represent the different values of r² (top panel) and F (bottom panel) obtained for the 1,000 trials, and the vertical axis the number of occurrences of each value of r² and F. For example, the top panel shows that 160 samples (of the 1,000 trials) had a value of r² between 0 and .01 (this corresponds to the first bar of the histogram in Figure 4).

Figure 4 shows that the number of occurrences of a given value of r² and F decreases as an inverse function of their magnitude: The greater the value, the less likely it is to obtain it when there is no correlation in the population (i.e., when the null hypothesis is true). However, Figure 4 also shows that the probability of obtaining a large value of r² or F is not null. In other words, even when the null hypothesis is true, very large values of r² and F can be obtained.

From now on, this entry focuses on the F distribution, but everything also applies to the r² distribution. After the sampling distribution has been obtained, the Monte Carlo procedure follows the same steps as the classical approach. Specifically, if the p value for the criterion is smaller than the chosen α level, the null hypothesis can be rejected. Equivalently, a value of F larger than the α-level critical value leads one to reject the null hypothesis for this α level.

For our example, we find that 310 random samples (out of 1,000) had a value of F larger than F = 1.33, and this corresponds to a probability of p = .310 (compare with a value of p = .313 for the classical approach). Because this p value is not smaller than α = .05, we cannot reject the null hypothesis. Using the critical-value approach leads to the same decision. The empirical critical value for α = .05 is equal to 7.5500 (see Figure 4). Because the computed value of F = 1.33 is not larger than 7.5500, we do not reject the null hypothesis.

Permutation Tests

For both the Monte Carlo and the traditional (i.e., Fisher) approaches, we need to specify the shape of the distribution under the null hypothesis. The Monte Carlo approach can be used with any distribution (but we need to specify which one we want), and the classical approach assumes a normal distribution. An alternative way to look at a null hypothesis test is to evaluate whether the pattern of results for the experiment is a rare event by comparing it to all the other patterns of results that could have arisen from these data. This is called a permutation test or sometimes a randomization test.

This nonparametric approach originated with Student and Ronald Fisher, who developed the (now standard) F approach because it was possible then to compute one F but very impractical to compute all the Fs for all possible permutations. If Fisher could have had access to modern computers, it is likely that permutation tests would be the standard procedure.

So, in order to perform a permutation test, we need to evaluate the probability of finding the value of the statistic of interest (e.g., r or F) that we have obtained, compared with all the values we could have obtained by permuting the values of the sample. For our example, we have six observations, and therefore there are

6! = 6 × 5 × 4 × 3 × 2 = 720

different possible patterns of results. Each of these patterns corresponds to a given permutation of the data.

[Figure (caption lost in reproduction): histograms of the values of r² (top panel) and F (bottom panel) for the permutation test with ν1 = 1 and ν2 = 4; the F panel marks F = 1.33, p = .306, α = .05, and Fcritical = 7.7086.]

When the number of observations is small (as is the case for this example with six observations), it is possible to compute all the possible permutations. In this case we have an exact permutation test. But the number of permutations grows very fast when the number of observations increases. For example, with 20 observations the total number of permutations is close to 2.4 × 10^18 (this is a very big number!). Such large numbers obviously prohibit computing all the permutations. Therefore, for samples of large size, we approximate the permutation test by using a large number (say 10,000 or 100,000) of random permutations (this approach is sometimes called a Monte Carlo permutation test).
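An exact permutation test along these lines can be sketched as follows. Because this excerpt does not reproduce the W scores of the running example, the x values below are hypothetical; only the Y values come from the example:

```python
# Exact permutation test for a correlation: enumerate all 6! = 720 reshufflings.
from itertools import permutations
from math import sqrt

def corr(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    scp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return scp / sqrt(ssx * ssy)

x = [1, 3, 4, 6, 7, 9]        # hypothetical scores (W is not listed in this excerpt)
y = [16, 10, 12, 4, 8, 10]    # Y values of the running example
r_obs = corr(x, y)
perms = list(permutations(y))  # all equally likely patterns under the null hypothesis
count = sum(abs(corr(x, p)) >= abs(r_obs) - 1e-12 for p in perms)
p_exact = count / len(perms)   # two-sided exact p value
print(len(perms), round(p_exact, 3))
```

With 20 observations the same loop would need about 2.4 × 10^18 iterations, which is why large samples fall back on a random subset of permutations.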
for the coefficient of correlation will still be large enough to be impressive.

The problem of computing the confidence interval for r has been explored (once again) by Student and Fisher. Fisher found that the problem was not simple but that it could be simplified by transforming r into another variable called Z. This transformation, which is called Fisher's Z transform, creates a new Z variable whose sampling distribution is close to the normal distribution. Therefore, we can use the normal distribution to compute the confidence interval of Z, and this will give a lower and a higher bound for the population values of Z. Then we can transform these bounds back into r values (using the inverse Z transformation), and this gives a lower and an upper bound for the possible values of r in the population.

Fisher's Z Transform

Fisher's Z transform is applied to a coefficient of correlation r according to the following formula:

Z = (1/2)[ln(1 + r) − ln(1 − r)], (10)

where ln is the natural logarithm. The inverse transformation, which gives r from Z, is obtained with the following formula:

r = (exp{2 × Z} − 1) / (exp{2 × Z} + 1), (11)

where exp{x} means to raise the number e to the power {x} (i.e., exp{x} = e^x, and e is Euler's constant, which is approximately 2.71828). Most hand calculators can be used to compute both transformations.

Fisher showed that the new Z variable has a sampling distribution that is normal, with a mean of 0 and a variance of 1/(S − 3). From this distribution we can compute directly the upper and lower bounds of Z and then transform them back into values of r.

Example

The computation of the confidence interval for the coefficient of correlation is illustrated using the previous example, in which we computed a coefficient of correlation of r = .5 on a sample made of S = 6 observations. The procedure can be decomposed into six steps, which are detailed next.

Step 1. Before doing any computation, we need to choose an α level that will correspond to the probability of finding the population value of r in the confidence interval. Suppose we chose the value α = .05. This means that we want to obtain a confidence interval such that there is a 95% chance, or (1 − α) = (1 − .05) = .95, of having the population value be in the confidence interval that we will compute.

Step 2. Find in the table of the normal distribution the critical value corresponding to the chosen α level. Call this value Zα. The most frequently used values are

Zα=.10 = 1.645 (α = .10)
Zα=.05 = 1.960 (α = .05)
Zα=.01 = 2.575 (α = .01)
Zα=.001 = 3.291 (α = .001).

Step 3. Transform r into Z using Equation 10. For the present example, with r = .5, we find that Z = 0.5493.

Step 4. Compute a quantity called Q as

Q = Zα × √(1 / (S − 3)).

For our example we obtain

Q = Z.05 × √(1 / (6 − 3)) = 1.960 × √(1/3) = 1.1316.

Step 5. Compute the lower and upper limits for Z as

Lower Limit = Zlower = Z − Q = 0.5493 − 1.1316 = −0.5823
Upper Limit = Zupper = Z + Q = 0.5493 + 1.1316 = 1.6809.

Step 6. Transform Zlower and Zupper into rlower and rupper. This is done with the use of Equation 11. For the present example, we find that

Lower Limit = rlower = −.5243
Upper Limit = rupper = .9330.
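The six steps collapse into a short sketch once we note that Equation 10 is the inverse hyperbolic tangent (Z = atanh r) and Equation 11 is the hyperbolic tangent (r = tanh Z):

```python
# 95% confidence interval for r via Fisher's Z transform (Equations 10 and 11).
from math import atanh, tanh, sqrt

r, S, z_alpha = 0.5, 6, 1.960
Z = atanh(r)                        # Step 3: Fisher's Z transform of r
Q = z_alpha * sqrt(1 / (S - 3))     # Step 4: half-width on the Z scale
lo, hi = tanh(Z - Q), tanh(Z + Q)   # Steps 5-6: back-transform the Z limits
print(round(Z, 4), round(Q, 4))     # -> 0.5493 1.1316
print(round(lo, 4), round(hi, 4))   # -> -0.5243 0.933
```

Because the interval (−.5243, .9330) contains 0, the test based on this confidence interval does not reject the null hypothesis.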
and the upper limits of the 95% confidence interval of the population estimation of rW.Y (cf. the values obtained with Fisher's Z transform of −.5243 and .9330). Contrary to the Fisher's Z transform approach, the bootstrap limits are not dependent on assumptions about the population or its parameters (but it is comforting to see that these two approaches concur for our example). Because the value of 0 is in the confidence interval of rW.Y, we cannot reject the null hypothesis. This shows once again that the confidence interval approach provides more information than the null hypothesis approach.

Shrunken and Adjusted r

The coefficient of correlation is a descriptive statistic that always overestimates the population correlation. This problem is similar to the problem of the estimation of the variance of a population from a sample. In order to obtain a better estimate of the population correlation, the value of r needs to be corrected. The corrected value of r goes under different names: corrected r, shrunken r, or adjusted r (there are some subtle differences among these appellations, but we will ignore them here); we denote it by r̃². Several correction formulas are available. The one most often used estimates the value of the population correlation as

r̃² = 1 − [(1 − r²) × (S − 1)/(S − 2)]. (12)

For our example, this gives

r̃² = 1 − [(1 − r²) × (S − 1)/(S − 2)] = 1 − [(1 − .25) × (5/4)] = 1 − (.75 × 5/4) = 0.06.

With this formula, we find that the estimation of the population correlation drops from r = .50 to r̃ = √(r̃²) = √.06 = .24.

Particular Cases of the Coefficient of Correlation

Mostly for historical reasons, some specific cases of the coefficient of correlation have their own names (in part because these special cases lead to simplified computational formulas). Specifically, when both variables are ranks (or transformed into ranks), we obtain the Spearman rank correlation coefficient (a related transformation will provide the Kendall rank correlation coefficient); when both variables are dichotomous (i.e., they take only the values 0 and 1), we obtain the phi coefficient of correlation; and when only one of the two variables is dichotomous, we obtain the point-biserial coefficient.

Hervé Abdi and Lynne J. Williams

See also Coefficient of Concordance; Confidence Intervals

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the social sciences. Hillsdale, NJ: Lawrence Erlbaum.
Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill.
Edwards, A. L. (1985). An introduction to linear regression and correlation. New York: Freeman.
Pedhazur, E. J. (1997). Multiple regression in behavioral research. New York: Harcourt Brace.

COHEN'S d STATISTIC

Cohen's d statistic is a type of effect size. An effect size is a specific numerical nonzero value used to represent the extent to which a null hypothesis is false. As an effect size, Cohen's d is typically used to represent the magnitude of differences between two (or more) groups on a given variable, with larger values representing a greater differentiation between the two groups on that variable. When comparing means in a scientific study, the reporting of an effect size such as Cohen's d is considered complementary to the reporting of results from a test of statistical significance. Whereas the test of statistical significance is used to suggest whether a null hypothesis is true (no difference exists between Populations A and B for a specific phenomenon) or false (a difference exists between Populations A and B for a specific phenomenon), the calculation of an effect size estimate is used to represent the degree of difference between the two populations in those instances for which the null hypothesis was deemed false. In cases for which the null hypothesis is false (i.e., rejected), the results of a test of statistical significance imply that reliable differences exist between two populations on the phenomenon of interest, but test outcomes do not provide any value regarding the extent of that difference. The calculation of Cohen's d and its interpretation provide a way to estimate the actual size of observed differences between two groups, namely, whether the differences are small, medium, or large.

The population means are replaced with sample means (Ȳj), and the population standard deviation is replaced with Sp, the pooled standard deviation from the sample. The pooled standard deviation is derived by weighing the variance around each sample mean by the respective sample size.

Calculation of the Pooled Standard Deviation

Although computation of the difference in sample means is straightforward in Equation 2, the pooled standard deviation may be calculated in a number of ways. Consistent with the traditional definition of a standard deviation, this statistic may be computed as
Sp = √( Σ s²j / j ) (6)

or

Sp = √( (s²1 + s²2) / 2 ) (7)

in the case of two groups.

Other means of specifying the denominator for Equation 2 are varied. Some formulas use the average standard deviation across groups. This procedure disregards differences in sample size in cases of unequal n when one is weighing sample variances and may or may not correct for sample bias in estimation of the population standard deviation. Further formulas employ the standard deviation of the control or comparison condition (an effect size referred to as Glass's Δ). This method is particularly suited when the introduction of treatment or other experimental manipulation leads to large changes in group variance. Finally, more complex formulas are appropriate when calculating Cohen's d from data involving cluster randomized or nested research designs. The complication partially arises because of the three available variance statistics from which the pooled standard deviation may be computed: the within-cluster variance, the between-cluster variance, or the total variance (combined between- and within-cluster variance). Researchers must select the variance statistic appropriate for the inferences they wish to draw.

Expansion Beyond Two-Group Comparisons: Contrasts and Repeated Measures

by the pooled standard deviation across the j repeated measures. The same formula may also be applied to simple contrasts within repeated measures designs, as well as interaction contrasts in mixed (between- and within-subjects factors) or split-plot designs. Note, however, that the simple application of the pooled standard deviation formula does not take into account the correlation between repeated measures. Researchers disagree as to whether these correlations ought to contribute to effect size computation; one method of determining Cohen's d while accounting for the correlated nature of repeated measures involves computing d from a paired t test.

Additional Means of Calculation

Beyond the formulas presented above, Cohen's d may be derived from other statistics, including the Pearson family of correlation coefficients (r), t tests, and F tests. Derivations from r are particularly useful, allowing for translation among various effect size indices. Derivations from other statistics are often necessary when raw data to compute Cohen's d are unavailable, such as when conducting a meta-analysis of published data. When d is derived as in Equation 3, the following formulas apply:

d = 2r / √(1 − r²), (8)

d = t(n1 + n2) / (√df × √(n1 n2)), (9)
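Equations 7 through 9 can be sketched as small helper functions (the numeric inputs below are illustrative, not values from the entry):

```python
# Cohen's d helpers: pooled SD for two groups (Equation 7) and conversions
# from r and t (Equations 8 and 9). Example numbers are hypothetical.
from math import sqrt

def pooled_sd_two_groups(s1, s2):
    """Equation 7: root mean square of the two group standard deviations."""
    return sqrt((s1 ** 2 + s2 ** 2) / 2)

def d_from_r(r):
    """Equation 8: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r ** 2)

def d_from_t(t, n1, n2):
    """Equation 9: d = t(n1 + n2) / (sqrt(df) * sqrt(n1*n2)), df = n1 + n2 - 2."""
    df = n1 + n2 - 2
    return t * (n1 + n2) / (sqrt(df) * sqrt(n1 * n2))

print(round(pooled_sd_two_groups(3.0, 4.0), 4))   # -> 3.5355
print(round(d_from_r(0.5), 4))                    # -> 1.1547
print(round(d_from_t(2.0, 10, 10), 4))            # -> 0.9428
```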
d = t √(1/n1 + 1/n2) (12)

and

d = √( F (1/n1 + 1/n2) ). (13)

Again, Equation 13 applies only to instances in which the numerator df = 1.

These formulas must be corrected for the correlation (r) between dependent variables in repeated measures designs. For example, Equation 12 is corrected as follows:

d = t √( (1 − r)/n1 + (1 − r)/n2 ). (14)

Finally, conversions between effect sizes computed with Equations 3 and 4 may be easily accomplished:

d_eq3 = d_eq4 × √( (n1 + n2) / (n1 + n2 − 2) ) (15)

and

d_eq4 = d_eq3 / √( (n1 + n2) / (n1 + n2 − 2) ). (16)

Variance and Confidence Intervals

The estimated variance of Cohen's d depends on how the statistic was originally computed. When sample bias in the estimation of the population pooled standard deviation remains uncorrected (Equation 3), the variance is computed in the following manner:

s²d = (n1 + n2)/(n1 n2) + [d² / (2(n1 + n2 − 2))] × [(n1 + n2)/(n1 + n2 − 2)]. (17)

A simplified formula is employed when sample bias is corrected as in Equation 4:

s²d = (n1 + n2)/(n1 n2) + d² / (2(n1 + n2 − 2)). (18)

Once calculated, the effect size variance may be used to compute a confidence interval (CI) for the statistic to determine statistical significance:

CI = d ± z(sd). (19)

The z in the formula corresponds to the z-score value on the normal distribution corresponding to the desired probability level (e.g., 1.96 for a 95% CI), and sd is the square root of the variance given by Equation 17 or 18. Variances and CIs may also be obtained through bootstrapping methods.

Interpretation

Cohen's d, as a measure of effect size, describes the overlap in the distributions of the compared samples on the dependent variable of interest. If the two distributions overlap completely, one would expect no mean difference between them (i.e., Ȳ1 − Ȳ2 = 0). To the extent that the distributions do not overlap, the difference ought to be greater than zero (assuming Ȳ1 > Ȳ2).

Cohen's d may be interpreted in terms of both statistical significance and magnitude, with the latter the more common interpretation. Effect sizes are statistically significant when the computed CI does not contain zero. This implies less than perfect overlap between the distributions of the two groups compared. Moreover, the significance testing implies that this difference from zero is reliable, or not due to chance (excepting Type I errors). While significance testing of effect sizes is often undertaken, however, interpretation based solely on statistical significance is not recommended. Statistical significance is reliant not only on the size of the effect but also on the size of the sample. Thus, even large effects may be deemed unreliable when insufficient sample sizes are utilized.

Interpretation of Cohen's d based on magnitude is more common than interpretation based on statistical significance of the result. The magnitude of Cohen's d indicates the extent of nonoverlap between two distributions, or the disparity of the mean difference from zero. Larger numeric values of Cohen's d indicate larger effects or greater differences between the two means. Values may be positive or negative, although the sign merely indicates whether the first or second mean in the numerator was of greater magnitude (see Equation 2). Typically, researchers choose to subtract the smaller mean from the larger, resulting in a positive
effect size. As a standardized measure of effect, the numeric value of Cohen's d is interpreted in standard deviation units. Thus, an effect size of d = 0.5 indicates that two group means are separated by one-half standard deviation, or that one group shows a one-half standard deviation advantage over the other.

The magnitude of effect sizes is often described nominally as well as numerically. Jacob Cohen defined effects as small (d = 0.2), medium (d = 0.5), or large (d = 0.8). These rules of thumb were derived after surveying the behavioral sciences literature, which included studies in various disciplines involving diverse populations, interventions or content under study, and research designs. Cohen, in proposing these benchmarks in a 1988 text, explicitly noted that they are arbitrary and thus ought not be viewed as absolute. However, as occurred with use of .05 as an absolute criterion for establishing statistical significance, Cohen's benchmarks are oftentimes interpreted as absolutes, and as a result, they have been criticized in recent years as outdated, atheoretical, and inherently nonmeaningful. These criticisms are especially prevalent in applied fields in which medium-to-large effects prove difficult to obtain and smaller effects are often of great importance. The small effect of d = 0.07, for instance, was sufficient for physicians to begin recommending aspirin as an effective method of preventing heart attacks. Similar small effects are often celebrated in intervention and educational research, in which effect sizes of d = 0.3 to d = 0.4 are the norm. In these fields, the practical importance of reliable effects is often weighed more heavily than simple magnitude, as may be the case when adoption of a relatively simple educational approach (e.g., discussing vs. not discussing novel vocabulary words when reading storybooks to children) results in effect sizes of d = 0.25 (consistent with increases of one-fourth of a standard deviation unit on a standardized measure of vocabulary knowledge).

Critics of Cohen's benchmarks assert that such practical or substantive significance is an important consideration beyond the magnitude and statistical significance of effects. Interpretation of effect sizes requires an understanding of the context in which the effects are derived, including the particular manipulation, population, and dependent measure(s) under study. Various alternatives to Cohen's rules of thumb have been proposed. These include comparisons with effect sizes based on (a) normative data concerning the typical growth, change, or differences between groups prior to experimental manipulation; (b) those obtained in similar studies and available in the previous literature; (c) the gain necessary to attain an a priori criterion; and (d) cost–benefit analyses.

Cohen's d in Meta-Analyses

Cohen's d, as a measure of effect size, is often used in individual studies to report and interpret the magnitude of between-group differences. It is also a common tool used in meta-analyses to aggregate effects across different studies, particularly in meta-analyses involving study of between-group differences, such as treatment studies. A meta-analysis is a statistical synthesis of results from independent research studies (selected for inclusion based on a set of predefined commonalities), and the unit of analysis in the meta-analysis is the data used for the independent hypothesis test, including sample means and standard deviations, extracted from each of the independent studies. The statistical analyses used in the meta-analysis typically involve (a) calculating the Cohen's d effect size (standardized mean difference) on data available within each independent study on the target variable(s) of interest and (b) combining these individual summary values to create pooled estimates by means of any one of a variety of approaches (e.g., Rebecca DerSimonian and Nan Laird's random effects model, which takes into account variations among studies on certain parameters). Therefore, the methods of the meta-analysis may rely on use of Cohen's d as a way to extract and combine data from individual studies. In such meta-analyses, the reporting of results involves providing average d values (and CIs) as aggregated across studies.

In meta-analyses of treatment outcomes in the social and behavioral sciences, for instance, effect estimates may compare outcomes attributable to a given treatment (Treatment X) as extracted from and pooled across multiple studies in relation to an alternative treatment (Treatment Y) for Outcome Z using Cohen's d (e.g., d = 0.21, CI = 0.06, 1.03). It is important to note that the meaningfulness of this result, in that Treatment X is, on average, associated with an improvement of about
one-fifth of a standard deviation unit for Outcome means increases relative to the average standard
Z relative to Treatment Y, must be interpreted in deviation within each group. Jacob Cohen has sug-
reference to many factors to determine the actual gested that the values of 0.10, 0.25, and 0.40
significance of this outcome. Researchers must, at represent small, medium, and large effect sizes,
the least, consider whether the one-fifth of a stan- respectively.
dard deviation unit improvement in the outcome
attributable to Treatment X has any practical
significance.

Shayne B. Piasta and Laura M. Justice

See also Analysis of Variance (ANOVA); Effect Size, Measures of; Mean Comparisons; Meta-Analysis; Statistical Power Analysis for the Behavioral Sciences

Further Readings

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Cooper, H., & Hedges, L. V. (1994). The handbook of research synthesis. New York: Russell Sage Foundation.

Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational & Behavioral Statistics, 32, 341–370.

Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2007, July). Empirical benchmarks for interpreting effect sizes in research. New York: MDRC.

Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different estimators of effect size? Journal of Consulting & Clinical Psychology, 64, 1316–1325.

Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

COHEN’S f STATISTIC

Effect size is a measure of the strength of the relationship between variables. Cohen’s f statistic is one appropriate effect size index to use for a one-way analysis of variance (ANOVA). Cohen’s f is a measure of a kind of standardized average effect in the population across all the levels of the independent variable. Cohen’s f can take on values between zero, when the population means are all equal, and an indefinitely large number, as the standard deviation of the population means increases.

Calculation

Cohen’s f is calculated as

f = σ_m / σ,  (1)

where σ_m is the standard deviation (SD) of the population means (m_i) represented by the samples and σ is the common within-population SD; σ = MSE^(1/2), where MSE is the mean square of error (within groups) from the overall ANOVA F test. σ_m is based on the deviation of the population means from the mean of the combined populations, or the mean of the means (M):

σ_m = [Σ (m_i − M)² / k]^(1/2)  (2)

for equal sample sizes and

σ_m = [Σ n_i (m_i − M)² / N]^(1/2)  (3)

for unequal sample sizes.

Examples

Example 1

Table 1 provides descriptive statistics for a study with four groups and equal sample sizes. ANOVA results are shown. The calculations below result in an estimated f effect size of .53, which is considered large by Cohen’s standards. An appropriate interpretation is that about 22% of the variance in the dependent variable (physical health) is explained by the independent variable (presence or absence of mental or physical illnesses at age 16), because the proportion of variance explained is f²/(1 + f²) = .28/1.28 ≈ .22.

σ_m = [Σ (m_i − M)² / k]^(1/2) = [((71.88 − 62.74)² + (66.08 − 62.74)² + (58.44 − 62.74)² + (54.58 − 62.74)²) / 4]^(1/2) = 6.70

f = σ_m / σ = 6.70 / 161.29^(1/2) = 6.70 / 12.7 = 0.53
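The calculation in Equations 1 through 3, together with the Maxwell–Delaney adjustment introduced later as Equation 4, can be sketched in a few lines of Python (a minimal illustration; the function names are mine, and the values are those of Tables 1 and 2):

```python
import math

def cohens_f(means, ns, mse):
    """Cohen's f (Equations 1-3): sigma_m / sigma, with sigma = MSE ** 0.5."""
    N = sum(ns)
    M = sum(n * m for n, m in zip(ns, means)) / N  # mean of the means
    # Equations 2 and 3 both amount to a sample-size-weighted average
    # of the squared deviations of the group means from M.
    sigma_m = math.sqrt(sum(n * (m - M) ** 2 for n, m in zip(ns, means)) / N)
    return sigma_m / math.sqrt(mse)

def cohens_f_adj(k, F, N):
    """Maxwell-Delaney adjusted f (Equation 4), correcting positive bias."""
    return math.sqrt((k - 1) * (F - 1) / N)

# Example 1 (Table 1): four groups of 80, MSE = 161.29
print(round(cohens_f([71.88, 66.08, 58.44, 54.58], [80] * 4, 161.29), 2))  # 0.53
print(round(cohens_f_adj(4, 29.72, 320), 2))  # 0.52 (Table 1)
print(round(cohens_f_adj(4, 14.84, 608), 2))  # 0.26 (Table 2)
```

The weighted grand mean reproduces the tabled value of M (62.74), and the unadjusted f matches the worked example.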
Table 1  Association of Mental Disorders and Physical Illnesses at a Mean Age of 16 Years With Physical Health at a Mean Age of 33 Years: Equal Sample Sizes

Group                                    n     M      SD
Reference group (no disorders)            80   71.88   5.78
Mental disorder only                      80   66.08  19.00
Physical illness only                     80   58.44   8.21
Physical illness and mental disorder      80   54.58  13.54
Total                                    320   62.74  14.31

Analysis of Variance
Source            Sum of Squares    df   Mean Square      F      p
Between groups         14,379.93     3      4,793.31   29.72   0.00
Within groups          50,967.54   316        161.29
Total                  65,347.47   319

Source: Adapted from Chen, H., Cohen, P., Kasen, S., Johnson, J. G., Berenson, K., & Gordon, K. (2006). Impact of adolescent mental disorders and physical illnesses on quality of life 17 years later. Archives of Pediatrics & Adolescent Medicine, 160, 93–99.
Note: Adapted from the total sample size of 608 by choosing 80 subjects for each group.

Table 2  Association of Mental Disorders and Physical Illnesses at a Mean Age of 16 Years With Physical Health at a Mean Age of 33 Years: Unequal Sample Sizes

Group                                    n     M      SD
Reference group (no disorders)           256   72.25  17.13
Mental disorder only                      89   68.16  21.19
Physical illness only                    167   66.68  18.58
Physical illness and mental disorder      96   57.67  18.86
Total                                    608   67.82  19.06

Analysis of Variance
Source            Sum of Squares    df   Mean Square      F      p
Between groups         15,140.20     3      5,046.73   14.84   0.00
Within groups         205,445.17   604        340.14
Total                 220,585.37   607

Source: Adapted from Chen, H., Cohen, P., Kasen, S., Johnson, J. G., Berenson, K., & Gordon, K. (2006). Impact of adolescent mental disorders and physical illnesses on quality of life 17 years later. Archives of Pediatrics & Adolescent Medicine, 160, 93–99.
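As a check on the interpretation of Example 1, Cohen’s f converts to the proportion of variance explained (eta squared) through the standard identity η² = f²/(1 + f²), which can also be read directly off Table 1 as SS_between/SS_total (a quick sketch):

```python
f = 0.53
eta_sq_from_f = f ** 2 / (1 + f ** 2)     # standard identity: f^2 = eta^2 / (1 - eta^2)
eta_sq_from_ss = 14_379.93 / 65_347.47    # SS_between / SS_total from Table 1
print(round(eta_sq_from_f, 2), round(eta_sq_from_ss, 2))  # 0.22 0.22
```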
Cohen’s f and d

Cohen’s f is an extension of Cohen’s d, which is the appropriate measure of effect size to use for a t test. Cohen’s d is the difference between two group means divided by the pooled SD for the two groups. The relationship between f and d when one is comparing two means (equal sample sizes) is d = 2f. If Cohen’s f = 0.1, the SD of the k (k ≥ 2) population means is one tenth as large as the SD of the observations within the populations. For k = 2 populations, this effect size indicates a small difference between the two populations: d = 2f = 2 × 0.10 = 0.2.

Cohen’s f in Equation 1 is positively biased because the sample means in Equation 2 or 3 are likely to vary more than do the population means. One can use the following equation from Scott Maxwell and Harold Delaney to calculate an adjusted Cohen’s f:

f_adj = [(k − 1)(F − 1) / N]^(1/2)  (4)

Applying Equation 4 to the data in Table 1 yields

f_adj = [(k − 1)(F − 1) / N]^(1/2) = [(4 − 1)(29.72 − 1) / 320]^(1/2) = 0.52.

For Table 2,

f_adj = [(k − 1)(F − 1) / N]^(1/2) = [(4 − 1)(14.84 − 1) / 608]^(1/2) = 0.26.

COHEN’S KAPPA

Cohen’s kappa coefficient (κ) is a statistical measure of the degree of agreement or concordance between two independent raters that takes into account the possibility that agreement could occur by chance alone.

Like other measures of interrater agreement, κ is used to assess the reliability of different raters or measurement methods by quantifying their consistency in placing individuals or items in two or more mutually exclusive categories. For instance, in a study of developmental delay, two pediatricians may independently assess a group of toddlers and classify them with respect to their language development into either ‘‘delayed for age’’ or ‘‘not delayed.’’ One important aspect of the utility of this classification is the presence of good agreement between the two raters. Agreement between two raters could be simply estimated as the percentage of cases in which both raters agreed. However, a certain degree of agreement is expected by chance alone. In other words, two raters could still agree on some occasions even if they were randomly assigning individuals into either category.

In situations in which there are two raters and the categories used in the classification system have no natural order (e.g., delayed vs. not delayed; present vs. absent), Cohen’s κ can be used to quantify the degree of agreement in the assignment of these categories beyond what would be expected by random guessing or chance alone.
Table 1  Data From the Hypothetical Study Described in the Text: Results of Assessments of Developmental Delay Made by Two Pediatricians (Rater 1 × Rater 2 cross-classification; the cell counts did not survive in this copy)
…questions that may arise in a reliability study. For instance, it might be of interest to determine whether disagreement between the two pediatricians in the above example was more likely to occur when diagnosing developmental delay than when diagnosing normal development or vice versa. However, κ cannot be used to address this question, and alternative measures of agreement are needed for that purpose.

Table 2  Landis and Koch Interpretation of Cohen’s κ

Cohen’s Kappa   Degree of Agreement
< 0.20          Poor
0.21–0.40       Fair
0.41–0.60       Moderate
0.61–0.80       Good
0.81–1.00       Very good

Source: Landis & Koch, 1977.
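The portion of this entry presenting the computation of κ did not survive in this copy, but the standard formula is κ = (p_o − p_e)/(1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the raters’ marginal totals. A minimal Python sketch, using hypothetical counts for the two-pediatrician example (the counts and function names are mine), with a verbal label taken from Table 2:

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table.
    table[i][j] = items rater 1 put in category i and rater 2 in category j.
    Standard formula: kappa = (p_o - p_e) / (1 - p_e)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n                       # observed agreement
    row = [sum(table[i]) / n for i in range(k)]                        # rater 1 marginals
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]   # rater 2 marginals
    p_e = sum(row[i] * col[i] for i in range(k))                       # chance agreement
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Verbal label per Table 2 (Landis & Koch, 1977)."""
    for cutoff, label in [(0.20, "Poor"), (0.40, "Fair"), (0.60, "Moderate"),
                          (0.80, "Good"), (1.00, "Very good")]:
        if kappa <= cutoff:
            return label

# Hypothetical counts: 40 toddlers rated "delayed" / "not delayed" by two pediatricians
table = [[14, 4],    # rater 1 "delayed": rater 2 said delayed / not delayed
         [6, 16]]    # rater 1 "not delayed"
kappa = cohens_kappa(table)
print(round(kappa, 2), landis_koch(kappa))  # 0.5 Moderate
```

Here the raters agree on 30 of 40 cases (p_o = .75), but half of that agreement is expected by chance (p_e = .50), so κ is only .50.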
See also Interrater Reliability; Reliability

Further Readings

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational & Psychological Measurement, 20, 37–46.

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.

Gwet, K. (2002). Inter-rater reliability: Dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2, 1–9.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

COHORT DESIGN

In epidemiology, a cohort design or cohort study is a nonexperimental study design that involves comparing the occurrence of a disease or condition in two or more groups (or cohorts) of people that differ on a certain characteristic, risk factor, or exposure. The disease, state, or condition under study is often referred to as the outcome, whereas the characteristic, risk factor, or exposure is often referred to as the exposure. A cohort study is one of two principal types of nonexperimental study designs used to study the causes of disease. The other is the case–control design, in which cases of the disease under study are compared with respect to their past exposure with a similar group of individuals who do not have the disease.

Cohort (from the Latin cohors, originally a unit of a Roman legion) is the term used in epidemiology to refer to a group of individuals who share a common characteristic; for example, they may all belong to the same ethnic or age group or be exposed to the same risk factor (e.g., radiation or soil pollution).

The cohort study is a relatively recent innovation. The first cohort studies were used to confirm the link between smoking and lung cancer that had been observed initially in earlier case–control studies. Cohort studies also formed the basis for much of the early progress in understanding occupational diseases. Cohort studies based on data derived from company records and vital records led to the identification of many environmental and occupational risk factors. Several major cohort studies with follow-up that spanned decades have made significant contributions to our understanding of the causes of several common chronic diseases. Examples include the Framingham Heart Study, the Tecumseh Community Health Study, the British Doctors Study, and the Nurses’ Health Study.

In a classic cohort study, individuals who are initially free of the disease being researched are enrolled into the study, and individuals are each categorized into one of two groups according to whether they have been exposed to the suspected risk factor. One group, called the exposed group, includes individuals known to have the characteristic or risk factor under study. For instance, in a cohort study of the effect of smoking on lung cancer, the exposed group may consist of known smokers. The second group, the unexposed group, will comprise a comparable group of individuals who are also free of the disease initially but are nonsmokers. Both groups are then followed up for a predetermined period of time or until the occurrence of disease or death. Cases of the disease (lung cancer in this instance) occurring among both groups are identified in the same way for both groups. The number of people diagnosed with the disease in the exposed group is compared with that among the unexposed group to estimate the relative risk of disease due to the exposure or risk factor. This type of design is sometimes called a prospective cohort study.

In a retrospective (historical) cohort study, the researchers use existing records or electronic databases to identify individuals who were exposed at a certain point in the past and then ‘‘follow’’ them up to the present. For instance, to study the effect of exposure to radiation on cancer occurrence among workers in a uranium mine, the researcher may use employee radiation exposure records to categorize workers into those who were exposed to radiation and those who were not at a certain date in the past (e.g., 10 years ago). The medical records of each employee are then searched to identify those employees who were diagnosed with
cancer from that date onward. Like prospective cohort designs, the frequency of occurrence of the disease in the exposed group is compared with that within the unexposed group in order to estimate the relative risk of disease due to radiation exposure. When accurate and comprehensive records are available, this approach could save both time and money. But unlike the classic cohort design, in which information is collected prospectively, the researcher employing a retrospective cohort design has little control over the quality and availability of information.

Cohort studies could also be classified as closed or open cohort studies. In a closed cohort study, cohort membership is decided at the onset of the study, and no additional participants are allowed to join the cohort once the study starts. For example, in the landmark British Doctors Study, participants were male doctors who were registered for medical practice in the United Kingdom in 1951. This cohort was followed up with periodic surveys until 2001. This study provided strong evidence for the link between smoking and several chronic diseases, including lung cancer.

In an open (dynamic) cohort study, the cohort membership may change over time as additional participants are permitted to join the cohort, and they are followed up in a fashion similar to that of the original participants. For instance, in a prospective study of the effects of radiation on cancer occurrence among uranium miners, newly recruited miners are enrolled in the study cohort and are followed up in the same way as those miners who were enrolled at the inception of the cohort study.

Regardless of their type, cohort studies are distinguished from other epidemiological study designs by having all the following features:

• The study group or groups are observed over time for the occurrence of the study outcome.
• The study group or groups are defined on the basis of whether they have the exposure at the start or during the observation period before occurrence of the outcome. Therefore, in a cohort study, it is always clear that the exposure has occurred before the outcome.
• Cohort studies are observational or nonexperimental studies. Unlike clinical trials, the cohort study design does not usually involve manipulating the exposure under study in any way that changes the exposure status of the participants.

Advantages and Disadvantages

Cohort studies are used instead of experimental study designs, such as clinical trials, when experiments are not feasible for practical or ethical reasons, such as when investigating the effects of a potential cause of disease.

In contrast to case–control studies, the design of cohort studies is intuitive, and their results are easier to understand by nonspecialists. Furthermore, the temporal sequence of events in a cohort study is clear because it is always known that the exposure has occurred before the disease. In case–control and cross-sectional studies, it is often unclear whether the suspected exposure has led to the disease or the other way around.

In prospective cohort studies, the investigator has more control over what information is to be collected and at what intervals. As a result, the prospective cohort design is well suited for studying chronic diseases because it permits fuller understanding of the disease’s natural history.

In addition, cohort studies are typically better than case–control studies in studying rare exposures. For instance, case–control designs are not practical in studying occupational exposures that are rare in the general population, as when exposure is limited to a small cohort of workers in a particular industry. Another advantage of cohort studies is that multiple diseases and conditions related to the same exposure could be easily examined in one study.

On the other hand, cohort studies tend to be more expensive and take longer to complete than other nonexperimental designs. Generally, case–control and retrospective cohort studies are more efficient and less expensive than prospective cohort studies. The prospective cohort design is not suited for the study of rare diseases, because prospective cohort studies require following up a large number of individuals for a long time. Maintaining participation of study subjects over time is a challenge, and selective dropout from the study (or loss to follow-up) may result in biased results. Because of lack of randomization, cohort
studies are more potentially subject to bias and confounding than experimental studies are.

Design and Implementation

The specifics of cohort study design and implementation depend on the aim of the study and the nature of the risk factors and diseases under study. However, most prospective cohort studies begin by assembling one or more groups of individuals. Often members of each group share a well-defined characteristic or exposure. For instance, a cohort study of the health effects of uranium exposure may begin by recruiting all uranium miners employed by the same company, whereas a cohort study of the health effects of exposure to soil pollutants from a landfill may include all people living within a certain distance from the landfill. Several major cohort studies have recruited all people born in the same year in a city or province (birth cohorts). Others have included all members of a professional group (e.g., physicians or nurses), regardless of where they lived or worked. Yet others were based on a random sample of the population. Cohort studies of the natural history of disease may include all people diagnosed with a precursor or an early form of the disease and then followed up as their disease progressed.

The next step is to gather information on the exposure under investigation. Cohort studies can be used to examine exposure to external agents, such as radiation, second-hand smoke, an infectious agent, or a toxin. But they can also be used to study the health effects of internal states (e.g., possession of a certain gene), habits (e.g., smoking or physical inactivity), or other characteristics (e.g., level of income or educational status). The choice of the appropriate exposure measurement method is an important design decision and depends on many factors, including the accuracy and reliability of the available measurement methods, the feasibility of using these methods to measure the exposure for all study participants, and the cost.

In most prospective cohort studies, baseline information is collected on all participants as they join the cohort, typically using self-administered questionnaires or phone or in-person interviews. The nature of the collected information depends on the aim of the study but often includes detailed information on the exposure(s) under investigation. In addition, information on demographic and socioeconomic factors (e.g., age, gender, and occupation) is often collected. As in all observational studies, information on potential confounders, factors associated with both the exposure and outcome under study that could confuse the interpretation of the results, is also collected. Depending on the type of the exposure under study, the study design may also include medical examinations of study participants, which may include clinical assessment (e.g., measuring blood pressure), laboratory testing (e.g., measuring blood sugar levels or testing for evidence of an infection with a certain infectious agent), or radiological examinations (e.g., chest x-rays). In some studies, biological specimens (e.g., blood or serum specimens) are collected and stored for future testing.

Follow-up procedures and intervals are also important design considerations. The primary aim of follow-up is to determine whether participants developed the outcome under study, although most cohort studies also collect additional information on exposure and confounders to determine changes in exposure status (e.g., a smoker who quits smoking) and other relevant outcomes (e.g., development of other diseases or death). As with exposures, the method of collecting information on outcomes depends on the type of outcome and the degree of desired diagnostic accuracy. Often, mailed questionnaires and phone interviews are used to track participants and determine whether they developed the disease under study. For certain types of outcome (e.g., death or development of cancer), existing vital records (e.g., the National Death Index in the United States) or cancer registration databases could be used to identify study participants who died or developed cancer. Sometimes, in-person interviews and clinic visits are required to accurately determine whether a participant has developed the outcome, as in the case of studies of the incidence of often asymptomatic diseases such as hypertension or HIV infection.

In certain designs, called repeated measurements designs, the above measurements are performed more than once for each participant. Examples include pre-post exposure studies, in which an assessment such as blood pressure measurement is made before and after an intervention such as the administration of an antihypertensive
medication. Pre-post designs are more commonly used in experimental studies or clinical trials, but there are occasions where this design can be used in observational cohort studies. For instance, results of hearing tests performed during routine preemployment medical examinations can be compared with results from hearing tests performed after a certain period of employment to assess the effect of working in a noisy workplace on hearing acuity.

In longitudinal repeated measurements designs, typically two or more exposure (and outcome) measurements are performed over time. These studies tend to be observational and could therefore be carried out prospectively or, less commonly, retrospectively using precollected data. These studies are ideal for the study of complex phenomena such as the natural history of chronic diseases, including cancer. The repeated measurements allow the investigator to relate changes in time-dependent exposures to the dynamic status of the disease or condition under study. This is especially valuable if exposures are transient and may not be measurable by the time the disease is detected. For instance, repeated measurements are often used in longitudinal studies to examine the natural history of cervical cancer as it relates to infections with the human papilloma virus. In such studies, participants are typically followed up for years with prescheduled clinic visits at certain intervals (e.g., every 6 months). At each visit, participants are examined for evidence of infection with human papilloma virus or development of cervical cancer. Because of the frequent testing for these conditions, it is possible to acquire a deeper understanding of the complex sequence of events that terminates with the development of cancer.

One important goal in all cohort studies is to minimize voluntary loss to follow-up due to participants’ dropping out of the study or due to researchers’ failure to locate and contact all participants. The longer the study takes to complete, the more likely that a significant proportion of the study participants will be lost to follow-up because of voluntary or involuntary reasons (e.g., death, migration, or development of other diseases). Regardless of the reasons, loss to follow-up is costly because it reduces the study’s sample size and therefore its statistical power. More important, loss to follow-up could bias the study results if the remaining cohort members differ from those who were lost to follow-up with respect to the exposure under study. For instance, in a study of the effects of smoking on dementia, smoking may misleadingly appear to reduce the risk of dementia because smokers are more likely than nonsmokers to die at a younger age, before dementia could be diagnosed.

Analysis of Data

Compared with other observational epidemiologic designs, cohort studies provide data permitting the calculation of several types of disease occurrence measures, including disease prevalence, incidence, and cumulative incidence. Typically, disease incidence rates are calculated separately for each of the exposed and the unexposed study groups. The ratio between these rates, the rate ratio, is then used to estimate the degree of increased risk of the disease due to the exposure.

In practice, more sophisticated statistical methods are needed to account for the lack of randomization. These methods include direct or indirect standardization, commonly used to account for differences in the age or gender composition of the exposed and unexposed groups. Poisson regression could be used to account for differences in one or more confounders. Alternatively, life table and other survival analysis methods, including Cox proportional hazards models, could be used to analyze data from cohort studies.

Special Types

Nested Case–Control Studies

In a nested case–control study, cases and controls are sampled from a preexisting and usually well-defined cohort. Typically, all subjects who develop the outcome under study during follow-up are included as cases. The investigator then randomly samples a subset of noncases (subjects who did not develop the outcome at the time of diagnosis of cases) as controls. Nested case–control studies are more efficient than cohort studies because exposures are measured only for cases and a subset of noncases rather than for all members of the cohort.
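The incidence-rate comparison at the heart of the analysis can be sketched in a few lines (the case counts and person-years below are hypothetical, not from any study cited in this entry):

```python
def incidence_rate(cases, person_years):
    """Incidence rate: new cases per unit of person-time at risk."""
    return cases / person_years

# Hypothetical smoking cohort: lung cancer cases and person-years of follow-up
rate_exposed = incidence_rate(90, 30_000)    # smokers
rate_unexposed = incidence_rate(10, 20_000)  # nonsmokers

# The rate ratio estimates the increased risk of disease due to the exposure.
print(round(rate_exposed / rate_unexposed, 2))  # 6.0
```

A rate ratio of 6 would be read as a sixfold higher incidence of the disease among the exposed; confounding would still need to be addressed with the methods described above.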
COLLINEARITY
change in magnitude or even a reversal in sign in one regression coefficient after another predictor variable is added to the model or specific observations are excluded from the model. It is especially important for inference that a possible consequence of collinearity is a sign for a regression coefficient that is counterintuitive or counter to previous research. The instability of estimates is also realized in very large or inflated standard errors of the regression coefficients. The fact that these inflated standard errors are used in significance tests of the regression coefficients leads to conclusions of insignificance of regression coefficients, even, at times, in the case of important predictor variables. In contrast to inference on the regression coefficients, collinearity does not impact the overall fit of the model to the observed response variable data.

Diagnosing Collinearity

There are several commonly used exploratory tools to diagnose potential collinearity in a regression model. The numerical instabilities in analysis caused by collinearity among regression model variables lead to correlation between the estimated regression coefficients, so some techniques assess the level of correlation in both the predictor variables and the coefficients. Coefficients of correlation between pairs of predictor variables are statistical measures of the strength of association between variables. Scatterplots of the values of pairs of predictor variables provide a visual description of the correlation among variables, and these tools are used frequently.

There are, however, more direct ways to assess collinearity in a regression model by inspecting the model output itself. One way to do so is through coefficients of correlation of pairs of estimated regression coefficients. These statistical summary measures allow one to assess the level of correlation among different pairs of covariate effects as well as the correlation between covariate effects and the intercept. Another way to diagnose collinearity is through variance inflation factors, which measure the amount of increase in the estimated variances of regression coefficients compared with when predictor variables are uncorrelated. Drawbacks of the variance inflation factor as a collinearity diagnostic tool are that it does not illuminate the nature of the collinearity, which is problematic if the collinearity is between more than two variables, and it does not consider collinearity with the intercept. A diagnostic tool that accounts for these issues consists of variance-decomposition proportions of the regression coefficient variance–covariance matrix and the condition index of the matrix of the predictor variables and constant term. Some less formal diagnostics of collinearity that are commonly used are a counterintuitive sign in a regression coefficient, a relatively large change in value for a regression coefficient after another predictor variable is added to the model, and a relatively large standard error for a regression coefficient. Given that statistical inference on regression coefficients is typically a primary concern in regression analysis, it is important for one to apply diagnostic tools in a regression analysis before interpreting the regression coefficients, as the effects of collinearity could go unnoticed without a proper diagnostic analysis.

Remedial Methods for Collinearity

There are several methods in statistics that attempt to overcome collinearity in standard linear regression models. These methods include principal components regression, ridge regression, and a technique called the lasso. Principal components regression is a variable subset selection method that uses combinations of the exogenous variables in the model, and ridge regression and the lasso are penalization methods that add a constraint on the magnitude of the regression coefficients. Ridge regression was designed precisely to reduce collinearity effects by penalizing the size of regression coefficients. The lasso also shrinks regression coefficients, but it shrinks the least significant variable coefficients toward zero to remove some terms from the model. Ridge regression and the lasso are considered superior to principal components regression to deal with collinearity in regression models because they more purposely reduce inflated variance in regression coefficients due to collinearity while retaining interpretability of individual covariate effects.
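In the simplest case of two predictors, the variance inflation factor reduces to 1/(1 − r²), where r is the Pearson correlation between the predictors. The sketch below (pure Python, with illustrative data of my own) shows how sharply a near-collinear pair inflates coefficient variance:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor for a model with two predictors.
    With two predictors, the R^2 from regressing one predictor on the
    other is r^2, so VIF = 1 / (1 - r^2) for both coefficients."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]   # nearly collinear with x1
print(f"VIF = {vif_two_predictors(x1, x2):.1f}")
```

For this near-collinear pair the VIF runs into the hundreds, whereas a weakly correlated pair gives a VIF close to 1, the value obtained when predictors are uncorrelated.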
COLUMN GRAPH
[Figure 1: Column graph of Democratic (gray) and Republican (black) seat totals in the Illinois House following each election, 1990–2008. Y-axis: Number of Seats; x-axis: Election; legend: Democrat, Republican.]
respondents for their age could produce dozens of potential answers, and therefore it is best to condense the variable into a select few categories (e.g., 18–25, 26–35, 36–45, 46–64, 65 and older) before making a column graph that summarizes the distribution.

Multiple Distribution Column Graphs

Column graphs can also be used to compare multiple distributions of data. Rather than presenting a single set of vertical rectangles that represents a single distribution of data, column graphs present multiple sets of rectangles, one for each distribution. For ease of interpretation, each set of rectangles should be grouped together and separate from the other distributions. This type of graph can be particularly useful in comparing counts of observations across different categories of interest. For example, a researcher conducting a survey might present the aggregate distribution of self-reported partisanship, but the researcher can also demonstrate the gender gap by displaying separate partisan distributions for male and female respondents. By putting each distribution into a single graph, the researcher can visually present the gender gap in a readily understandable format.

Column graphs might also be used to explore chronological trends in distributions. One such example is in Figure 1, which displays the partisan makeup of the Illinois state House from the 1990 through the 2008 elections. The black bars represent the number of Republican seats won in the previous election (presented on the x-axis) and the gray bars represent the number of Democratic seats. By showing groupings of the two partisan bars side by side, the graph provides for an easy interpretation of which party had control of the legislative chamber after a given election. Furthermore, readers can look across each set of bars to
Figure 2  1990–2009 Illinois House Partisan Distribution (100% Stacked Column Graph)

[Figure 2 shows, for each election from 1990 through 2008, the percentage of Illinois House seats held by each party, with the Democratic share stacked below the Republican share. Y-axis: percentage of seats; x-axis: Election; legend: Democrat, Republican.]

Source: Almanac of Illinois Politics. (2009). Springfield: Illinois Issues.
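The percentage conversion behind a 100% stacked column such as Figure 2 can be sketched directly (a small illustration using the 1990 seat counts shown in Figure 1; the function name is mine):

```python
def stacked_percentages(counts):
    """Convert category counts to the percentages plotted in a 100% stacked column."""
    total = sum(counts.values())
    return {category: round(100.0 * n / total, 1) for category, n in counts.items()}

# Illinois House seats after the 1990 election (118 seats in total)
print(stacked_percentages({"Democrat": 72, "Republican": 46}))
# {'Democrat': 61.0, 'Republican': 39.0}
```

Each column in the 100% stacked version therefore sums to 100, which makes shifts in relative party strength easier to compare across elections than raw seat counts.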
determine the extent to which the partisan makeup of the Illinois House has changed over time.

Stacked Column

One special form of the column graph is the stacked column, which presents data from a particular distribution in a single column. A regular stacked column places the observation counts directly on top of one another, while a 100% stacked column does the same while modifying each observation count into a percentage of the overall distribution of observations. The former type of graph might be useful when the researcher wishes to present a specific category of interest (which could be at the bottom of the stack) and the total number of observations. The latter might be of interest when the researcher wants a visual representation of how much of the total distribution is represented by each value and the extent to which that distribution is affected by other variables of interest (such as time). Either version of the stacked column approach can be useful in comparing multiple distributions, such as with Figure 2, which presents a 100% stacked column representation of chronological trends in Illinois House partisan makeup.

Column Graphs and Other Figures

When one is determining whether to use a column graph or some other type of visual representation of relevant data, it is important to consider the main features of other types of figures, particularly the level of measurement and the number of distributions of interest.

Bar Graph

A column graph is a specific type of bar graph (or chart). Whereas the bar graph can present summaries of categorical data in either vertical or
198 Column Graph
80
70
60
Number of Seats
50
40
30
20
10
0
1990 1992 1994 1996 1998 2000 2002 2004 2006 2008
Election
Democrat Republican
0 Pie Chart
3 6 9 12 15 18
Number of Correct Answers (Out of 19) A pie chart presents the percentage of each cate-
gory of a distribution as a segment of a circle. This
type of graph allows for only a single distribution
Figure 4 Political Knowledge Variable Distribution at a time; multiple distributions require multiple
(Histogram) pie charts. As with column and bar graphs, pie
Source: Adapted from the University of Illinois Subject Pool, charts represent observation counts (or percen-
Bureau of Educational Research. tages) and thus are used for discrete data, or at the
Completely Randomized Design 199
very least continuous data collapsed into a select See also Bar Chart; Histogram; Interval Scale; Line
few discrete categories. Graph; Nominal Scale; Ordinal Scale; Pie Chart; Ratio
Scale
A line graph displays relationships between two Frankfort-Nachmias, C., & Nachmias, D. (2007).
changing variables by drawing a line that connects Research methods in the social sciences (7th ed.). New
York: Worth.
actual or projected values of a dependent variable
Harris, R. L. (2000). Information graphics: A
(y-axis) based on the value of an independent vari- comprehensive illustrated reference. New York:
able (x-axis). Because line graphs display trends Oxford University Press.
between plots of observations across an indepen- Stoval, J. G. (1997). Infographics: A journalist’s guide.
dent variable, neither the dependent nor the Boston: Allyn and Bacon.
independent variable can contain nominal data.
Furthermore, ordinal data do not lend themselves
to line graphs either, because the data are ordered
only within the framework of the variables. As with COMPLETELY RANDOMIZED DESIGN
column graphs and bar graphs, line graphs can
track multiple distributions of data based on cate- A completely randomized design (CRD) is the sim-
gories of a nominal or ordinal variable. Figure 3 plest design for comparative experiments, as it uses
provides such an example, with yet another method only two basic principles of experimental designs:
of graphically displaying the Illinois House partisan randomization and replication. Its power is best
makeup. The election year serves as the indepen- understood in the context of agricultural experi-
dent variable, and the number of Illinois House ments (for which it was initially developed), and it
seats for a particular party as the dependent vari- will be discussed from that perspective, but true
able. The solid gray and dotted black lines present experimental designs, where feasible, are useful in
the respective Democratic and Republican legisla- the social sciences and in medical experiments.
tive seat counts, allowing for easy interpretation of In CRDs, the treatments are allocated to the
the partisan distribution trends. experimental units or plots in a completely ran-
dom manner. CRD may be used for single- or
multifactor experiments. This entry discusses
Histogram the application, advantages, and disadvantages
of CRD studies and the processes of conducting
A histogram is a special type of column graph
and analyzing them.
that allows for a visual representation of a single
frequency distribution of interval or ratio data
without collapsing the data to a few select cate- Application
gories. Visually, a histogram looks similar to a col-
umn graph, but without any spaces between the CRD is mostly useful in laboratory and green house
rectangles. On the x-axis, a histogram displays experiments in agricultural, biological, animal,
intervals rather than discrete categories. Unlike environmental, and food sciences, where experi-
a bar graph or a column graph, a histogram only mental material is reasonably homogeneous. It is
displays distributions and cannot be used to com- more difficult when the experimental units are
pare multiple distributions. The histogram in Fig- people.
ure 4 displays the distribution of a political
knowledge variable obtained via a 2004 survey of Advantages and Disadvantages
University of Illinois undergraduate students. The
survey included a series of 19 questions about U.S. This design has several advantages. It is very flexi-
government and politics. ble as any number of treatments may be used, with
equal or unequal replications. The design has
Michael A. Lewkowicz a comparatively simple statistical analysis and
retains this simplicity even if some observations are missing or lost accidentally. The design provides maximum degrees of freedom for the estimation of error variance, which increases the precision of an experiment.

However, the design is not suitable if a large number of treatments is used and the experimental material is not reasonably homogeneous. Therefore, it is seldom used in agricultural field experiments, in which soil heterogeneity may be present because of a soil fertility gradient; in animal sciences, when the animals (experimental units) vary in such things as age, breed, or initial body weight; or with people.

Layout of the Design

The plan of allocation of the treatments to the experimental material is called the layout of the design.

For experiments with more than 10 treatments, a 2-digit random number table or a combination of two rows or columns of 1-digit random numbers can be used. Here each 2-digit random number is divided by the number of treatments, and the residual is selected. When the residual is 00, the divisor number is selected; any 00 already occurring in the table is discarded.

On small pieces of paper identical in shape and size, the numbers 1, 2, . . . , N are written. These are thoroughly mixed in a box, the papers are drawn one by one, and the numbers on the selected papers are the random numbers. After each draw, the piece of paper is put back, and thorough mixing is performed again.

The random numbers may also be generated by computers.

Statistical Analysis
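The completely random allocation just described and the partition of the total variation developed in this section can be illustrated with a brief Python sketch. The treatment labels, replication counts, and yield values below are hypothetical, and the ANOVA quantities are computed from scratch rather than with a statistics package.

```python
import random
from statistics import mean

# Hypothetical CRD: v = 3 treatments, r = 4 replications each, N = 12 plots.
treatments = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
random.shuffle(treatments)  # completely random allocation to the 12 plots

# Hypothetical yields, grouped by treatment after the experiment is run.
data = {"A": [20.1, 21.4, 19.8, 20.9],
        "B": [23.0, 24.2, 23.7, 22.8],
        "C": [19.0, 18.4, 19.5, 18.9]}

grand = mean(y for ys in data.values() for y in ys)
N = sum(len(ys) for ys in data.values())
v = len(data)

# Partition of the total sum of squares: treatments (between) + error (within).
tr_ss = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in data.values())
er_ss = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)

s2_t = tr_ss / (v - 1)  # treatment mean square, (v - 1) df
s2_e = er_ss / (N - v)  # error mean square, (N - v) df
F = s2_t / s2_e         # compare with tabulated F at (v - 1, N - v) df
```

The observed F would then be compared with the tabulated F value at (ν − 1) = 2 and (N − ν) = 9 degrees of freedom at the chosen level of significance.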
∑∑ (y_ij − ȳ··)² = ∑ r_i (ȳ_i· − ȳ··)² + ∑∑ (y_ij − ȳ_i·)²,   i = 1, . . . , ν; j = 1, 2, . . . , r_i;

that is, the total sum of squares is equal to the sum of the treatments sum of squares and the error sum of squares. This implies that the total variation can be partitioned into two components: between treatments and within treatments (experimental error).

Table 1 is an ANOVA table for CRD.

Under the null hypothesis H0: t1 = t2 = · · · = tν, that is, all treatment means are equal, the statistic Ft = s²t/s²e follows the F distribution with (ν − 1) and (N − ν) degrees of freedom.

If F (observed) ≥ F (expected) for the same df and at a specified level of significance, say α%, then the null hypothesis of equality of means is rejected at the α% level of significance.

If H0 is rejected, then the pairwise comparison between the treatment means is made by using the critical difference (CD) test,

CDα% = SEd · t for error df and α% level of significance,

where SEd = √[ErSS(1/r_i + 1/r_j)/(N − ν)]; SEd stands for the standard error of the difference between the means of two treatments, and when r_i = r_j = r, SEd = √[2ErSS/{r(N − ν)}].

If the difference between any two treatment means is greater than or equal to the critical difference, the treatment means are said to differ significantly. The critical difference is also called the least significant difference.

Number of Replications

The number of replications required for an experiment is affected by the inherent variability and size of the experimental material, the number of treatments, and the degree of precision required. The minimum number of replications required to detect the specified differences between two treatment means at a specified level of significance is given by the t statistic at error df and α% level of significance, t = d/[s_e√(2/r)], where d = X̄1 − X̄2; X̄1 and X̄2 are treatment means with each treatment replicated r times, and s²e is the error variance. Therefore,

r = 2t²s²e/d².

It is observed that from 12 df onward, the values of t and F (for smaller error variance) decrease considerably slowly, and so from empirical considerations the number of replications is so chosen as to provide about 12 df for error variance for the experiment.

Nonconformity to the Assumptions

When the assumptions are not realized, the researcher may apply one of various transformations in order to bring the data closer to the assumptions, and may then proceed with the usual ANOVA after the transformation. Some common types of transformations are described below.

Arc Sine Transformation

This transformation, also called the angular transformation, is used for count data obtained from a binomial distribution, such as success–failure, diseased–nondiseased, infested–noninfested, barren–nonbarren tillers, inoculated–noninoculated, male–female, dead–alive, and so forth.

The transformation is not applicable to percentages of carbohydrates, protein, profit, disease index, and so forth, which are not derived from count data. The transformation is not needed if nearly all the data lie between 30%
and 70%. The transformation may be used when data range from 0% to 30% or from 70% to 100%. For transformation, 0/n and n/n should be taken as 1/(4n) and (1 − 1/(4n)), respectively.

Square Root Transformation

The square root transformation is used for data from a Poisson distribution, that is, when data are counts of rare events such as number of defects or accidents, number of infested plants in a plot, insects caught on traps, or weeds per plot. The transformation consists of taking the square root of each observation before proceeding with the ANOVA in the usual manner.

Logarithmic Transformation

The logarithmic transformation is used when the standard deviation is proportional to the treatment means, that is, if the coefficient of variation is constant. The transformation achieves additivity. It is used for count data (large whole numbers) covering a wide range, such as number of insects per plant or number of egg masses. When zero occurs, 1 is added to each observation before transformation.

Kishore Sinha

See also Analysis of Variance (ANOVA); Experimental Design; Fixed-Effects Models; Randomized Block Design; Research Design Principles

Further Readings

Bailey, R. A. (2008). Design of comparative experiments. Cambridge, UK: Cambridge University Press.
Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters. New York: Wiley.
Dean, A., & Voss, D. (1999). Design and analysis of experiments. New York: Springer.
Montgomery, D. C. (2001). Design and analysis of experiments. New York: Wiley.

COMPUTERIZED ADAPTIVE TESTING

Adaptive testing, in general terms, is an assessment process in which the test items administered to examinees differ based on the examinees' responses to previous questions. Computerized adaptive testing uses computers to facilitate the "adaptive" aspects of the process and to automate scoring. This entry discusses historical perspectives, goals, psychometric and item selection approaches, and issues associated with adaptive testing in general and computerized adaptive testing in particular.

Historical Perspectives

Adaptive testing is not new. Through the ages examiners have asked questions and, depending on the response given, have chosen different directions for further questioning for different examinees. Clinicians have long taken adaptive approaches, and since the advent of standardized intelligence testing, many such tests have used adaptive techniques. For both the 1916 edition of the Stanford-Binet and the 1939 edition of the Wechsler-Bellevue intelligence tests, the examiner chose a starting point and, if the examinee answered correctly, asked harder questions until a string of incorrect answers was provided. If the first answer was incorrect, an easier starting point was chosen.

Early group-administered adaptive tests faced administration problems in the precomputerized era. Scoring each item and making continual routing decisions was too slow a process. One alternative, explored by William Angoff and Edith Huddleston in the 1950s, was two-stage testing. An examinee would take a half-length test of medium difficulty, that test was scored, and then the student would be routed to either an easier or a harder last half of the test. Another alternative involved use of a special marker that revealed invisible ink when used. Examinees would use their markers to indicate their responses. The marker would reveal the number of the next item they should answer. These and other approaches were too complex logistically and never became popular.

Group-administered tests could not be administered adaptively with the necessary efficiency until the increasing availability of computer-based testing systems, circa 1970. At that time research on computer-based testing proliferated. In 1974, Frederic Lord (inventor of the theoretical underpinnings of
much of modern psychometric theory) suggested the field would benefit from researchers' getting together to share their ideas. David Weiss, a professor at the University of Minnesota, thought this a good idea and coordinated a series of three conferences on computerized adaptive testing, in 1975, 1977, and 1979, bringing together the greatest thinkers in the field. These conferences energized the research community, which focused on the theoretical underpinnings and research necessary to develop and establish the psychometric quality of computerized adaptive tests.

The first large-scale computerized adaptive tests appeared circa 1985, including the U.S. Army's Computerized Adaptive Screening Test, the College Board's Computerized Placement Tests (the forerunner of today's Accuplacer), and the Computerized Adaptive Differential Ability Tests of the Psychological Corporation (now part of Pearson Education). Since that time computerized adaptive tests have proliferated in all spheres of assessment.

Goals of Adaptive Testing

The choice of test content and administration mode should be based on the needs of a testing program. What is best for one program is not necessarily of importance for other programs. There are primarily three different needs that can be addressed well by adaptive testing: maximization of test reliability for a given testing time, minimization of individual testing time to achieve a particular reliability or decision accuracy, and improvement of diagnostic information.

Maximization of Test Reliability

Maximizing reliability for a given testing time is possible because in conventional testing, many examinees spend time responding to items that are either trivially easy or extremely hard, and thus many items do not contribute to our understanding of what the examinee knows and can do. If a student can answer complex multiplication and division questions, there is little to be gained by asking questions about single-digit addition. When multiple-choice items are used, students are sometimes correct when they randomly guess the answers to very difficult items. Thus the answers to items that are very difficult for a student are worse than useless—they can be misleading.

An adaptive test maximizes reliability by replacing items that are too difficult or too easy with items of an appropriate difficulty. Typically this is done by ordering items by difficulty (usually using sophisticated statistical models such as Rasch scaling or the more general item response theory) and administering more difficult items subsequent to correct responses and easier items after incorrect responses.

By tailoring the difficulty of the items administered to prior examinee responses, a well-developed adaptive test can achieve the reliability of a conventional test with approximately one half to one third the items or, alternatively, can achieve a significantly higher reliability in a given amount of time. (However, testing time is unlikely to be reduced in proportion to the item reduction, because very easy and very difficult items often require little time to answer correctly or to give up on, respectively.)

Minimization of Testing Time

An alternative goal that can be addressed with adaptive testing is minimizing the testing time needed to achieve a fixed level of reliability. Items can be administered until the reliability of the test for a particular student, or the decision accuracy for a test that classifies examinees into groups (such as pass–fail), reaches an acceptable level. If the match between the proficiency of an examinee and the quality of the item pool is good (e.g., if there are many items that measure proficiency particularly well within a certain range of scores), few items will be required to determine an examinee's proficiency level with acceptable precision. Also, some examinees may be very consistent in their responses to items above or below their proficiency level—it might require fewer items to pinpoint their proficiency compared with people who are inconsistent.

Some approaches to adaptive testing depend on the assumption that items can be ordered along a single continuum of difficulty (also referred to as unidimensionality). One way to look at this is that the ordering of item difficulties is the same for all
identifiable groups of examinees. In some areas of testing, this is not a reasonable assumption. For example, in a national test of science for eighth-grade students, some students may have just studied life sciences and others may have completed a course in physical sciences. On average, life science questions will be easier for the first group than for the second, while physical science questions will be relatively easier for the second group. Sometimes the causes of differences in dimensionality may be more subtle, and so in many branches of testing, it is desirable not only to generate an overall total score but to produce a diagnostic profile of areas of strength and weakness, perhaps with the ultimate goal of pinpointing remediation efforts.

An examinee who had recently studied physical science might get wrong a life science item that students in general found easy. Subsequent items would be easier, and that student might never receive the relatively difficult items that she or he could have answered correctly. That examinee might receive an underestimate of her or his true level of science proficiency.

One way to address this issue for multidimensional constructs would be to test each such construct separately. Another alternative would be to create groups of items balanced on content—testlets—and give examinees easier or harder testlets as the examinees progress through the test.

Improvement of Diagnostic Information

Diagnostic approaches to computerized adaptive testing are primarily in the talking stage. Whereas one approach to computerized adaptive diagnostic testing would be to use multidimensional item response models and choose items based on the correctness of prior item responses connected to the difficulty of the items on the multiple dimensions, an alternative approach is to lay out a tested developmental progression of skills and select items based on precursor and "postcursor" skills rather than on simple item difficulty. For example, if one posits that successful addition of two-digit numbers with carrying requires the ability to add one-digit numbers with carrying, and an examinee answers a two-digit addition-with-carrying problem incorrectly, then the next problem asked should be a one-digit addition-with-carrying problem. Thus, in such an approach, expert knowledge of relationships within the content domain drives the selection of the next item to be administered. Although this is an area of great research interest, at this time there are no large-scale adaptive testing systems that use this approach.

Psychometric Approaches to Scoring

There are four primary psychometric approaches that have been used to support item selection and scoring in adaptive testing: maximum likelihood item response theory, Bayesian item response theory, classical test theory, and decision theory.

Maximum likelihood item response theory methods have long been used to estimate examinee proficiency. Whether they use the Rasch model, the three-parameter logistic model, partial credit, or other item response theory models, fixed (previously estimated) item parameters make it fairly easy to estimate the proficiency level most likely to have led to the observed pattern of item scores.

Bayesian methods use information in addition to the examinee's pattern of responses and previously estimated item parameters to estimate examinee proficiency. Bayesian methods assume a population distribution of proficiency scores, often referred to as a prior, and use the unlikeliness of an extreme score to moderate the estimate of the proficiency of examinees who achieved such extreme scores. Since examinees who have large positive or negative errors of measurement tend to get more extreme scores, Bayesian approaches tend to be more accurate (have smaller errors of measurement) than maximum likelihood approaches.

The essence of Bayesian methods for scoring is that the probabilities associated with the proficiency likelihood function are multiplied by the probabilities in the prior distribution, leading to a new, posterior, distribution. Once that posterior distribution is produced, there are two primary methods for choosing an examinee score. The first is called modal estimation or maximum a posteriori—pick the proficiency level that has the highest probability (probability density). The second method is called expected a posteriori and is essentially the mean of the posterior distribution.
Table 1   Record for Two Hypothetical Examinees Taking a Mastery Test Requiring 60% Correct on the Full-Length (100-item) Test

[Table body not extracted; one column each for Examinee 1 and Examinee 2.]
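The stopping rule that Table 1 illustrates (keep administering parallel 10-item testlets until a 99% confidence interval around the cumulative percent correct lies entirely above or below the .60 cut) can be sketched as follows. The normal-approximation interval and the testlet scores are assumptions for illustration; the operational Lewis and Sheehan procedure is more refined.

```python
import math

CUT = 0.60
Z99 = 2.576          # two-sided 99% normal critical value
TESTLET = 10         # items per parallel testlet

def decide(testlet_scores):
    """Return ('pass' | 'fail' | 'undecided', testlets_used)."""
    correct = 0
    for n, score in enumerate(testlet_scores, start=1):
        correct += score
        items = n * TESTLET
        p = correct / items
        half = Z99 * math.sqrt(p * (1 - p) / items)
        if p - half > CUT:       # whole interval above the cut: mastery shown
            return "pass", n
        if p + half < CUT:       # whole interval below the cut: nonmastery
            return "fail", n
    return "undecided", len(testlet_scores)

print(decide([8, 9, 9, 9]))  # a strong examinee -> ('pass', 2)
print(decide([3, 4, 3, 4]))  # a weak examinee -> ('fail', 3)
```

As in the entry's discussion, an examinee far from the cut is classified after only a couple of testlets, while one near the cut keeps the interval straddling .60 and requires many more.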
Item response theory proficiency estimates have different psychometric characteristics from classical test theory approaches to scoring tests (based on the number of items answered correctly). The specifics of these differences are beyond the scope of this entry, but they have led some testing programs (especially ones in which some examinees take the test traditionally and some take it adaptively) to transform the item response theory proficiency estimate to an estimate of the number right that would have been obtained on a hypothetical base form of the test. From the examinee's item response theory–estimated proficiency and the set of item parameters for the base form items, the probability of a correct response can be estimated for each item, and the sum of those probabilities is used to estimate the number of items the examinee would have answered correctly.

Decision theory is a very different approach that can be used for tests that classify or categorize examinees, for example into two groups, passing and failing. One approach that has been implemented for at least one national certification examination was developed by Charles Lewis and Kathleen Sheehan. Parallel testlets—sets of perhaps 10 items that cover the same content and are of about the same difficulty—were developed. After each testlet is administered, a decision is made to either stop testing and make a pass–fail decision or administer another testlet. The decision is based on whether the confidence interval around an examinee's estimated score is entirely above or entirely below the proportion-correct cut score for a pass decision.

For example, consider a situation in which, to pass, an examinee must demonstrate mastery of 60% of the items in a 100-item pool. We could ask all 100 items and see what percentage the examinee answers correctly. Alternatively, we could ask questions and stop as soon as the examinee answers 60 items correctly. This could save as much as 40% of computer time at a computerized test administration center. In fact, one could be statistically confident that an examinee will get at least 60% of the items correct while administering even fewer items.

The Lewis–Sheehan decision theoretic approach is based on a branch of statistical theory called Waldian sequential ratio testing, in which, after each sample of data, the confidence interval is calculated and a judgment is made either that enough information is possessed to make a decision or that more data need to be gathered. Table 1 presents two simplified examples of examinees tested with multiple parallel 10-item testlets. The percent correct is boldfaced when it falls outside of the 99% confidence interval and a pass–fail decision is made.

The confidence interval in this example narrows as the sample of items (number of testlets administered) increases and gets closer to the full 100-item domain. Examinee 1 in Table 1 has a true domain percent correct close to the cut score of .60. Thus
it takes nine testlets to demonstrate the examinee is outside the required confidence interval. Examinee 2 has a true domain percent correct that is much higher than the cut score and thus demonstrates mastery with only two testlets.

Item Selection Approaches

Item selection can be subdivided into two pieces: selecting the first item and selecting all subsequent items. Selection of the first item is unique because for all other items, some current information about student proficiency (scores from previous items) is available. Several approaches can be used for selecting the first item. Everyone can get the same first item—one that is highly discriminating near the center of the score distribution—but such an approach would quickly make that item nonsecure (i.e., it could be shared easily with future examinees). An alternative would be to randomly select from a group of items that discriminate well near the center of the score distribution. Yet another approach might be to input some prior information (for example, class grades or the previous year's test scores) to help determine a starting point.

Once the first item is administered, several approaches exist for determining which items should be administered next to an examinee. There are two primary approaches to item selection for adaptive tests based on item response theory: maximum information and minimum posterior variance. With maximum information approaches, one selects the item that possesses the maximum amount of statistical information at the current estimated proficiency. Maximum information approaches are consistent with maximum likelihood item response theory scoring. Minimum posterior variance methods are appropriate for Bayesian scoring approaches. With this approach, the item is selected that, after scoring, will lead to the smallest variance of the posterior distribution, which is used to estimate the examinee's proficiency.

Most adaptive tests do not apply either of these item selection approaches in their purely statistical form. One reason to vary from these approaches is to ensure breadth of content coverage. Consider a physics test covering mechanics, electricity, magnetism, optics, waves, heat, and thermodynamics. If items are chosen solely by difficulty, some examinees might not receive any items about electricity and others might receive no items about optics. Despite items' being scaled on a national sample, any given examinee might have an idiosyncratic pattern of knowledge. Also, the sample of items in the pool might be missing topic coverage in certain difficulty ranges. A decision must be made to focus on item difficulty, content coverage, or some combination of the two.

Another consideration is item exposure. Since most adaptive test item pools are used for extended periods of time (months if not years), it is possible for items to become public knowledge. Sometimes this occurs as a result of concerted cheating efforts, sometimes just because examinees and potential examinees talk to each other.

More complex item selection approaches, such as those developed by Wim van der Linden, can be used to construct "on the fly" tests that meet a variety of predefined constraints on item selection.

Issues

Item Pool Requirements

Item pool requirements for adaptive tests are typically more challenging than for traditional tests. With the proliferation of adaptive testing, its psychometric advantages have often been delivered, but logistical problems have been discovered. When stakes are high (for example, admissions or licensure testing), test security is very important—ease of cheating would invalidate test scores. It is still prohibitively expensive to arrange for a sufficient number of computers in a secure environment to test hundreds of thousands of examinees at one time, as some of the largest testing programs have traditionally done with paper tests. As an alternative, many computerized adaptive testing programs (such as the Graduate Record Examinations) have moved from three to five administrations a year to administrations almost every day. This exposes items to large numbers of examinees who sometimes remember and share the items. Minimizing widespread cheating when exams are offered so frequently requires the creation of multiple item pools, an expensive undertaking, the cost of which is often passed on to examinees.
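A purely statistical maximum-information rule, combined with the kind of content-balancing compromise described above, can be sketched under the Rasch model, for which item information at proficiency θ is p(1 − p). The item bank and the simple balancing heuristic below are invented for illustration.

```python
import math

# Hypothetical item bank: (item_id, difficulty b, content area).
bank = [
    ("m1", -1.0, "mechanics"), ("m2", 0.2, "mechanics"),
    ("e1", -0.5, "electricity"), ("e2", 1.0, "electricity"),
    ("o1", 0.0, "optics"), ("o2", 0.8, "optics"),
]

def info(theta, b):
    # Rasch item information: p(1 - p) at the current proficiency estimate.
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta, administered, counts):
    # Content balance first: restrict to the areas asked least often so far,
    # then pick the item with maximum information at the current theta.
    candidates = [it for it in bank if it[0] not in administered]
    fewest = min(counts.get(it[2], 0) for it in candidates)
    candidates = [it for it in candidates if counts.get(it[2], 0) == fewest]
    return max(candidates, key=lambda it: info(theta, it[1]))

item = next_item(0.0, set(), {})
```

At θ = 0 this bank first yields the optics item at b = 0 (information .25, the maximum possible); once an optics item has been administered, the balance constraint steers selection to the other content areas even if another optics item is slightly more informative.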
Another issue relates to the characteristics of the item pool required for adaptive testing. In a traditional norm-referenced test, most items are chosen so that about 60% of the examinees answer the item correctly. This difficulty level maximally differentiates people on a test in which everyone gets the same items. Thus few very difficult items are needed, which is good, since it is often difficult to write very difficult items that are neither ambiguous nor trivial. On an adaptive test, the most proficient examinees might all see the same very difficult items, and thus, to maintain security, a higher percentage of such items is needed.

Comparability

Comparability of scores on computerized and traditional tests is a very important issue as long as a testing program uses both modes of administration. Many studies have been conducted to determine whether scores on computer-administered tests are comparable to those on traditional paper tests. Results of individual studies have varied, with some implying there is a general advantage to examinees who take tests on computer, others implying the advantage goes to examinees taking tests on paper, and some showing no statistically significant differences. Because different computer-administration systems use different hardware, software, and user interfaces, and since these studies have looked at many different content domains, perhaps these different results are real. However, in a 2008 meta-analysis of many such studies of K–12 tests in five subject areas, Neal Kingston showed a mean weighted effect size of .01 standard deviations, which was not statistically significant. However, results differed by data source and thus by the particular test administration system used.

Differential Access to Computers

Many people are concerned that differential access to computers might disadvantage some examinees. While the number of studies looking at potential bias related to socioeconomic status, gender, ethnicity, or amount of computer experience is small, most recent studies have not found significant differences in student-age populations.

Neal Kingston

See also Bayes's Theorem; b Parameter; Decision Rule; Guessing Parameter; Item Response Theory; Reliability; "Sequential Tests of Statistical Hypotheses"

Further Readings

Mills, C., Potenza, M., Fremer, J., & Ward, W. (Eds.). (2002). Computer-based testing. Hillsdale, NJ: Lawrence Erlbaum.
Parshall, C., Spray, J., Kalohn, J., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer.
Wainer, H. (Ed.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.

CONCOMITANT VARIABLE

It is not uncommon in designing research for an investigator to collect an array of variables representing characteristics of each observational unit. Some of these variables are central to the investigation, whereas others reflect preexisting differences in observational units and are not of interest per se. The latter are called concomitant variables, also referred to as covariates. Frequently in practice, these incidental variables represent undesired sources of variation influencing the dependent variable and are extraneous to the effects of the manipulated (independent) variables, which are of primary interest. For designed experiments in which observational units are randomized to treatment conditions, failure to account for concomitant variables can exert a systematic influence (or bias) on the different treatment conditions. Alternatively, concomitant variables can increase the error variance, thereby reducing the likelihood of detecting real differences among the groups. Given these potential disadvantages, an ideal design strategy is one that would minimize the effect of the unwanted sources of variation corresponding to these concomitant variables. In practice, two general approaches are used to control the effects of concomitant variables: (1) experimental control and (2) statistical control. An expected benefit of controlling the effects of concomitant variables by either or both of these approaches is a substantial reduction in error variance, resulting in greater
precision for estimating the magnitude of treatment effects and increased statistical power.

Experimental Control

Controlling the effects of concomitant variables is generally desirable. In addition to random assignment of subjects to experimental conditions, methods can be applied to control these variables in the design phase of a study. One approach is to use a small number of concomitant variables as the inclusion criteria for selecting subjects to participate in the study (e.g., only eighth graders whose parents have at least a high school education). A second approach is to match subjects on a small number of concomitant variables and then randomly assign each matched subject to one of the treatment conditions. This requires that the concomitant variables be available prior to the formation of the treatment groups. Blocking, or stratification, as it is sometimes referred to, is another method of controlling concomitant variables in the design stage of a study. The basic premise is that subjects are sorted into relatively homogeneous blocks on the basis of levels of one or two concomitant variables. The experimental conditions are subsequently randomized within each stratum. Exerting experimental control through case selection, matching, and blocking can reduce experimental error, often resulting in improved statistical power to detect differences among the treatment groups. As an exclusive design strategy, however, the usefulness of any one of these three methods to control the effects of concomitant variables is limited. It is necessary to recognize that countless covariates, in addition to those used to block or match subjects, may be affecting the dependent variable and thus posing potential threats to drawing appropriate inferences regarding treatment veracity. In contrast, randomization to experimental conditions ensures that any idiosyncratic differences among the groups are not systematic at the outset of the experiment. Random assignment does not guarantee that the groups are equivalent but rather that any observed differences are due only to chance.

Statistical Control

The effects of concomitant variables can be controlled statistically if they are included in the models used in analyzing the data, for instance by the use of socioeconomic status as a covariate in an analysis of covariance (ANCOVA). Statistical control in regression procedures such as ANCOVA means removing from the experimental error and from the treatment effect all extraneous variance associated with the concomitant variable. This reduction in error variance is proportional to the strength of the linear relationship between the dependent variable and the covariate and is often quite substantial. Consequently, statistical control is most advantageous in situations in which the concomitant variable and outcome have a strong linear dependency (e.g., a covariate that represents an earlier administration of the same instrument used to measure the dependent variable).

In quasi-experimental designs in which random assignment of observational units to treatments is not possible or potentially unethical, statistical control is achieved through adjusting the estimated treatment effect by controlling for preexisting group differences on the covariate. This adjustment can be striking, especially when the difference on the concomitant variable across intact treatment groups is dramatic. Like blocking or matching, using ANCOVA to equate groups on important covariates should not be viewed as a substitute for randomization. Control of all potential concomitant variables is not possible in quasi-experimental designs, which therefore are always subject to threats to internal validity from unidentified covariates. This occurs because uncontrolled covariates may be confounded with the effects of the treatment in a manner such that group comparisons are biased.

Benefits of Control

To the extent possible, research studies should be designed to control concomitant variables that are likely to systematically influence or mask the important relationships motivating the study. This can be accomplished by exerting experimental control through restricted selection procedures, randomization of subjects to experimental conditions, or stratification. In instances in which complete experimental control is not feasible, or in conjunction with limited experimental control, statistical adjustments can be made through regression procedures such as ANCOVA. In either case,
[From the Concurrent Validity entry. Figure: Classification from Hands-On Measure versus Diagnosis by Psychologist.]
availability or the cost of the measure. Thus, the practical limitations associated with criterion measurements that are inconvenient, expensive, or highly impractical to obtain may outweigh other desirable qualities of these measures.

Jessica Lynn Mislevy and André A. Rupp

See also Criterion Validity; Predictive Validity

Further Readings

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth Group.
Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Pearson Education.

CONFIDENCE INTERVALS

selected from the same population, one confidence interval is constructed based on one sample with a certain confidence level. Together, all the confidence intervals should include the population parameter with the confidence level.

Suppose one is interested in estimating the proportion of bass among all types of fish in a lake. A 95% confidence interval for this proportion, [25%, 36%], is constructed on the basis of a random sample of fish in the lake. After more independent random samples of fish are selected from the lake, through the same procedure more confidence intervals are constructed. Together, all these confidence intervals will contain the true proportion of bass in the lake approximately 95% of the time.

The lower and upper boundaries of a confidence interval are called the lower confidence limit and the upper confidence limit, respectively. In the earlier example, 25% is the lower 95% confidence limit, and 36% is the upper 95% confidence limit.
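The repeated-sampling interpretation in the bass example above can be checked with a short simulation, sketched here in Python (not part of the original entry; the true proportion of 30% and the sample size of 200 are invented values chosen for illustration):

```python
import random

random.seed(1)

TRUE_P = 0.30    # hypothetical true proportion of bass in the lake
N = 200          # fish caught per sample
Z = 1.96         # upper .025 quantile of the standard normal (95% level)
TRIALS = 10_000

covered = 0
for _ in range(TRIALS):
    # Draw one random sample and compute the sample proportion of bass.
    bass = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = bass / N
    se = (p_hat * (1 - p_hat) / N) ** 0.5
    # Build one 95% confidence interval from this one sample.
    lo, hi = p_hat - Z * se, p_hat + Z * se
    covered += lo <= TRUE_P <= hi

print(covered / TRIALS)  # close to 0.95
```

Each pass constructs one interval from one sample; across many repetitions, roughly 95% of the intervals capture the true proportion, which is exactly the coverage property described above.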
0 at the significance level .025. The researcher will construct a one-sided confidence interval taking the form (−∞, b] for some constant b. Note that the width of a one-sided confidence interval is infinity. Following the above example, the null hypothesis would be that the mean is less than or equal to 0 at the .025 level. If the 97.5% one-sided confidence interval is (−∞, 3.8], then the null hypothesis is accepted because 0 is included in the interval. If the 97.5% confidence interval is (−∞, −2.1] instead, then the null hypothesis is rejected because there is no overlap between (−∞, −2.1] and [0, ∞).

Examples

Confidence Intervals for a Population Mean With Confidence Level 100(1 − α)%

Confidence intervals for a population mean are constructed on the basis of the sample mean distribution.

The Population Follows a Normal Distribution With a Known Variance σ²

After a random sample of size N is selected from the population, one is able to calculate the sample mean x̄, which is the average of all the observations. The confidence interval is [x̄ − z_{α/2} × σ/√N, x̄ + z_{α/2} × σ/√N]; that is, it is centered at x̄ with half-length z_{α/2} × σ/√N. Here z_{α/2} is the upper α/2 quantile, meaning P(z_{α/2} ≤ Z) = α/2, where Z is a standard normal random variable. To find z_{α/2}, one can either refer to the standard normal distribution table or use statistical software. Nowadays most statistical software, such as Excel, R, SAS, SPSS (an IBM company, formerly called PASW Statistics), and S-Plus, has a simple command that will do the job. For commonly used confidence intervals, the 90% confidence level corresponds to z_{0.05} = 1.645, 95% corresponds to z_{0.025} = 1.96, and 99% corresponds to z_{0.005} = 2.576.

The above confidence interval will shrink to a point as N goes to infinity, so the interval estimate turns into a point estimate. This can be interpreted as taking the whole population as the sample: the sample mean is then actually the population mean.

The Population Follows a Normal Distribution With Unknown Variance

Suppose a sample of size N is randomly selected from the population with observations x₁, x₂, …, x_N. Let x̄ = Σᵢ xᵢ / N; this is the sample mean. The sample variance is defined as s² = Σᵢ (xᵢ − x̄)² / (N − 1). The confidence interval is [x̄ − t_{N−1,α/2} × s/√N, x̄ + t_{N−1,α/2} × s/√N]. Here t_{N−1,α/2} is the upper α/2 quantile, meaning P(t_{N−1,α/2} ≤ T_{N−1}) = α/2, where T_{N−1} follows the t distribution with N − 1 degrees of freedom. Refer to a t-distribution table or use software to get t_{N−1,α/2}. If N is greater than 30, one can use z_{α/2} instead of t_{N−1,α/2} in the confidence interval formula because there is little difference between them for large enough N.

As an example, suppose the students' test scores in a class follow a normal distribution. One wants to construct a 95% confidence interval for the class average score based on an available random sample of size N = 10. The 10 scores are 69, 71, 77, 79, 82, 84, 80, 94, 78, and 67. The sample mean and the sample variance are 78.1 and 62.77, respectively. According to the t-distribution table, t_{9,0.025} = 2.262. The 95% confidence interval for the class average score is [72.43, 83.77].

The Population Follows a General Distribution Other Than the Normal Distribution

This is a common situation one might see in practice. After obtaining a random sample of size N from the population, where it is required that N ≥ 30, the sample mean and the sample variance can be computed as in the previous subsections, denoted by x̄ and s², respectively. According to the central limit theorem, an approximate confidence interval can be expressed as [x̄ − z_{α/2} × s/√N, x̄ + z_{α/2} × s/√N].

Confidence Interval for the Difference of Two Population Means With Confidence Level 100(1 − α)%

One will select a random sample from each population. Suppose the sample sizes are N₁ and N₂. Denote the sample means by x̄₁ and x̄₂
and the sample variances by s₁² and s₂², respectively, for the two samples. The confidence interval is [x̄₁ − x̄₂ − z_{α/2} × √(s₁²/N₁ + s₂²/N₂), x̄₁ − x̄₂ + z_{α/2} × √(s₁²/N₁ + s₂²/N₂)].

If one believes that the two population variances are about the same, the confidence interval will be [x̄₁ − x̄₂ − t_{N₁+N₂−2,α/2} × s_p, x̄₁ − x̄₂ + t_{N₁+N₂−2,α/2} × s_p], where s_p = √{(N₁ + N₂)[(N₁ − 1)s₁² + (N₂ − 1)s₂²] / [(N₁ + N₂ − 2)N₁N₂]}.

Continuing with the above example about the students' scores, call that class Class A. If one is interested in comparing the average scores between Class A and another class, Class B, the confidence interval for the difference between the average class scores will be constructed. First, randomly select a group of students in Class B. Suppose the group size is 8. These eight students' scores are 68, 79, 59, 76, 80, 89, 67, and 74. The sample mean is 74, and the sample variance is 86.56 for Class B. If Class A and Class B are believed to have different population variances, then the 95% confidence interval for the difference of the average scores is [−4.03, 12.17] by the first formula provided in this subsection. If one believes these two classes have about the same population variance, then the 95% confidence interval will be changed to [−4.47, 12.57] by the second formula.

Confidence Intervals for a Single Proportion and the Difference of Two Proportions

Sometimes one may need to construct confidence intervals for a single unknown population proportion. Denote by p̂ the sample proportion, which can be obtained from a random sample from the population. The estimated standard error for the proportion is se(p̂) = √(p̂(1 − p̂)/N). Thus the confidence interval for the unknown proportion is [p̂ − z_{α/2} × se(p̂), p̂ + z_{α/2} × se(p̂)]. This confidence interval is constructed on the basis of the normal approximation (refer to the central limit theorem for the normal approximation). The normal approximation is not appropriate when the proportion is very close to 0 or 1. A rule of thumb is that when Np̂ > 5 and N(1 − p̂) > 5, usually the normal approximation works well.

For example, a doctor wants to construct a 99% confidence interval for the chance of having a certain disease by studying patients' x-ray slides. N = 30 x-ray slides are randomly selected, and the number of positive slides follows a distribution known as the binomial distribution. Suppose 12 of them are positive for the disease. Hence p̂ = 12/30, which is the sample proportion. Since Np̂ = 12 and N(1 − p̂) = 18 are both larger than 5, the confidence interval for the unknown proportion can be constructed using the normal approximation. The estimated standard error for p̂ is 0.09. Thus the lower 99% confidence limit is 0.17 and the upper 99% confidence limit is 0.63. So the 99% confidence interval is [0.17, 0.63].

The range of a proportion is between 0 and 1, but sometimes the constructed confidence interval for the proportion may exceed it. When this happens, one should truncate the confidence interval to make the lower confidence limit 0 or the upper confidence limit 1.

Since the binomial distribution is discrete, a correction for continuity of 0.5/N may be used to improve the performance of confidence intervals: 0.5/N is added to the upper limit and subtracted from the lower limit.

One can also construct confidence intervals for the proportion difference between two populations based on the normal approximation. Suppose two random samples are independently selected from the two populations, with sample sizes N₁ and N₂ and sample proportions p̂₁ and p̂₂, respectively. The estimated population proportion difference is the sample proportion difference, p̂₁ − p̂₂, and the estimated standard error for the proportion difference is se(p̂₁ − p̂₂) = √(p̂₁(1 − p̂₁)/N₁ + p̂₂(1 − p̂₂)/N₂). The confidence interval for the two-sample proportion difference is [(p̂₁ − p̂₂) − z_{α/2} × se(p̂₁ − p̂₂), (p̂₁ − p̂₂) + z_{α/2} × se(p̂₁ − p̂₂)]. Similar to the normal approximation for a single proportion, the approximation for the proportion difference depends on the sample sizes and sample proportions. The rule of thumb is that N₁p̂₁, N₁(1 − p̂₁), N₂p̂₂, and N₂(1 − p̂₂) should each be larger than 10.

Confidence Intervals for Odds Ratio

An odds ratio (OR) is a commonly used effect size for categorical outcomes, especially in health science, and is the ratio of the odds in Category 1 to the odds in Category 2. For example, one wants to find out the relationship between smoking and lung cancer. Two groups of subjects, smokers and nonsmokers, are recruited. After a few years' follow-up, N₁₁ subjects among the smokers are diagnosed with lung cancer and N₂₁ subjects among the nonsmokers. There are N₁₂ and N₂₂ subjects who do not have lung cancer among the smokers and the nonsmokers, respectively.

Table 1  Lung Cancer Among Smokers and Nonsmokers

Smoking Status | Lung Cancer | No Lung Cancer | Total
Smokers        | N₁₁         | N₁₂            | N₁.
Nonsmokers     | N₂₁         | N₂₂            | N₂.
Total          | N.₁         | N.₂            | N

The odds of having lung cancer among the smokers and the nonsmokers are estimated as N₁₁/N₁₂ and N₂₁/N₂₂, respectively. The OR of having lung cancer among the smokers compared with the nonsmokers is the ratio of the above two odds, which is (N₁₁/N₁₂)/(N₂₁/N₂₂) = (N₁₁N₂₂)/(N₂₁N₁₂).

For a relatively large total sample size, ln(OR) is approximately normally distributed, so the construction of the confidence interval for ln(OR) is similar to that for the normal distribution. The standard error for ln(OR) is defined as se(ln(OR)) = √(1/N₁₁ + 1/N₁₂ + 1/N₂₁ + 1/N₂₂). A 95% confidence interval for ln(OR) is [ln(OR) − 1.96 × se(ln(OR)), ln(OR) + 1.96 × se(ln(OR))]. As the exponential function is monotonic, there is a one-to-one mapping between the OR and ln(OR). Thus a 95% confidence interval for the OR is [exp(ln(OR) − 1.96 × se(ln(OR))), exp(ln(OR) + 1.96 × se(ln(OR)))].

The confidence intervals for the OR are not symmetric about the estimated OR. But one can still tell the significance of the test on the basis of the corresponding confidence interval for the OR. For the above example, the null hypothesis is that there is no difference between smokers and nonsmokers in the development of lung cancer; that is, OR = 1. If 1 is included in the confidence interval, one should accept the null hypothesis; otherwise, one should reject it.

Confidence Intervals for Relative Risk

Another widely used concept in health care is relative risk (RR), which compares the risks in two groups. Risk is defined as the chance of having a specific outcome among subjects in a group. Taking the above example, the risk of having lung cancer among smokers is estimated as N₁₁/(N₁₁ + N₁₂), and the risk among nonsmokers is estimated as N₂₁/(N₂₁ + N₂₂). The RR is the ratio of the above two risks, which is [N₁₁/(N₁₁ + N₁₂)] / [N₂₁/(N₂₁ + N₂₂)].

Like the OR, the sampling distribution of ln(RR) is approximately normal. The standard error for ln(RR) is se(ln(RR)) = √(1/N₁₁ − 1/N₁. + 1/N₂₁ − 1/N₂.). A 95% confidence interval for ln(RR) is [ln(RR) − 1.96 × se(ln(RR)), ln(RR) + 1.96 × se(ln(RR))]. Thus the 95% confidence interval for the RR is [exp(ln(RR) − 1.96 × se(ln(RR))), exp(ln(RR) + 1.96 × se(ln(RR)))].

The confidence intervals for the RR are not symmetric about the estimated RR either. One can tell the significance of a test from the corresponding confidence interval for the RR. Usually the null hypothesis is that RR = 1, which means that the two groups have the same risk. For the above example, the null hypothesis would be that the risks of developing lung cancer among smokers and nonsmokers are equal. If 1 is included in the confidence interval, one may accept the null hypothesis. If not, the null hypothesis should be rejected.

Confidence Intervals for Variance

A confidence interval for an unknown population variance can be constructed with the use of a central chi-square distribution. For a random sample with size N and sample variance s², an approximate two-sided 100(1 − α)% confidence interval for the population variance σ² is [(N − 1)s²/χ²_{N−1}(α/2), (N − 1)s²/χ²_{N−1}(1 − α/2)]. Here χ²_{N−1}(α/2) is the upper α/2 quantile, satisfying the requirement that the probability that a central chi-square random variable with N − 1 degrees of freedom is greater than χ²_{N−1}(α/2) is α/2. Note that this confidence interval may not work well if the sample size is small or the distribution is far from normal.
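The worked examples above (the class scores and the x-ray proportion) can be reproduced with a few lines of standard-library Python. This is a sketch: the t quantile is hardcoded from a table, and small rounding differences from the printed intervals are expected.

```python
from statistics import mean, variance

# One-sample 95% t interval for the Class A mean:
# [x_bar - t_{N-1,.025}*s/sqrt(N), x_bar + t_{N-1,.025}*s/sqrt(N)]
class_a = [69, 71, 77, 79, 82, 84, 80, 94, 78, 67]
n_a = len(class_a)
xbar_a, s2_a = mean(class_a), variance(class_a)   # 78.1 and about 62.77
T9 = 2.262                                        # t_{9,0.025}, from a t table
half = T9 * (s2_a / n_a) ** 0.5
ci_mean = (xbar_a - half, xbar_a + half)

# 95% interval for the difference of two means (unequal variances):
# (x_bar1 - x_bar2) +/- z_{.025}*sqrt(s1^2/N1 + s2^2/N2)
class_b = [68, 79, 59, 76, 80, 89, 67, 74]
n_b = len(class_b)
xbar_b, s2_b = mean(class_b), variance(class_b)
Z025 = 1.96
se_diff = (s2_a / n_a + s2_b / n_b) ** 0.5
ci_diff = (xbar_a - xbar_b - Z025 * se_diff, xbar_a - xbar_b + Z025 * se_diff)

# 99% interval for a single proportion (12 positives out of 30 slides),
# truncated to stay inside [0, 1]:
n, p_hat = 30, 12 / 30
Z005 = 2.576                                      # z_{0.005}
se_p = (p_hat * (1 - p_hat) / n) ** 0.5           # about 0.09
ci_prop = (max(0.0, p_hat - Z005 * se_p), min(1.0, p_hat + Z005 * se_p))

print(ci_mean, ci_diff, ci_prop)
```

The proportion interval comes out at roughly [0.17, 0.63], matching the entry, and the interval for the difference of the class means contains 0, so the two classes do not differ significantly at the .05 level.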
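The odds-ratio, relative-risk, and variance intervals above can be sketched in the same way. The 2×2 counts below are invented for illustration (the entry leaves them symbolic), and the chi-square quantiles are hardcoded from a table.

```python
from math import exp, log, sqrt

# Hypothetical counts for Table 1: rows smokers/nonsmokers,
# columns lung cancer / no lung cancer.
n11, n12 = 30, 70
n21, n22 = 10, 90

# 95% CI for the odds ratio, built on the log scale.
or_hat = (n11 * n22) / (n21 * n12)
se_lor = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
ci_or = (exp(log(or_hat) - 1.96 * se_lor), exp(log(or_hat) + 1.96 * se_lor))

# 95% CI for the relative risk, also built on the log scale.
rr_hat = (n11 / (n11 + n12)) / (n21 / (n21 + n22))
se_lrr = sqrt(1/n11 - 1/(n11 + n12) + 1/n21 - 1/(n21 + n22))
ci_rr = (exp(log(rr_hat) - 1.96 * se_lrr), exp(log(rr_hat) + 1.96 * se_lrr))

# 95% CI for a population variance, reusing the Class A sample variance:
# [(N-1)s^2 / chi2_{N-1}(.025), (N-1)s^2 / chi2_{N-1}(.975)]
n, s2 = 10, 62.77
CHI2_HI = 19.023   # upper .025 quantile of chi-square with 9 df (from a table)
CHI2_LO = 2.700    # upper .975 quantile of chi-square with 9 df (from a table)
ci_var = ((n - 1) * s2 / CHI2_HI, (n - 1) * s2 / CHI2_LO)

print(ci_or, ci_rr, ci_var)
```

With these counts both ratio intervals lie entirely above 1, so the null hypotheses OR = 1 and RR = 1 would be rejected at the .05 level; note that neither interval is symmetric about its point estimate, as the entry emphasizes.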
Simultaneous Confidence Intervals

Simultaneous confidence intervals are intervals for estimating two or more parameters at a time. For example, suppose μ₁ and μ₂ are the means of two different populations. One wants to find confidence intervals I₁ and I₂ simultaneously such that

P(μ₁ ∈ I₁ and μ₂ ∈ I₂) = 1 − α.

If the sample x₁ used to estimate μ₁ is independent of the sample x₂ for μ₂, then I₁ and I₂ can simply be calculated as 100√(1 − α)% confidence intervals for μ₁ and μ₂, respectively.

The simultaneous confidence intervals I₁ and I₂ can be used to test whether μ₁ and μ₂ are equal. If I₁ and I₂ do not overlap, then μ₁ and μ₂ are significantly different from each other at a level less than α.

Simultaneous confidence intervals can be generalized into a confidence region in the multidimensional parameter space, especially when the estimates for the parameters are not independent. A 100(1 − α)% confidence region D for the parameter

Qiaoyan Hu, Shi Zhao, and Jie Yang

See also Bayes's Theorem; Bootstrapping; Central Limit Theorem; Normal Distribution; Odds Ratio; Sample; Significance, Statistical

Further Readings

Blyth, C. R., & Still, H. A. (1983). Binomial confidence intervals. Journal of the American Statistical Association, 78, 108–116.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions. Hoboken, NJ: Wiley.
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: Comparison of seven methods. Statistics in Medicine, 17, 857–872.
Pagano, M., & Gauvreau, K. (2000). Principles of biostatistics. Pacific Grove, CA: Duxbury.
Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage.
Stuart, A., Ord, K., & Arnold, S. (1998). Kendall's advanced theory of statistics: Vol. 2A. Classical inference and the linear model (6th ed.). London: Arnold.
[From the Confirmatory Factor Analysis entry]
errors are not independent of one another. In such models, measurement errors can be specified to correlate. Correlating measurement errors allows hypotheses to be tested regarding shared variance that is not due to the underlying factors. Specifying the presence of correlated measurement errors in a CFA model should be based primarily on model parsimony and, perhaps most important, on substantive considerations.

Conducting a Confirmatory Factor Analysis

Structural equation modeling is also referred to as covariance structure analysis because the covariance matrix is the focus of the analysis. The general null hypothesis to be tested in CFA is

Σ = Σ(θ),

where Σ is the population covariance matrix, estimated by the sample covariance matrix S, and Σ(θ) is the model-implied covariance matrix, estimated by Σ(θ̂). Researchers using CFA seek to specify a model that most precisely explains the relationships among the variables in the original data set. In other words, the model put forth by the researcher reproduces, or fits, the observed sample data to some degree. The more precise the fit of the model to the data, the smaller the difference between the sample covariance matrix, S, and the model-implied covariance matrix, Σ(θ̂). To evaluate the null hypothesis in this manner, a sequence of steps common to implementing structural equation models must be followed.

The first step, often considered the most challenging, requires the researcher to specify the model to be evaluated. To do so, researchers must use all available information (e.g., theory, previous research) to postulate the relationships they expect to find in the observed data prior to the data collection process. Postulating a model may also involve the use of EFA as an initial step in developing the model. Different data sets must be used when this approach is taken because EFA results are subject to capitalizing on chance. Following an EFA with a CFA on the same data set may compound the capitalization-on-chance problem and lead to inaccurate conclusions based on the results.

Next, the researcher must determine whether the model is identified. At this stage, the researcher must determine whether, based on the sample covariance matrix, S, and the model-implied covariance matrix, Σ(θ̂), a unique estimate of each unknown parameter in the model can be identified. The model is identified if the number of parameters to be estimated is less than or equal to the number of unique elements in the variance–covariance matrix used in the analysis. This is the case when the degrees of freedom for the model are greater than or equal to 0. In addition, a metric must be defined for every latent variable, including the measurement errors. This is typically done by setting the metric of each latent variable equal to the metric of one of its indicators (i.e., fixing the loading between an indicator and its respective latent variable to 1) but can also be done by setting the variance of each latent variable equal to 1.

The process of model estimation in CFA (and SEM in general) involves the use of a fitting function such as generalized least squares or maximum likelihood (ML) to obtain estimates of model parameters that minimize the discrepancy between the sample covariance matrix, S, and the model-implied covariance matrix, Σ(θ̂). ML estimation is the most commonly used method of model estimation. All the major software packages (e.g., AMOS, EQS, LISREL, Mplus) available for posing and testing CFA models provide a form of ML estimation. This method has also been extended to address issues that are common in applied settings (e.g., nonnormal data, missing data), making CFA applicable to a wide variety of data types.

Various goodness-of-fit indices are available for determining whether the sample covariance matrix, S, and the model-implied covariance matrix, Σ(θ̂), are sufficiently equal to deem the model meaningful. Many of these indices are derived directly from the ML fitting function. The primary index is the chi-square. Because of the problems inherent in this index (e.g., it is inflated by sample size), assessing the fit of a model requires the use of multiple indices from the three broadly defined categories of indices: (1) absolute fit, (2) parsimony correction, and (3) comparative fit. The extensive research on fit indices has fueled the debate and answered many questions as to which are useful and what cutoff values should be adopted for determining adequate model fit in a variety of situations. Research has indicated that fit indices from each of these categories provide
either multiple group models or multiple indicator–multiple cause (MIMIC) models. While these models are considered interchangeable, there are advantages to employing the multiple group approach. In the multiple group approach, tests of invariance across a greater number of parameters can be conducted. The MIMIC model tests for differences in intercepts and factor means, essentially providing information about which covariates have direct effects in order to determine what grouping variables might be important in a multiple group analysis. A multiple group model, on the other hand, offers tests for differences in intercepts and factor means as well as tests of other parameters such as the factor loadings, error variances and covariances, factor means, and factor covariances. The caveat of employing a multiple group model, however, lies in the necessity of having a sufficiently large sample size for each group, as well as addressing the difficulties that arise in analyses involving many groups. The MIMIC model is a more practical approach with smaller samples.

Higher Order Models

The CFA models presented to this point have been first-order models. These first-order models include the specification of all necessary parameters excluding the assumed relationship between the factors themselves. This suggests that even though a relationship between the factors is assumed to exist, the nature of that relationship is "unanalyzed," or not specified, in the initial model. Higher order models are used in cases in which the relationship between the factors is of interest. A higher order model focuses on examining the relationship between the first-order factors, resulting in a distinction between variability shared by the first-order factors and variability left unexplained. The process of conducting a CFA with second-order factors is essentially the same as the process of testing a CFA with first-order factors.

Summary

CFA models are used in a variety of contexts. Their popularity results from the need in applied research for formal tests of theories involving unobservable latent constructs. In general, the popularity of SEM and its use in testing causal relationships among constructs will require sound measurement of the latent constructs through the CFA approach.

Greg William Welch

See also Exploratory Factor Analysis; Latent Variable; Structural Equation Modeling

Further Readings

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: Guilford Press.
Spearman, C. E. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201–293.
Thurstone, L. L. (1935). The vectors of the mind. Chicago: University of Chicago Press.

CONFOUNDING

Confounding occurs when two variables systematically covary. Researchers are often interested in examining whether there is a relationship between two or more variables. Understanding the relationship between or among variables, including whether those relationships are causal, can be complicated when an independent or predictor variable covaries with a variable other than the dependent variable. When a variable systematically varies with the independent variable, the confounding variable provides an explanation other than the independent variable for changes in the dependent variable.

Confounds in Correlational Designs

Confounding variables are at the heart of the third-variable problem in correlational studies. In a correlational study, researchers examine the relationship between two variables. Even if two variables are correlated, it is possible that a third, confounding variable is responsible for the apparent relationship between the two variables. For example, if there were a correlation between ice cream consumption and homicide rates, it would be a mistake to assume that eating ice cream causes homicidal rages or that murderers seek frozen treats after killing. Instead, a third
Confounding 221
variable—heat—is likely responsible for both of an independent variable and random assign-
increases in ice cream consumption and homicides ment of participants to experimental conditions,
(given that heat has been shown to increase aggres- it is possible for experiments to contain con-
sion). Although one can attempt to identify and founds. An experiment may contain a confound
statistically control for confounding variables in because the experimenter intentionally or unin-
correlational studies, it is always possible that an tentionally manipulated two constructs in a way
unidentified confound is producing the correlation. that caused their systematic variation. The Illi-
nois Pilot Program on Sequential, Double-Blind
Procedures provides an example of an experi-
Confounds in Quasi-Experimental ment that suffers from a confound. In this study
commissioned by the Illinois legislature, eyewit-
and Experimental Designs
ness identification procedures conducted in sev-
The goal of quasi-experimental and experimen- eral Illinois police departments were randomly
tal studies is to examine the effect of some treat- assigned to one of two conditions. For the
ment on an outcome variable. When the sequential, double-blind condition, administra-
treatment systematically varies with some other tors who were blind to the suspect’s identity
variable, the variables are confounded, meaning showed members of a lineup to an eyewitness
that the treatment effect is comingled with the sequentially (i.e., one lineup member at a time).
effects of other variables. Common sources of For the single-blind, simultaneous condition,
confounding include history, maturation, instru- administrators knew which lineup member was
mentation, and participant selection. History the suspect and presented the witness with all
confounds may arise in quasi-experimental the lineup members at the same time. Research-
designs when an event that affects the outcome ers then examined whether witnesses identified
variable happens between pretreatment measure- the suspect or a known-innocent lineup member
ment of the outcome variable and its posttreat- at different rates depending on the procedure
ment measurement. The events that occur used. Because the mode of lineup presentation
between pre- and posttest measurement, rather (simultaneous vs. sequential) and the admini-
than the treatment, may be responsible for strator’s knowledge of the suspect’s identity
changes in the dependent variable. Maturation were confounded, it is impossible to determine
confounds are a concern if participants could whether the increase in suspect identifications
have developed—cognitively, physically, emo- found for the single-blind, simultaneous presen-
tionally—between pre- and posttest measure- tations is due to administrator knowledge, the
ment of the outcome variable. Instrumentation mode of presentation, or some interaction of the
confounds occur when different instruments are two variables. Thus, manipulation of an inde-
used to measure the dependent variable at pre- pendent variable protects against confounding
and posttest or when the instrument used to col- only when the manipulation cleanly varies a sin-
lect the observation deteriorates (e.g., a spring gle construct.
loosens or wears out on a key used for respond- Confounding can also occur in experiments if
ing in a timed task). Selection confounds may be there is a breakdown in the random assignment of
present if the participants are not randomly participants to conditions. In applied research, it is
assigned to treatments (e.g., use of intact groups, not uncommon for partners in the research process
participants self-select into treatment groups). In to want an intervention delivered to people who
each case, the confound provides an alternative deserve or are in need of the intervention, resulting
explanation—an event, participant development, in the funneling of different types of participants
instrumentation changes, preexisting differences into the treatment and control conditions. Random
between groups—for any treatment effects on assignment can also fail if the study’s sample size is
the outcome variable. relatively small because in those situations even ran-
Even though the point of conducting an dom assignment may result in people with particu-
experiment is to control the effects of potentially lar characteristics appearing in treatment conditions
confounding variables through the manipulation rather than in control conditions merely by chance.
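The third-variable problem described above can be illustrated with a small simulation (a sketch with arbitrary coefficients, not part of the original entry): a "heat" variable drives both ice cream consumption and homicides, so the two outcomes correlate even though neither causes the other, and the association disappears once the confound is statistically controlled.

```python
import numpy as np

# Hypothetical confounding scenario: "heat" causes both outcomes.
rng = np.random.default_rng(42)
n = 1_000
heat = rng.normal(size=n)
ice_cream = 0.8 * heat + rng.normal(size=n)   # ice cream consumption
homicides = 0.8 * heat + rng.normal(size=n)   # homicide/aggression index

# The two outcomes correlate even though neither causes the other.
r_raw = np.corrcoef(ice_cream, homicides)[0, 1]

def residualize(y, x):
    """Remove the linear effect of x from y (simple regression residuals)."""
    xc = x - x.mean()
    slope = xc @ (y - y.mean()) / (xc @ xc)
    return y - y.mean() - slope * xc

# Partial correlation controlling for the confound is near zero.
r_partial = np.corrcoef(residualize(ice_cream, heat),
                        residualize(homicides, heat))[0, 1]
```

Statistically controlling for an identified confound removes the spurious association here; as the entry notes, however, an unidentified confound can never be ruled out in a correlational design.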
a measure of the similarity of two factorial configurations. The name congruence coefficient was later tailored by Ledyard R. Tucker. The congruence coefficient is also sometimes called a monotonicity coefficient.

The RV coefficient was introduced by Yves Escoufier as a measure of similarity between square symmetric matrices (specifically: positive semidefinite matrices) and as a theoretical tool to analyze multivariate techniques. The RV coefficient is used in several statistical techniques, such as statis and distatis. In order to compare rectangular matrices with the RV or the Mantel coefficients, the first step is to transform these rectangular matrices into square matrices.

The Mantel coefficient was originally introduced by Nathan Mantel in epidemiology, but it is now widely used in ecology.

The congruence and the Mantel coefficients are cosines (recall that the coefficient of correlation is a centered cosine), and as such, they take values between −1 and +1. The RV coefficient is also a cosine, but because it is a cosine between two matrices of scalar products (which, technically speaking, are positive semidefinite matrices), it corresponds actually to a squared cosine, and therefore the RV coefficient takes values between 0 and 1.

The computational formulas of these three coefficients are almost identical, but their usage and theoretical foundations differ because these coefficients are applied to different types of matrices. Also, their sampling distributions differ because of the types of matrices on which they are applied.

Notations and Computational Formulas

Let X be an I by J matrix and Y be an I by K matrix. The vec operation transforms a matrix into a vector whose entries are the elements of the matrix. The trace operation applies to square matrices and gives the sum of the diagonal elements.

Congruence Coefficient

The congruence coefficient is defined when both matrices have the same number of rows and columns (i.e., J = K). These matrices can store factor scores (for observations) or factor loadings (for variables). The congruence coefficient is denoted φ or sometimes r_c, and it can be computed with three different equivalent formulas (where T denotes the transpose of a matrix):

\[
\varphi = r_c = \frac{\sum_{i,j} x_{i,j}\, y_{i,j}}{\sqrt{\Bigl(\sum_{i,j} x_{i,j}^2\Bigr)\Bigl(\sum_{i,j} y_{i,j}^2\Bigr)}} \tag{1}
\]

\[
= \frac{\mathrm{vec}\{X\}^T\,\mathrm{vec}\{Y\}}{\sqrt{\mathrm{vec}\{X\}^T\,\mathrm{vec}\{X\}\;\mathrm{vec}\{Y\}^T\,\mathrm{vec}\{Y\}}} \tag{2}
\]

\[
= \frac{\mathrm{trace}\{XY^T\}}{\sqrt{\mathrm{trace}\{XX^T\}\;\mathrm{trace}\{YY^T\}}} \tag{3}
\]

RV Coefficient

The RV coefficient was defined by Escoufier as a similarity coefficient between positive semidefinite matrices. Escoufier and Pierre Robert pointed out that the RV coefficient had important mathematical properties because it can be shown that most multivariate analysis techniques amount to maximizing this coefficient with suitable constraints. Recall, at this point, that a matrix S is called positive semidefinite when it can be obtained as the product of a matrix by its transpose. Formally, we say that S is positive semidefinite when there exists a matrix X such that

\[
S = XX^T. \tag{4}
\]

Note that as a consequence of the definition, positive semidefinite matrices are square and symmetric, and that their diagonal elements are always larger than or equal to zero.

If S and T denote two positive semidefinite matrices of same dimensions, the RV coefficient between them is defined as

\[
R_V = \frac{\mathrm{trace}\{S^T T\}}{\sqrt{\mathrm{trace}\{S^T S\} \times \mathrm{trace}\{T^T T\}}} \tag{5}
\]
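Equations 1 through 5 translate directly into a few lines of code. The following numpy sketch (function names are ours, not from the entry) implements the congruence coefficient via its trace form (Equation 3), which is equivalent to the elementwise form of Equation 1, and the RV coefficient of Equation 5:

```python
import numpy as np

def congruence(X, Y):
    """Congruence coefficient (Equations 1-3): a cosine between two
    same-sized matrices viewed as vectors; ranges from -1 to +1."""
    num = np.trace(X @ Y.T)                        # equals vec(X).T @ vec(Y)
    den = np.sqrt(np.trace(X @ X.T) * np.trace(Y @ Y.T))
    return num / den

def rv(S, T):
    """RV coefficient (Equation 5) between two positive semidefinite
    matrices, e.g. S = X @ X.T; a squared cosine, so it ranges 0 to 1."""
    num = np.trace(S.T @ T)
    den = np.sqrt(np.trace(S.T @ S) * np.trace(T.T @ T))
    return num / den
```

For rectangular data matrices X and Y with the same number of rows, `rv(X @ X.T, Y @ Y.T)` compares the two configurations of rows, which is the transformation into square matrices mentioned above.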
Some approximations for the sampling distributions have been derived recently for the congruence coefficient and the RV coefficient, with particular attention given to the RV coefficient. The sampling distribution for the Mantel coefficient has not been satisfactorily approximated, and the statistical tests provided for this coefficient rely mostly on permutation tests.

Congruence Coefficient

Recognizing that analytical methods were unsuccessful, Bruce Korth and Tucker decided to use Monte Carlo simulations to gain some insights into the sampling distribution of the congruence coefficient. Their work was completed by Wendy J. Broadbooks and Patricia B. Elmore. From this work, it seems that the sampling distribution of the congruence coefficient depends on several parameters, including the original factorial structure and the intensity of the population coefficient, and therefore no simple picture emerges, but some approximations can be used. In particular, for testing that a congruence coefficient is null in the population, an approximate conservative test is to use Fisher's Z transform and to treat the congruence coefficient like a coefficient of correlation. Broadbooks and Elmore have provided tables for population values different from zero. With the availability of fast computers, these tables can easily be extended to accommodate specific cases.

Example

Here we use an example from Hervé Abdi and Dominique Valentin (2007). Two wine experts are rating six wines on three different scales. The results of their ratings are provided in the two matrices below, denoted X and Y:

\[
X = \begin{bmatrix}
1 & 6 & 7 \\ 5 & 3 & 2 \\ 6 & 1 & 1 \\ 7 & 1 & 2 \\ 2 & 5 & 4 \\ 3 & 4 & 4
\end{bmatrix}
\quad\text{and}\quad
Y = \begin{bmatrix}
3 & 6 & 7 \\ 4 & 4 & 3 \\ 7 & 1 & 1 \\ 2 & 2 & 2 \\ 2 & 6 & 6 \\ 1 & 7 & 5
\end{bmatrix}. \tag{11}
\]

For computing the congruence coefficient, these two matrices are transformed into two vectors of 6 × 3 = 18 elements each, and a cosine (cf. Equation 1) is computed between these two vectors. This gives a value of the coefficient of congruence of φ = .7381. In order to evaluate whether this value is significantly different from zero, a permutation test with 10,000 permutations was performed. In this test, the rows of one of the matrices were randomly permuted, and the coefficient of congruence was computed for each of these 10,000 permutations. The probability of obtaining a value of φ = .7381 under the null hypothesis was evaluated as the proportion of the congruence coefficients larger than φ = .7381. This gives a value of p = .0259, which is small enough to reject the null hypothesis at the .05 alpha level, and thus one can conclude that the agreement between the ratings of these two experts cannot be attributed to chance.

RV Coefficient

Statistical approaches for the RV coefficient have focused on permutation tests. In this framework, the permutations are performed on the entries of each column of the rectangular matrices X and Y used to create the matrices S and T or directly on the rows and columns of S and T. It is interesting to note that work by Frédérique Kazi-Aoual and colleagues has shown that the mean and the variance of the permutation test distribution can be approximated directly from S and T.

The first step is to derive an index of the dimensionality or rank of the matrices. This index, denoted β_S (for matrix S = XX^T), is also known as ν in the brain imaging literature, where it is called a sphericity index and is used as an estimation of the number of degrees of freedom for multivariate tests of the general linear model. This index depends on the eigenvalues of the S matrix, denoted λ_ℓ, and is defined as

\[
\beta_S = \frac{\mathrm{trace}\{S\}^2}{\mathrm{trace}\{SS\}} = \frac{\Bigl(\sum_\ell^L \lambda_\ell\Bigr)^2}{\sum_\ell^L \lambda_\ell^2}. \tag{12}
\]
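The wine-rating example can be reproduced numerically. A sketch (our variable names; the permutation count is reduced from the entry's 10,000 for speed): the quoted value φ = .7381 is obtained when the columns of X and Y are first centered, as is done explicitly for the scalar product matrices later in this entry; the RV coefficient for the same data and the sphericity index β_S of Equation 12 can be checked at the same time.

```python
import numpy as np

# Expert ratings of six wines on three scales (Equation 11).
X = np.array([[1, 6, 7], [5, 3, 2], [6, 1, 1],
              [7, 1, 2], [2, 5, 4], [3, 4, 4]], float)
Y = np.array([[3, 6, 7], [4, 4, 3], [7, 1, 1],
              [2, 2, 2], [2, 6, 6], [1, 7, 5]], float)

def congruence(A, B):
    # Cosine between the matrices unfolded into 18-element vectors (Eq. 1).
    return (A * B).sum() / np.sqrt((A ** 2).sum() * (B ** 2).sum())

Xc = X - X.mean(axis=0)          # center each column
Yc = Y - Y.mean(axis=0)
phi = congruence(Xc, Yc)         # about .7381

# Permutation test: permute the rows of one matrix and recompute phi.
rng = np.random.default_rng(0)
perm = np.array([congruence(Xc, Yc[rng.permutation(6)])
                 for _ in range(2_000)])
p_value = (perm >= phi).mean()   # the entry reports p = .0259

# Scalar product matrices and the RV coefficient for the same data.
S, T = Xc @ Xc.T, Yc @ Yc.T
rv_value = np.trace(S @ T) / np.sqrt(np.trace(S @ S) * np.trace(T @ T))

# Sphericity index of Equation 12.
beta_S = np.trace(S) ** 2 / np.trace(S @ S)
```

With these data the script gives φ ≈ .7381, R_V ≈ .7936, and β_S ≈ 1.0954, matching the values reported in this entry.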
The mean of the set of permuted coefficients between matrices S and T is then equal to

\[
E(R_V) = \frac{\sqrt{\beta_S\,\beta_T}}{I-1}. \tag{13}
\]

The case of the variance is more complex and involves computing three preliminary quantities for each matrix. The first quantity, denoted δ_S (for matrix S), is equal to

\[
\delta_S = \frac{\sum_i^I s_{i,i}^2}{\sum_\ell^L \lambda_\ell^2}. \tag{14}
\]

The second one is denoted α_S for matrix S and is defined as

\[
\alpha_S = I - 1 - \beta_S. \tag{15}
\]

The third one is denoted C_S (for matrix S) and is defined as

\[
C_S = \frac{(I-1)\bigl[I(I+1)\delta_S - (I-1)(\beta_S + 2)\bigr]}{\alpha_S (I-3)}. \tag{16}
\]

With these notations, the variance of the permuted coefficients is obtained as

\[
V(R_V) = \alpha_S\,\alpha_T \times \frac{2I(I-1) + (I-3)C_S C_T}{I(I+1)(I-2)(I-1)^3}. \tag{17}
\]

With this mean and variance, a Z criterion can be computed for the observed value of the coefficient:

\[
Z_{R_V} = \frac{R_V - E(R_V)}{\sqrt{V(R_V)}}. \tag{18}
\]

The problem of the lack of normality of the permutation-based sampling distribution of the RV coefficient has been addressed by Moonseong Heo and K. Ruben Gabriel, who have suggested "normalizing" the sampling distribution by using a log transformation. Recently Julie Josse, Jérôme Pagès, and François Husson have refined this approach and indicated that a gamma distribution would give an even better approximation.

Example

As an example, we use the two scalar product matrices obtained from the matrices used to illustrate the congruence coefficient (cf. Equation 11). For the present example, these original matrices are centered (i.e., the mean of each column has been subtracted from each element of the column) prior to computing the scalar product matrices. Specifically, if \bar{X} and \bar{Y} denote the centered matrices derived from X and Y, we obtain the following scalar product matrices:

\[
S = \bar{X}\bar{X}^T =
\begin{bmatrix}
29.56 & -8.78 & -20.78 & -20.11 & 12.89 & 7.22 \\
-8.78 & 2.89 & 5.89 & 5.56 & -3.44 & -2.11 \\
-20.78 & 5.89 & 14.89 & 14.56 & -9.44 & -5.11 \\
-20.11 & 5.56 & 14.56 & 16.22 & -10.78 & -5.44 \\
12.89 & -3.44 & -9.44 & -10.78 & 7.22 & 3.56 \\
7.22 & -2.11 & -5.11 & -5.44 & 3.56 & 1.89
\end{bmatrix} \tag{19}
\]

and

\[
T = \bar{Y}\bar{Y}^T =
\begin{bmatrix}
11.81 & -3.69 & -15.19 & -9.69 & 8.97 & 7.81 \\
-3.69 & 1.81 & 7.31 & 1.81 & -3.53 & -3.69 \\
-15.19 & 7.31 & 34.81 & 9.31 & -16.03 & -20.19 \\
-9.69 & 1.81 & 9.31 & 10.81 & -6.53 & -5.69 \\
8.97 & -3.53 & -16.03 & -6.53 & 8.14 & 8.97 \\
7.81 & -3.69 & -20.19 & -5.69 & 8.97 & 12.81
\end{bmatrix}. \tag{20}
\]

\[
R_V = \frac{\sum_i^I \sum_j^I s_{i,j}\, t_{i,j}}{\sqrt{\Bigl(\sum_i^I \sum_j^I s_{i,j}^2\Bigr)\Bigl(\sum_i^I \sum_j^I t_{i,j}^2\Bigr)}}
= \frac{(29.56 \times 11.81) + (-8.78 \times -3.69) + \cdots + (1.89 \times 12.81)}{\sqrt{\bigl[(29.56)^2 + (-8.78)^2 + \cdots + (1.89)^2\bigr]\bigl[(11.81)^2 + (-3.69)^2 + \cdots + (12.81)^2\bigr]}} = .7936. \tag{21}
\]

To test the significance of a value of R_V = .7936, we first compute the following quantities:

\[
\begin{aligned}
\beta_S &= 1.0954 & \alpha_S &= 3.9046 \\
\delta_S &= 0.2951 & C_S &= -1.3162 \\
\beta_T &= 1.3851 & \alpha_T &= 3.6149 \\
\delta_T &= 0.3666 & C_T &= -0.7045
\end{aligned} \tag{22}
\]

Plugging these values into Equations 13, 17, and 18, we find

\[
E(R_V) = 0.2464, \quad V(R_V) = 0.0422, \quad\text{and}\quad Z_{R_V} = 2.66. \tag{23}
\]

Assuming a normal distribution for the Z_{R_V} gives a p value of .0077, which would allow for the rejection of the null hypothesis for the observed value of the RV coefficient.

Permutation Test

As an alternative approach to evaluate whether the value of R_V = .7936 is significantly different from zero, a permutation test with 10,000 permutations was performed. In this test, the whole set of rows and columns (i.e., the same permutation of I elements is used to permute rows and columns) of one of the scalar product matrices was randomly permuted, and the RV coefficient was computed for each of these 10,000 permutations. The probability of obtaining a value of R_V = .7936 under the null hypothesis was evaluated as the proportion of the RV coefficients larger than R_V = .7936. This gave a value of p = .0281, which is small enough to reject the null hypothesis at the .05 alpha level. It is worth noting that the normal approximation gives a more liberal (i.e., smaller) value of p than does the nonparametric permutation test (which is more accurate in this case because the sampling distribution of RV is not normal).

Mantel Coefficient

The exact sampling distribution of the Mantel coefficient is not known. Numerical simulations suggest that, when the distance matrices originate from different independent populations, the sampling distribution of the Mantel coefficient is symmetric (though not normal) with a zero mean. In fact, Mantel, in his original paper, presented some approximations for the variance of the sampling distributions of r_M (derived from the permutation test) and suggested that a normal approximation could be used, but the problem is still open. In practice, though, the probability associated to a specific value of r_M is derived from permutation tests.

Example

As an example, two distance matrices derived from the congruence coefficient example (cf. Equation 11) are used. These distance matrices can be computed directly from the scalar product matrices used to illustrate the computation of the RV coefficient (cf. Equations 19 and 20). Specifically, if S is a scalar product matrix and if s denotes the vector containing the diagonal elements of S, and if 1 denotes an I by 1 vector of ones, then the matrix D
of the squared Euclidean distances between the elements of S is obtained as (cf. Equation 4):

\[
D = 1s^T + s1^T - 2S. \tag{24}
\]

Using Equation 24, we transform the scalar-product matrices from Equations 19 and 20 into the following distance matrices:

\[
D = \begin{bmatrix}
0 & 50 & 86 & 86 & 11 & 17 \\
50 & 0 & 6 & 8 & 17 & 9 \\
86 & 6 & 0 & 2 & 41 & 27 \\
86 & 8 & 2 & 0 & 45 & 29 \\
11 & 17 & 41 & 45 & 0 & 2 \\
17 & 9 & 27 & 29 & 2 & 0
\end{bmatrix} \tag{25}
\]

and

\[
T = \begin{bmatrix}
0 & 21 & 77 & 42 & 2 & 9 \\
21 & 0 & 22 & 9 & 17 & 22 \\
77 & 22 & 0 & 27 & 75 & 88 \\
42 & 9 & 27 & 0 & 32 & 35 \\
2 & 17 & 75 & 32 & 0 & 3 \\
9 & 22 & 88 & 35 & 3 & 0
\end{bmatrix}. \tag{26}
\]

For computing the Mantel coefficient, the upper diagonal elements of each of these two matrices are stored into a vector of ½ I × (I − 1) = 15 elements, and the standard coefficient of correlation is computed between these two vectors. This gives a value of the Mantel coefficient of r_M = .5769. In order to evaluate whether this value is significantly different from zero, a permutation test with 10,000 permutations was performed. In this test, the whole set of rows and columns (i.e., the same permutation of I elements is used to permute rows and columns) of one of the matrices was randomly permuted, and the Mantel coefficient was computed for each of these 10,000 permutations. The probability of obtaining a value of r_M = .5769 under the null hypothesis was evaluated as the proportion of the Mantel coefficients larger than r_M = .5769. This gave a value of p = .0265, which is small enough to reject the null hypothesis at the .05 alpha level.

Conclusion

The congruence, RV, and Mantel coefficients all measure slightly different aspects of the notion of congruence. The congruence coefficient is sensitive to the pattern of similarity of the columns of the matrices and therefore will not detect similar configurations when one of the configurations is rotated or dilated. By contrast, both the RV coefficient and the Mantel coefficients are sensitive to the whole configuration and are insensitive to changes in configuration that involve rotation or dilatation. The RV coefficient has the additional merit of being theoretically linked to most multivariate methods and of being the base of Procrustes methods such as statis or distatis.

Hervé Abdi

See also Coefficients of Correlation, Alienation, and Determination; Principal Components Analysis; R²; Sampling Distributions

Further Readings

Abdi, H. (2003). Multivariate analysis. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia for research methods for the social sciences. Thousand Oaks, CA: Sage.
Abdi, H. (2007). RV coefficient and congruence coefficient. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics. Thousand Oaks, CA: Sage.
Bedeian, A. G., Armenakis, A. A., & Randolph, W. A. (1988). The significance of congruence coefficients: A comment and statistical test. Journal of Management, 14, 559–566.
Borg, I., & Groenen, P. (1997). Modern multidimensional scaling. New York: Springer Verlag.
Broadbooks, W. J., & Elmore, P. B. (1987). A Monte Carlo study of the sampling distribution of the congruence coefficient. Educational & Psychological Measurement, 47, 1–11.
Burt, C. (1948). Factor analysis and canonical correlations. British Journal of Psychology, Statistical Section, 1, 95–106.
Escoufier, Y. (1973). Le traitement des variables vectorielles [Treatment of vector variables]. Biometrics, 29, 751–760.
Harman, H. H. (1976). Modern factor analysis (3rd ed. rev.). Chicago: University of Chicago Press.
Heo, M., & Gabriel, K. R. (1998). A permutation test of association between configurations by means of the RV coefficient. Communications in Statistics: Simulation & Computation, 27, 843–856.
Holmes, S. (2007). Multivariate analysis: The French way. In Probability and statistics: Essays in honor of David A. Freedman (Vol. 2, pp. 219–233). Beachwood, OH: Institute of Mathematical Statistics.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Korth, B. A., & Tucker, L. R. (1975). The distribution of chance congruence coefficients from simulated data. Psychometrika, 40, 361–372.
Manly, B. J. F. (1997). Randomization, bootstrap and Monte Carlo methods in biology (2nd ed.). London: Chapman and Hall.
Schlich, P. (1996). Defining and validating assessor compromises about product distances and attribute correlations. In T. Näs & E. Risvik (Eds.), Multivariate analysis of data in sensory sciences (pp. 259–306). New York: Elsevier.
Smouse, P. E., Long, J. C., & Sokal, R. R. (1986). Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology, 35, 627–632.
Worsley, K. J., & Friston, K. J. (1995). Analysis of fMRI time-series revisited—again. NeuroImage, 2, 173–181.

CONSTRUCT VALIDITY

Construct validity refers to whether the scores of a test or instrument measure the distinct dimension (construct) they are intended to measure. The present entry discusses origins and definitions of construct validation, methods of construct validation, the role of construct validity evidence in the validity argument, and unresolved issues in construct validity.

Origins and Definitions

Construct validation generally refers to the collection and application of validity evidence intended to support the interpretation and use of test scores as measures of a particular construct. The term construct denotes a distinct dimension of individual variation, but use of this term typically carries the connotation that the construct does not allow for direct observation but rather depends on indirect means of measurement. As such, the term construct differs from the term variable with respect to this connotation. Moreover, the term construct is sometimes distinguished from the term latent variable because construct connotes a substantive interpretation typically embedded in a body of substantive theory. In contrast, the term latent variable refers to a dimension of variability included in a statistical model with or without a clear substantive or theoretical understanding of that dimension and thus can be used in a purely statistical sense. For example, the latent traits in item response theory analysis are often introduced as latent variables but not associated with a particular construct until validity evidence supports such an association.

The object of validation has evolved with validity theory. Initially, validation was construed in terms of the validity of a test. Lee Cronbach and others pointed out that validity depends on how a test is scored. For example, detailed content coding of essays might yield highly valid scores whereas general subjective judgments might not. As a result, validity theory shifted its focus from validating tests to validating test scores. In addition, it became clear that the same test scores could be used in more than one way and that the level of validity could vary across uses. For example, the same test scores might offer a highly valid measure of intelligence but only a moderately valid indicator of attention deficit/hyperactivity disorder. As a result, the emphasis of validity theory again shifted from test scores to test score interpretations. Yet a valid interpretation often falls short of justifying a particular use. For example, an employment test might validly measure propensity for job success, but another available test might do as good a job at the same cost but with less adverse impact. In such an instance, the validity of the test score interpretation for the first test would not justify its use for employment testing. Thus, Samuel Messick has urged that test scores are rarely interpreted in a vacuum as a purely academic exercise but are rather collected for some purpose and put to some use. However, in common parlance, one frequently expands the notion of test to refer to the entire procedure of collecting test data (testing), assigning numeric values based on the test data (scoring), making inferences about the level of a construct on the basis of those scores (interpreting), and applying those inferences to practical decisions (use). Thus the term test validity lives on as shorthand for the validity of test score interpretations and uses.

Early on, tests were thought to divide into two types: signs and samples. If a test was interpreted as a sign of something else, the something else was understood as a construct, and construct validation was deemed appropriate. For example, responses to items on a personality inventory might be viewed as
signs of personality characteristics, in which case the personality characteristic constitutes the construct of interest. In contrast, some tests were viewed as only samples and construct validation was not deemed necessary. For example, a typing test might sample someone's typing and assess its speed and accuracy. The scores on this one test (produced from a sampling of items that could appear on a test) were assumed to generalize merely on the basis of statistical generalization from a sample to a population. Jane Loevinger and others questioned this distinction by pointing out that the test sample could never be a random sample of all possible exemplars of the behavior in question. For example, a person with high test anxiety might type differently on a typing test from the way the person types at work, and someone else might type more consistently on a brief test than over a full workday. As a result, interpreting the sampled behavior in terms of the full range of generalization always extends beyond mere statistical sampling to broader validity issues. For this reason, all tests are signs as well as samples, and construct validation applies to all tests.

At one time, test validity was neatly divided into three types: content, criterion, and construct, with the idea that one of these three types of validity applied to any one type of test. However, criterion-related validity depends on the construct interpretation of the criterion, and test fairness often turns on construct-irrelevant variance in the predictor scores. Likewise, content validation may offer valuable evidence in support of the interpretation of correct answers but typically will not provide as strong a line of evidence for the interpretation of incorrect answers. For example, someone might know the mathematical concepts but answer a math word problem incorrectly because of insufficient vocabulary or culturally inappropriate examples. Because all tests involve interpretation of the test scores in terms of what they are intended to measure, construct validation applies to all tests. In contemporary thinking, there is a suggestion that all validity should be of one type, construct validity.

This line of development has led to unified (but not unitary) conceptions of validity that elevate construct validity from one kind of validity among others to the whole of validity. Criterion-related evidence provides evidence of construct validity by showing that test scores relate to other variables (i.e., criterion variables) in the predicted ways. Content validity evidence provides evidence of construct validity because it shows that the test properly covers the intended domain of content related to the construct definition. As such, construct validity has grown from humble origins as one relatively esoteric form of validity to the whole of validity, and it has come to encompass other forms of validity evidence.

Messick distinguished two threats to construct validity. Construct deficiency applies when a test fails to measure some aspects of the construct that it should measure. For example, a mathematics test that failed to cover some portion of the curriculum for which it was intended would demonstrate this aspect of poor construct validity. In contrast, construct-irrelevant variance involves things that the test measures that are not related to the construct of interest and thus should not affect the test scores. The example of a math test that is sensitive to vocabulary level illustrates this aspect. A test with optimal construct validity therefore measures everything that it should measure but nothing that it should not.

Traditionally, validation has been directed toward a specific test, its scores, and their intended interpretation and use. However, construct validation increasingly conceptualizes validation as continuous with extended research programs into the construct measured by the test or tests in question. This shift reflects a broader shift in the behavioral sciences away from operationalism, in which a variable is theoretically defined in terms of a single operational definition, in favor of multioperationalism, in which a variety of different measures triangulate on the same construct. As a field learns to measure a construct in various ways and learns more about how the construct relates to other variables through evidence collected using these measures, the overall understanding of the construct increases. The stronger this overall knowledge base about the construct, the more confidence one can have in interpreting the scores derived from a particular test as measuring this construct. Moreover, the more one knows about the construct, the more specific and varied are the consequences entailed by interpreting test scores as measures of that construct. As a result, one can conceptualize construct validity as broader than test validity because it involves the collection of evidence to validate theories about the underlying construct as measured by
a variety of tests, rather than merely the interpretation of scores from one particular test.

Construct Validation Methodology

At its inception, when construct validity was considered one kind of validity appropriate to certain kinds of tests, inspection of patterns of correlations offered the primary evidence of construct validity. Lee Cronbach and Paul Meehl described a nomological net as a pattern of relationships between variables that partly fixed the meaning of a construct. Later, factor analysis established itself as a primary methodology for providing evidence of construct validity. Loevinger described a structural aspect of construct validity as the pattern of relationships between items that compose a test. Factor analysis allows the researcher to investigate the internal structure of item responses, and some combination of replication and confirmatory factor analysis allows the researcher to test theoretical hypotheses about that structure. Such hypotheses typically involve multiple dimensions of variation tapped by items on different subscales and therefore measuring different constructs.

Latent class analysis offers a similar measurement model based on the same basic assumption but applicable to situations in which the latent variable is itself categorical. All three methods typically offer tests of goodness of fit based on the assumption of local independence and the ability of the modeled latent variables to account for the relationships among the item responses.

An important aspect of the above types of evidence involves the separate analysis of various scales or subscales. Analyzing each scale separately does not provide evidence as strong as does analyzing them together. This is because separate analyses work only with local independence of items on the same scale. Analyzing multiple scales combines this evidence with evidence based on relationships between items on different scales. So, for example, three subscales might each fit a one-factor model very well, but a three-factor model might fail miserably when applied to all three sets of items together. Under a hypothetico-deductive framework, testing the stronger hypothesis of multiconstruct local independence offers more support to interpretations of sets of items that pass it than does testing a weaker piecemeal set of hypotheses.
factor may reflect a more general construct that The issue just noted provides some interest in
comprises these subscale constructs. returning to the earlier notion of a nomological net
Item response theory typically models dichoto- as a pattern of relationships among variables in
mous or polytomous item responses in relation to which the construct of interest is embedded. The
an underlying latent trait. Although item response idea of a nomological net arose during a period
theory favors the term trait, the models apply to when causation was suspect and laws (i.e., nomic
all kinds of constructs. Historically, the emphasis relationships) were conceptualized in terms of pat-
with item response theory has been much more terns of association. In recent years, causation has
heavily on unidimensional measures and providing made a comeback in the behavioral sciences, and
evidence that items in a set all measure the same methods of modeling networks of causal relations
dimension of variation. However, recent develop- have become more popular. Path analysis can be
ments in factor analysis for dichotomous and poly- used to test hypotheses about how a variable fits
tomous items, coupled with expanded interest in into such a network of observed variables, and thus
multidimensional item response theory, have path analysis provides construct validity evidence
brought factor analysis and item response theory for test scores that fit into such a network as pre-
together under one umbrella. Item response theory dicted by the construct theory. Structural equation
models are generally equivalent to a factor analysis models allow the research to combine both ideas
model with a threshold at which item responses by including both measurement models relating
change from one discrete response to another items to latent variables (as in factor analysis) and
based on an underlying continuous dimension. structural models that embed the latent variables in
Both factor analysis and item response theory a causal network (as in path analysis). These
depend on a shared assumption of local indepen- models allow researchers to test complex hypothe-
dence, which means that if one held constant the ses and thus provide even stronger forms of
underlying latent variable, the items would no lon- construct validity evidence. When applied to pas-
ger have any statistical association between them. sively observed data, however, such causal models
232 Construct Validity
contain no magic formula for spinning causation cognitive processing involved, one can manipulate
out of correlation. Different models will fit the various cognitive subtasks required to answer
same data, and the same model will fit data gener- items and predict the difficulty of the resulting
ated by different causal mechanisms. Nonetheless, items from these manipulations.
such models allow researchers to construct highly
falsifiable hypotheses from theories about the con- Role in Validity Arguments
struct that they seek to measure.
Complementary to the above, experimental and Modern validity theory generally structures the eval-
quasi-experimental evidence also plays an impor- uation of validity on the basis of various strands of
tant role in assessing construct validity. If a test evidence in terms of the construction of a validity
measures a given construct, then efforts to manip- argument. The basic idea is to combine all available
ulate the value of the construct should result in evidence into a single argument supporting the
changes in test scores. For example, consider intended interpretation and use of the test scores.
a standard program evaluation study that demon- Recently, Michael Kane has distinguished an inter-
strates a causal effect of a particular training pro- pretive argument from the validity argument. The
gram on performance of the targeted skill set. If interpretive argument spells out the assumptions
the measure of performance is well validated and and rationale for the intended interpretation of the
the quality of the training is under question, then scores, and the validity argument supports the valid-
this study primarily provides evidence in support ity of the interpretive argument, particularly by pro-
of the training program. In contrast, however, if viding evidence in support of key assumptions. For
the training program is well validated but the per- example, an interpretive argument might indicate
formance measure is under question, then the same that an educational performance mastery test
study primarily provides evidence in support of the assumes prior exposure and practice with the mate-
construct validity of the measure. Such evidence rial. The validity argument might then provide evi-
can generally be strengthened by showing that the dence that given these assumptions, test scores
intervention affects the variables that it should but correspond to the degree of mastery.
also does not affect the variables that it should The key to developing an appropriate validity
not. Showing that a test is responsive to manipula- argument rests with identifying the most important
tion of a variable that should not affect it offers and controversial premises that require evidential
one way of demonstrating construct-irrelevant var- support. Rival hypotheses often guide this process.
iance. For example, admissions tests sometimes The two main threats to construct validity
provide information about test-taking skills in an described above yield two main types of rival
effort to minimize the responsiveness of scores to hypotheses addressed by construct validity evi-
further training in test taking. dence. For example, sensitivity to transient emo-
Susan Embretson distinguished construct repre- tional states might offer a rival hypothesis to the
sentation from nomothetic span. The latter refers validity of a personality scale related to construct-
to the external patterns of relationships with other irrelevant variance. Differential item functioning,
variables and essentially means the same thing as in which test items relate to the construct differ-
nomological net. The former refers to the cognitive ently for different groups of test takers, also relates
processes involved in answering test items. To the to construct-irrelevant variance, yielding rival
extent that answering test items involves the hypotheses about test scores related to group char-
intended cognitive processes, the construct is prop- acteristics. A rival hypothesis that a clinical depres-
erly represented, and the measurements have sion inventory captures only one aspect of
higher construct validity. As a result, explicitly depressive symptoms involves a rival hypothesis
modeling the cognitive operation involved in about construct deficiency.
answering specific item types has blossomed as
a means of evaluating construct validity, at least in
Unresolved Issues
areas in which the underlying cognitive mechan-
isms are well understood. As an example, if one A central controversy in contemporary validity
has a strong construct theory regarding the theory involves the disagreements over the breadth
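The local independence assumption that factor analysis and item response theory share, described under Construct Validation Methodology above, can be illustrated with a short simulation. Everything below is an illustrative assumption rather than anything specified in the entry: two hypothetical items load equally on one simulated latent variable with unit normal noise, and conditioning on (partialling out) that latent variable should drive the items' association to roughly zero.

```python
import random
from math import sqrt

# Hypothetical setup: two items that both measure one latent variable (theta).
# Loadings, noise distribution, and sample size are arbitrary illustrative choices.
rng = random.Random(0)
n = 20_000
theta = [rng.gauss(0, 1) for _ in range(n)]           # latent variable
item1 = [t + rng.gauss(0, 1) for t in theta]          # item = theta + unique noise
item2 = [t + rng.gauss(0, 1) for t in theta]

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def residualize(ys, xs):
    """Remove the linear dependence of ys on xs (simple regression residuals)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return [y - slope * (x - mx) for x, y in zip(xs, ys)]

# Marginally the items correlate because they share theta; holding theta
# constant, the association vanishes -- this is local independence.
marginal_r = corr(item1, item2)
partial_r = corr(residualize(item1, theta), residualize(item2, theta))
print(f"marginal r = {marginal_r:.2f}")   # substantial (about .5 in this setup)
print(f"partial r  = {partial_r:.2f}")    # near zero given theta
```

In the same spirit, a latent class analogue would replace the continuous theta with a categorical group label; the check on residual association within levels of the latent variable is the same idea.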
Content Analysis 233
analysis but make particular demands on the technique that are not found as problematic in other methods of inquiry.

The reference to text is not intended to restrict content analysis to written material. The parenthetical phrase "or other meaningful matter" is intended to imply content analysis's applicability to anything humanly significant: images, works of art, maps, signs, symbols, postage stamps, songs, and music, whether mass produced, created in conversations, or private. Texts, whether composed by individual authors or produced by social institutions, are always intended to point their users to something beyond their physicality. However, content analysis does not presume that readers read a text as intended by its source; in fact, authors may be quite irrelevant, often unknown. In content analysis, available texts are analyzed to answer research questions not necessarily shared by everyone.

What distinguishes content analysis from most observational methods in the social sciences is that the answers to its research questions are inferred from available text. Content analysts are not interested in the physicality of texts that can be observed, measured, and objectively described. The alphabetical characters of written matter, the pixels of digital images, and the sounds one can manipulate at a control panel are mere vehicles of communication. What text means to somebody, what it represents, highlights and excludes, encourages or deters—all these phenomena do not reside inside a text but come to light in processes of someone's reading, interpreting, analyzing, concluding, and, in the case of content analysis, answering pertinent research questions concerning the text's context of use.

Typical research questions that content analysts might answer are, What are the consequences for heavy and light viewers of exposure to violent television shows? What are the attitudes of a writer on issues not mentioned? Who is the author of an anonymously written work? Is a suicide note real, requiring intervention, or an empty threat? Which of two textbooks is more readable by sixth graders? What is the likely diagnosis for a psychiatric patient, known through an interview or the responses to a Rorschach test? What is the ethnic, gender, or ideological bias of a newspaper? Which economic theory underlies the reporting of business news in the national press? What is the likelihood of cross-border hostilities as a function of how one country's national press portrays its neighbor? What are a city's problems as inferred from citizens' letters to its mayor? What do school children learn about their nation's history through textbooks? What criteria do Internet users employ to authenticate electronic documents?

Other Conceptions

Unlike content analysis, observation and measurement go directly to the phenomenon of analytic interest. Temperature and population statistics describe tangible phenomena. Experiments with human participants tend to define the range of responses in directly analyzable form, just as structured interviews delineate the interviewees' multiple choices among answers to prepared interview questions. Structured interviews and experiments with participants acknowledge subjects' responses to meanings but bypass them by standardization. Content analysts struggle with unstructured meanings.

Social scientific literature does contain conceptions of content analysis that mimic observational methods, such as those of George A. Miller, who characterizes content analysis as a method for putting large numbers of units of verbal matter into analyzable categories. A definition of this kind provides no place for methodological standards. Berelson's widely cited definition fares not much better. For him, "content analysis is a research technique for the objective, systematic and quantitative description of the manifest content of communication" (p. 18). The restriction to manifest content would rule out content analyses of psychotherapeutic matter or of diplomatic exchanges, both of which tend to rely on subtle clues to needed inferences. The requirement of quantification, associated with objectivity, has been challenged, especially because the reading of text is qualitative to start and interpretive research favors qualitative procedures without being unscientific. Taking the questionable attributes out of Berelson's definition reduces content analysis to the systematic analysis of content, which relies on a metaphor of content that locates the object of analysis inside the text—a conception that some researchers believe is not only misleading but also prevents the formulation of sound methodology.
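Miller's characterization above, putting large numbers of units of verbal matter into analyzable categories, can be sketched in a few lines of code, with the crudest possible category: the word type itself. The two-sentence transcript below is invented for illustration, not quoted from any source.

```python
from collections import Counter

# Invented transcript; lowercasing and stripping punctuation is the minimal
# normalization most word-count tools apply before counting.
transcript = "I think my life is my life. People really like me, and I like people."
words = [w.strip(".,").lower() for w in transcript.split()]
counts = Counter(words)

# Frequency list, the first analyzable description many computer aids produce.
for word, n in counts.most_common(4):
    print(word, n)
```

A count like this discards the textual environments of the words; it categorizes verbal matter without yet supporting any inference about its meaning, which is the limitation the rest of the entry takes up.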
something, and is useful or effective, though not necessarily as content analysts conceptualize these things.

Description of Text

Usually, the first step in a content analysis is a description of the text. Mary Bock called content analyses that stop there "impressionistic" because they leave open what a description could mean. Three types of description may be distinguished: (1) selected word counts, (2) categorizations by common dictionaries or thesauri, and (3) recording or scaling by human coders.

Selected Word Counts

Selected word counts can easily be obtained mechanically and afford numerous comparisons by sources or situations or over time. For example, the 12 most frequent words uttered by Paris Hilton in an interview with Larry King were 285 I, 66 you, 61 my, 48 like, 45 yes, 44 really, 40 me, 33 I'm, 32 people, 28 they, 17 life and time, and 16 jail. That I is by far the most frequent word may suggest that the interviewee talked largely about herself and her own life, which incidentally included a brief visit in jail. Such a distribution of words is interesting not only because normally one does not think about words when listening to conversations but also because its skewness is quite unusual and invites explanations. But whether Hilton is self-centered, whether her response was due to Larry King's questioning, how this interview differed from others he conducted, and what the interview actually revealed to the television audience remain speculation. Nevertheless, frequencies offer an alternative to merely listening or observing.

Many computer aids to content analysis start with words, usually omitting function words, such as articles, stemming them by removing grammatical endings, or focusing on words of particular interest. In that process, the textual environments of words are abandoned or, in the case of keywords-in-context lists, significantly reduced.

Categorizing by Common Dictionaries or Thesauri

Categorization by common dictionaries or thesauri is based on the assumptions that (a) textual meanings reside in words, not in syntax and organization; (b) meanings are shared by everyone—"manifest," in Berelson's definition—as implied in the use of published dictionaries and thesauri; and (c) certain differentiations among word meanings can be omitted in favor of the gist of semantic word classes. Tagging texts is standard in several computer aids for content analysis. The General Inquirer software, for example, assigns the words I, me, mine, and myself to the tag "self" and the tags "self," "selves," and "others" to the second-order tag "person." Where words are ambiguous, such as play, the General Inquirer looks for disambiguating words in the ambiguous word's environment—looking, in the case of play, for example, for words relating to children and toys, musical instruments, theatrical performances, or work—and thereby achieves a less ambiguous tagging. Tagging is also used to scale favorable or unfavorable attributes or assign positive and negative signs to references.

Recording or Scaling by Human Coders

Recording or scaling by human coders is the traditional and by far the most common path taken to obtain analyzable descriptions of text. The demand for content analysis to be reliable is met by standard coding instructions, which all coders are asked to apply uniformly to all units of analysis. Units may be words, propositions, paragraphs, news items, or whole publications of printed matter; scenes, actors, episodes, or whole movies in the visual domain; or utterances, turns taken, themes discussed, or decisions made in conversations.

The use of standard coding instructions offers content analysts not only the possibility of analyzing larger volumes of text and employing many coders but also a choice between emic and etic descriptions—emic by relying on the very categories that a designated group of readers would use to describe the textual matter, etic by deriving coding categories from the theories of the context that the content analysts have adopted. The latter choice enables content analysts to describe latent contents and approach phenomena that ordinary writers and readers may not be aware of. "Good" and "bad" are categories nearly everyone understands alike, but "prosocial" and "antisocial" attitudes, the concept of framing, or the idea of a numerical strength of word associations needs to be carefully defined, exemplified, and tested for reliability.

Inference

Abduction

Although sampling considerations are important in selecting texts for analysis, the type of inference that distinguishes content analysis from observational methods is abduction—not induction or deduction. Abduction proceeds from particulars—texts—to essentially different particulars—the answers to research questions. For example, inferring the identity of the author from textual qualities of an unsigned work; inferring levels of anxiety from speech disturbances; inferring a source's conceptualization from the proximity of words it uses; inferring Stalin's successor from public speeches by Politburo members on the occasion of Stalin's birthday; or inferring possible solutions to a conflict entailed by the metaphors used in characterizing that conflict.

Analytical Constructs

Inferences of this kind require some evidential support that should stem from the known, assumed, theorized, or experimentally confirmed stable correlations between the textuality as described and the set of answers to the research question under investigation. Usually, this evidential support needs to be operationalized into a form applicable to the descriptions of available texts and interpretable as answers to the research questions. Such operationalizations can take numerous forms. By intuition, one may equate a measure of the space devoted to a topic with the importance a source attributes to it. The relation between different speech disturbances and the diagnosis of certain psychopathologies may be established by correlation. The relation between the proximity of words and associations, having been experimentally confirmed, may be operationalized in clustering algorithms that compute word clusters from strings of words.

While the evidential support for the intended inferences can come from anywhere, content analysts cannot bypass justifying this step. It would be methodologically inadmissible to claim to have analyzed "the" content of a certain news channel, as if no inference were made or as if content were contained in its transmissions, alike for everyone, including content analysts. It is equally inadmissible to conclude from applying a standard coding instrument and sound statistics on reliably coded data that the results of a content analysis say anything about the many worlds of others. They may represent nothing other than the content analyst's systematized conceptions.

Regarding the analytical construct, content analysts face two tasks, preparatory and applied. Before designing a content analysis, researchers may need to test or explore available evidence, including theories of the stable relations on grounds of which the use of analytical constructs can be justified. After processing the textual data, the inferences tendered will require similar justifications.

Interpretation

The result of an inference needs to be interpreted so as to select among the possible answers to the given research question. In identifying the author of an unsigned document, one may have to translate similarities between signed and unsigned documents into probabilities associated with conceivable authors. In predicting the use of a weapon system from enemy domestic propaganda, one may have to extrapolate the fluctuations of mentioning it into a set of dates. In ascertaining gender biases in educational material, one may have to transform the frequencies of gender references and their evaluation into weights of one gender over another.

Interpreting inferences in order to select among alternative answers to a research question can be quite rigorous. Merely testing hypotheses on the descriptive accounts of available texts stays within the impressionistic nature of these descriptions and has little to do with content analysis.

Criteria for Judging Results

There are essentially three conditions for judging the acceptability of content analysis results. In the absence of direct validating evidence for the
inferences that content analysts make, there remain reliability and plausibility.

Reliability

Reliability is the ability of the research process to be replicated elsewhere. It assures content analysts that their data are rooted in shared ground and other researchers that they can figure out what the reported findings mean or add their own data to them. Traditionally, the most unreliable part of a content analysis is the recording, categorization, or scaling of text by human coders, and content analysts employing coders for this purpose are required to assess the reliability of that process quantitatively. Measures of reliability are provided by agreement coefficients with suitable reliability interpretations, such as Scott's π (pi) and Krippendorff's α (alpha). The literature contains recommendations regarding the minimum agreement required for an analytical process to be sufficiently reliable. However, that minimum should be derived from the consequences of answering the research question incorrectly. Some disagreements among coders may not make a difference, but others could direct the process to a different result.

Validity

In content analysis, validity may be demonstrated variously. The preferred validity is predictive, matching the answers to the research question with subsequently obtained facts. When direct and post facto validation is not possible, content analysts may need to rely on indirect evidence. For example, when inferring the psychopathology of a historical figure, accounts by the person's contemporaries, actions on record, or comparisons with today's norms may be used to triangulate the inferences. Similarly, when military intentions are inferred from the domestic broadcasts of wartime enemies, such intentions may be correlated with observable consequences or remain on record, allowing validation at a later time. Correlative validity is demonstrated when the results of a content analysis correlate with other variables. Structural validity refers to the degree to which the analytical construct employed adequately models the stable relations underlying the inferences, and functional validity refers to the history of the analytical construct's successes. Semantic validity concerns the validity of the description of textual matter relative to a designated group of readers, and sampling validity concerns the representativeness of the sampled text. Unlike in observational research, texts need to be sampled in view of their ability to provide the answers to research questions, not necessarily to represent the typical content produced by their authors.

Klaus Krippendorff

CONTENT VALIDITY

Content validity refers to the extent to which the items on a test are fairly representative of the entire domain the test seeks to measure. This entry discusses origins and definitions of content validation, methods of content validation, the role of content validity evidence in validity arguments, and unresolved issues in content validation.

Origins and Definitions

One of the strengths of content validation is the simple and intuitive nature of its basic idea, which holds that what a test seeks to measure constitutes a content domain and the items on the test should sample from that domain in a way that makes the test items representative of the entire domain. Content validation methods seek to assess this quality of the items on a test. Nonetheless, the underlying theory of content validation is fraught with controversies and conceptual challenges.

At one time, different forms of validation, and indeed validity, were thought to apply to different types of tests. Florence Goodenough made an influential distinction between tests that serve as samples and tests that serve as signs. From this view, personality tests offer the canonical example of tests as signs because personality tests do not sample from a domain of behavior that constitutes the personality variable but rather serve to indicate an underlying personality trait. In contrast, educational achievement tests offer the canonical example of tests as samples because the items sample from a knowledge or skill domain, operationally defined in terms of behaviors that demonstrate the corresponding knowledge or skill that the test measures achievement in. For example, if an addition test contains items representative of all combinations of single digits, then it may adequately represent addition of single-digit numbers, but it would not adequately represent addition of numbers with more than one digit.

Jane Loevinger and others have argued that the above distinction does not hold up because all tests actually function as signs. The inferences drawn from test scores always extend beyond the test-taking behaviors themselves, but it is impossible for the test to include anything beyond test-taking behaviors. Even work samples can extend only to samples of work gathered within the testing procedure (as opposed to portfolios, which lack the standardization of testing procedures). To return to the above example, one does not use an addition test to draw conclusions only about answering addition items on a test but seeks to generalize to the ability to add in contexts outside addition tests.

At the heart of the above issue lies the paradigmatic shift from discrete forms of validity, each appropriate to one kind of test, to a more unified approach to test validation. The term content validity initially differentiated one form of validity from criterion validity (divisible into concurrent validity and predictive validity, depending on the timing of the collection of the criterion data) and construct validity (which initially referred primarily to the pattern of correlations with other variables, the nomological net, and to the pattern of association between the scores on individual items within the test). Each type of validity arose from a set of practices that the field developed to address a particular type of practical application of test use. Content validity was the means of validating tests used to sample a content domain and evaluate mastery within that domain. The unified view of validity initiated by Jane Loevinger and Lee Cronbach, and elaborated by Samuel Messick, sought to forge a single theory of test validation that subsumed these disparate practices.

The basic practical concern involved the fact that assessment of the representativeness of the content domain achieved by a set of items does not provide a sufficient basis to evaluate the soundness of inferences from scores on the test. For example, a student correctly answering arithmetic items at a level above chance offers stronger support for the conclusion that he or she can do the arithmetic involved than the same student failing to correctly answer the items offers for the conclusion that he or she cannot do the arithmetic. It may be that the student can correctly calculate 6 divided by 2 but has not been exposed to the 6/2 notation used in the test items. In another context, a conscientious employee might be rated low on a performance scale because the items involve tasks that are important to and representative of the domain of conscientious work behaviors, but opportunities for which come up extremely rarely in the course of routine work (e.g., reports defective equipment when encountered). Similarly, a test with highly representative items might have inadequate reliability or other deficiencies that reduce the validity of inferences from its scores. The traditional approach to dividing up types of validity and categorizing tests with respect to the appropriate type of validation tends in practice to encourage reliance on just one kind of validity evidence for a given test. Because just one type alone, including content-related validation evidence, does not suffice to underwrite the use of a test, the unified view sought to discourage such categorical typologies of either tests or validity types and replace these with validation methods that combined different forms of evidence for the validity of the same test.

As Stephen Sireci and others have argued, the problem with the unified approach with respect to content validation stems directly from this effort to improve on inadequate test validation practices. A central ethos of unified approaches involves the rejection of a simple checklist approach to validation in which completion of a fixed set of steps results in a permanently validated test that requires no further research or evaluation. As an antidote to this checklist conception, Michael Kane and others elaborated the concept of a validity argument. The basic idea was that test validation involves building an argument that combines multiple lines of evidence into the overall evaluation of a use or interpretation of scores derived from a test. To avoid a checklist, the argument approach leaves it open to the test validator to exercise judgment and select the lines of evidence that are most appropriate in a given instance. This generally involves selecting the premises of the validation argument that bear the most controversy and for which empirical support can be gathered within practical constraints on what amounts to a reasonable effort. One would not waste resources gathering empirical evidence for claims that no one would question. Similarly, one would not violate ethical standards in order to validate a test of neural functioning by damaging various portions of the cortex in order to experimentally manipulate the variable with random assignment. Nor would …

…validation efforts that exclude content validation where it could provide an important and perhaps necessary line of support. These considerations have led to proposals to modify the argument approach to validation in ways that make content-related evidence necessary or at least strongly recommended for tests based on sampling from a content domain.

Contemporary approaches to content validation typically distinguish various aspects of content validity. A clear domain definition is foundational for all the other aspects of content validity because without a clear definition of the domain, test developers, test users, or anyone attempting to do validation research has no basis for a clear assessment of the remaining aspects. This aspect of content validation closely relates to the emphasis in the Standards for Educational and Psychological Testing on clearly defining the purpose of a test as the first step in test validation.

A second aspect of content validity, domain relevance, draws a further connection between content validation and the intended purpose of the test. Once the domain has been defined, domain relevance describes the degree to which the defined domain bears importance to the purpose of the test. For example, one could imagine a test that does a very good job of sampling the skills required to greet visitors, identify whom they wish to see, schedule appointments, and otherwise exercise the judgment and complete the tasks required of an effective receptionist. However, if the test use involves selecting applicants for a back office secretarial position that does not involve serving as a receptionist, then the test would not have good domain relevance for the intended purpose. This aspect of content validation relates to a quality of the defined domain independent of how well the test taps that domain.

In contrast, domain representation does not
one waste resources on an enormous and costly evaluate the defined domain but rather evaluates
effort to test one assumption if those resources the effectiveness with which the test samples that
could be better used to test several others in a less domain. Clearly, this aspect of content validation
costly fashion. In short, the validity argument depends on the previous two. Strong content rep-
approach to test validation does not specify that resentation does not advance the quality of a test if
any particular line of evidence is required of the items represent a domain with low relevance.
a validity argument. As a result, an effort to dis- Furthermore, even if the items do represent
courage reliance on content validation evidence a domain well, the test developer has no effective
alone may have swung the pendulum too far in the means of ascertaining that fact without a clear
opposite direction by opening the door to domain definition. Domain representation can
suffer in two ways: Items on the test may fail to sample some portion of the test domain, in which case the validity of the test suffers as a result of construct underrepresentation. Alternatively, the test might contain items from outside the test domain, in which case these items introduce construct-irrelevant variance into the test total score. It is also possible that the test samples all and only the test domain but does so in a way that overemphasizes some areas of the domain while underemphasizing other areas. In such a case, the items sample the entire domain but in a nonrepresentative manner. An example would be an addition test where 75% of the items involved adding only even numbers and no odd numbers.

An additional aspect of content validation involves clear, detailed, and thorough documentation of the test construction procedures. This aspect of content validation reflects the epistemic aspect of modern test validity theory: Even if a test provides an excellent measure of its intended construct, test users cannot justify the use of the test unless they know that the test provides an excellent measure. Test validation involves justifying an interpretation or use of a test, and content validation involves justifying the test domain and the effectiveness with which the test samples that domain. Documentation of the process leading to the domain definition and generation of the item pool provides a valuable source of content-related validity evidence. One primary element of such documentation, the test blueprint, specifies the various areas of the test domain and the number of items from each of those areas. Documentation of the process used to construct the test in keeping with the specified test blueprint thereby plays a central role in evaluating the congruency between the test domain and the items on the test.

The earlier passages of this entry have left open the question of whether content validation refers only to the items on the test or also to the processes involved in answering those items. Content validation has its origins in a time when tests as the object of validation were not yet clearly distinguished from test scores or test score interpretations. As such, most early accounts focused on the items rather than the processes involved in answering them. Understood this way, content validation focuses on qualities of the test rather than qualities of test scores or interpretations. However, as noted above, even this test-centered approach to content validity remains relative to the purpose for which one uses the test. Domain relevance depends on this purpose, and the purpose of the test should ideally shape the conceptualization of the test domain. However, focus on just the content of the items allows for a broadening of content validation beyond the conception of a test as measuring a construct conceptualized as a latent variable representing a single dimension of variation. It allows, for instance, for a test domain that spans a set of tasks linked in another way but heterogeneous in the cognitive processes involved in completing them. An example might be the domain of tasks associated with troubleshooting a complex piece of technology such as a computer network. No one algorithm or process might serve to troubleshoot every problem in the domain, but content validation held separate from response processes can nonetheless apply to such a test.

In contrast, the idea that content validity applies to response processes existed as a minority position for most of the history of content validation, but it has close affinities both to the unified notion of validation as an overall evaluation based on the sum of the available evidence and to cognitive approaches to test development and validation. Whereas representativeness of the item content bears more on a quality of the stimulus materials, representativeness of the response processes bears more on an underlying individual differences variable as a property of the person tested. Susan Embretson has distinguished construct representation, involving the extent to which items require the cognitive processes that the test is supposed to measure, from nomothetic span, which is the extent to which the test bears the expected patterns of association with other variables (what Cronbach and Paul Meehl called a nomological network). The former involves content validation applied to processes, whereas the latter involves methods more closely associated with criterion-related validation and construct validation methods.

Content Validation Methodology

Content-related validity evidence draws heavily from the test development process. The content domain should be clearly defined at the start of
this process, item specifications should be justified in terms of this domain definition, item construction should be guided and justified by the item specifications, and the overall test blueprint that assembles the test from the item pool should also be grounded in and justified by the domain definition. Careful documentation of each of these processes provides a key source of validity evidence.

A standard method for assessing content validity involves judgments by subject matter experts (SMEs) with expertise in the content of the test. Two or more SMEs rate each item, although large or diverse tests may require different SMEs for different items. Ratings typically involve domain relevance or importance of the content in individual test items. Good items have high means and low standard deviations, indicating high agreement among raters. John Flanagan introduced a critical incident technique for generating and evaluating performance-based items. C. H. Lawshe, Lewis Aiken, and Ronald Hambleton each introduced quantitative measures of agreement for use with content validation research. Victor Martuza introduced a content validity index, which has generated a body of research in the nursing literature. A number of authors have also explored multivariate methods for investigating and summarizing SME ratings, including factor analysis and multidimensional scaling methods. Perhaps not surprisingly, the results can be sensitive to the approach taken to structuring the judgment task.

Statistical analysis of item scores can also be used to evaluate content validity by showing that the content domain theory is consistent with the clustering of items into related sets of items by some statistical criteria. These methods include factor analysis, multidimensional scaling, and cluster analysis. Applied to content validation, these methods overlap to some degree with construct validation methods directed toward the internal structure of a test. Test developers most often combine such methods with methods based on SME ratings to lessen the interpretational ambiguity of the statistical results.

A growing area of test validation related to content involves cognitive approaches to modeling the processes involved in answering specific item types. Work by Embretson and Robert Mislevy exemplifies this approach, and such approaches focus on the construct representation aspect of test validity described above. This methodology relies on a strong cognitive theory of how test takers process test items and thus applies best when item response strategies are relatively well understood and homogeneous across items. The approach sometimes bears a strong relation to the facet analysis methods of Louis Guttman in that item specifications describe and quantify a variety of item attributes, and these can be used to predict features of item response patterns such as item difficulty. This approach bears directly on content validity because it requires a detailed theory relating how items are answered to what the items measure. Response process information can also be useful in extrapolating from the measured content domain to broader inferences in applied testing, as described in the next section.

Role in Validity Arguments

At one time, the dominant approach was to identify certain tests as the type of test to which content validation applies and rely on content validity evidence for the evaluation of such tests. Currently, few if any scholars would advocate sole reliance on content validity evidence for any test. Instead, content-related evidence joins with other evidence to support key inferences and assumptions in a validity argument that combines various sources of evidence to support an overall assessment of the test score interpretation and use.

Kane has suggested a two-step approach in which one first constructs an argument for test score interpretations and then evaluates that argument with a test validity argument. He has suggested a general structure involving four key inferences to which content validity evidence can contribute support. First, the prescribed scoring method involves an inference from observed test-taking behaviors to a specific quantification intended to contribute to measurement through an overall quantitative summary of the test takers' responses. Second, test score interpretation involves generalization from the observed test score to the defined content domain sampled by the test items. Third, applied testing often involves a further inference that extrapolates from the measured content domain to a broader domain of inference that the test does not fully sample. Finally, most applied testing involves a final set of
inferences from the extrapolated level of performance to implications for actions and decisions applied to a particular test taker who earns a particular test score.

Interpretation of statistical models used to provide criterion- and construct-related validity evidence would generally remain indeterminate were it not for the grounding of test score interpretations provided by content-related evidence. While not a fixed foundation for inference, content-related evidence provides a strong basis for taking one interpretation of a nomothetic structure as more plausible than various rival hypotheses. As such, content-related validity evidence continues to play an important role in test development and complements other forms of validity evidence in validity arguments.

Unresolved Issues

As validity theory continues to evolve, a number of issues in content validation remain unresolved. For instance, the relative merits of restricting content validation to test content or expanding it to involve item response processes warrant further attention. A variety of aspects of content validity have been identified, suggesting a multidimensional attribute of tests, but quantitative assessments of content validity generally emphasize single-number summaries. Finally, the ability to evaluate content validity in real time with computer-adaptive testing remains an active area of research.

Keith A. Markus and Kellie M. Smith

See also Construct Validity; Criterion Validity

Further Readings

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Crocker, L. M., Miller, D., & Franks, E. A. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2, 179–194.

Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.

Kane, M. (2006). Content-related validity evidence in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131–153). Mahwah, NJ: Lawrence Erlbaum.

McKenzie, J. F., Wood, M. L., Kotecki, J. E., Clark, J. K., & Brey, R. A. (1999). Establishing content validity: Using qualitative and quantitative steps. American Journal of Health Behavior, 23, 311–318.

Popham, W. J. (1992). Appropriate expectations for content judgments regarding teacher licensure tests. Applied Measurement in Education, 5, 285–301.

Sireci, S. (1998). The construct of content validity. Social Indicators Research, 45, 83–117.

CONTRAST ANALYSIS

A standard analysis of variance (ANOVA) provides an F test, which is called an omnibus test because it reflects all possible differences between the means of the groups analyzed by the ANOVA. However, most experimenters want to draw conclusions more precise than "the experimental manipulation has an effect on participants' behavior." Precise conclusions can be obtained from contrast analysis because a contrast expresses a specific question about the pattern of results of an ANOVA. Specifically, a contrast corresponds to a prediction precise enough to be translated into a set of numbers called contrast coefficients, which reflect the prediction. The correlation between the contrast coefficients and the observed group means directly evaluates the similarity between the prediction and the results.

When performing a contrast analysis, one needs to distinguish whether the contrasts are planned or post hoc. Planned, or a priori, contrasts are selected before running the experiment. In general, they reflect the hypotheses the experimenter wants to test, and there are usually few of them. Post hoc, or a posteriori (after the fact), contrasts are decided after the experiment has been run. The goal of a posteriori contrasts is to ensure that unexpected results are reliable.

When performing a planned analysis involving several contrasts, one needs to evaluate whether these contrasts are mutually orthogonal or not. Two contrasts are orthogonal when their contrast coefficients are uncorrelated (i.e., their coefficient cross-products sum to zero).
α[PC] ≈ α[PF] / C.   (3)

Sidák and Bonferroni are related by the inequality 1 – (1 – α[PC])^C ≤ C α[PC], so the Sidák correction gives the tighter bound.

Table 3   ANOVA Table for a Replication of Smith's (1979) Experiment

Source         df   SS         MS       F         Pr(F)
Experimental    4     700.00   175.00   5.469**   .00119
Error          45   1,440.00    32.00
Total          49   2,140.00

Source: Adapted from Smith (1979).
Note: **p ≤ .01.

Table 4   Orthogonal Contrasts for the Replication of Smith (1979)

Contrast   Group 1   Group 2   Group 3   Group 4   Group 5   ΣCa
ψ1           +2        –3        +2        +2        –3       0
ψ2           +2         0        –1        –1         0       0
ψ3            0         0        +1        –1         0       0
ψ4            0        +1         0         0        –1       0

Source: Adapted from Smith (1979).

…they learned the list. The new room was located in a different part of the campus, painted grey, and looked very austere.

3. Imaginary context. Participants were tested in the same room as participants from Group 2. In addition, they were told to try to remember the room in which they learned the list. In order to help them, the experimenter asked them several questions about the room and the objects in it.

4. Photographed context. Participants were placed in the same condition as Group 3, and in addition, they were shown photos of the orange room in which they learned the list.

5. Placebo context. Participants were in the same condition as participants in Group 2. In addition, before starting to try to recall the words, they were asked to perform a warm-up task, namely, to try to remember their living room.

• Research Hypothesis 1. Groups for which the context at test matches the context during learning (i.e., is the same or is simulated by imaging or photography) will perform better than groups with different or placebo contexts.

• Research Hypothesis 2. The group with the same context will differ from the group with imaginary or photographed contexts.

• Research Hypothesis 3. The imaginary context group differs from the photographed context group.

• Research Hypothesis 4. The different context group differs from the placebo group.

Contrasts

The four research hypotheses are easily transformed into statistical hypotheses. For example, the first research hypothesis is equivalent to stating the following null hypothesis:

The means of the population for Groups 1, 3, and 4 have the same value as the means of the population for Groups 2 and 5.

This is equivalent to contrasting Groups 1, 3, and 4, on one hand, and Groups 2 and 5, on the other. This first contrast is denoted ψ1:

ψ1 = 2μ1 – 3μ2 + 2μ3 + 2μ4 – 3μ5.

The null hypothesis to be tested is

H0: ψ1 = 0.

Note that the sum of the coefficients Ca is zero, as it should be for a contrast. Table 4 shows all four contrasts.

Are the Contrasts Orthogonal?

Now the problem is to decide whether the contrasts constitute an orthogonal family. We check that every pair of contrasts is orthogonal by using Equation 7. For example, Contrasts 1 and 2 are orthogonal because
Σ (a = 1 to A = 5) Ca,1 Ca,2 = (2 × 2) + (–3 × 0) + (2 × –1) + (2 × –1) + (–3 × 0) = 0.

F test

The sum of squares and Fψ for a contrast are computed from Equations 8 and 10. For example, the steps for the computation of SSψ1 are given in Table 5.

Table 5   Steps for the Computation of SSψ1 of Smith (1979)

Group   Ma      Ca   CaMa     Ca²
1       18.00   +2   +36.00    4
2       11.00   –3   –33.00    9
3       17.00   +2   +34.00    4
4       19.00   +2   +38.00    4
5       10.00   –3   –30.00    9
Sum              0   +45.00   30

Source: Adapted from Smith (1979).

SSψ1 = S (Σ Ca Ma)² / Σ Ca² = 10 × (45.00)² / 30 = 675.00,
MSψ1 = 675.00,
Fψ1 = MSψ1 / MSerror = 675.00 / 32.00 = 21.094.   (11)

The significance of a contrast is evaluated with a Fisher distribution with 1 and A(S – 1) = 45 degrees of freedom, which gives a critical value of 4.06 for α = .05 (7.23 for α = .01). The sums of squares for the remaining contrasts are SSψ2 = 0, SSψ3 = 20, and SSψ4 = 5, each with 1 and A(S – 1) = 45 degrees of freedom. Therefore, ψ2, ψ3, and ψ4 are nonsignificant. Note that the sums of squares of the contrasts add up to SSexperimental. That is,

SSexperimental = SSψ1 + SSψ2 + SSψ3 + SSψ4 = 675.00 + 0.00 + 20.00 + 5.00 = 700.00.

When the contrasts are orthogonal, the degrees of freedom are added the same way as the sums of squares are. This explains why the maximum number of orthogonal contrasts is equal to the number of degrees of freedom of the experimental sum of squares.

A Priori Nonorthogonal Contrasts

So orthogonal contrasts are relatively straightforward because each contrast can be evaluated on its own. Nonorthogonal contrasts, however, are more complex. The main problem is to assess the importance of a given contrast conjointly with the other contrasts. There are currently two main approaches to this problem. The classical approach corrects for multiple statistical tests (e.g., using a Sidák or Bonferroni correction) but essentially evaluates each contrast as if it were coming from a set of orthogonal contrasts. The multiple regression (or modern) approach evaluates each contrast as a predictor from a set of nonorthogonal predictors and estimates its specific contribution to the explanation of the dependent variable. The classical approach evaluates each contrast for itself, whereas the multiple regression approach evaluates each contrast as a member of a set of contrasts and estimates the specific contribution of each contrast in this set. For an orthogonal set of contrasts, the two approaches are equivalent.

The Classical Approach

Some problems are created by the use of multiple nonorthogonal contrasts. The most important one is that the greater the number of contrasts, the greater the risk of a Type I error. The general strategy adopted by the classical approach to this problem is to correct for multiple testing.

Sidák and Bonferroni Corrections

When a family's contrasts are nonorthogonal, Equation 10 gives a lower bound for α[PC]. So, instead of having the equality, the following inequality, called the Sidák inequality, holds:

α[PF] ≤ 1 – (1 – α[PC])^C.   (12)

This inequality gives an upper bound for α[PF], and therefore the real value of α[PF] is smaller than its estimated value.

As earlier, we can approximate the Sidák inequality by Bonferroni as

α[PF] < C α[PC].   (13)
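The computations above (the orthogonality check of Equation 7, the contrast sum of squares and F of Equation 11, and the Sidák and Bonferroni bounds of Equations 12 and 13) can be reproduced with a short script. This is an illustrative sketch in plain Python, not part of the original entry; the group means, the group size S = 10, and MSerror = 32.00 are taken from the Smith replication tables above.

```python
# Contrast analysis for the Smith (1979) replication data.
# Inputs are read off Tables 3-5 of the entry.

S = 10                                        # participants per group
means = [18.00, 11.00, 17.00, 19.00, 10.00]   # M_a for Groups 1-5
ms_error = 32.00                              # MS_error from the ANOVA table

psi_1 = [+2, -3, +2, +2, -3]                  # contrast coefficients (Table 4)
psi_2 = [+2,  0, -1, -1,  0]

def is_contrast(c):
    """A contrast's coefficients must sum to zero."""
    return sum(c) == 0

def orthogonal(c1, c2):
    """Two contrasts are orthogonal when the sum of their
    coefficient cross-products is zero (Equation 7)."""
    return sum(a * b for a, b in zip(c1, c2)) == 0

def ss_contrast(c, means, S):
    """SS_psi = S * (sum C_a M_a)**2 / sum C_a**2 (Equation 8)."""
    num = sum(ca * ma for ca, ma in zip(c, means))
    return S * num ** 2 / sum(ca ** 2 for ca in c)

print(is_contrast(psi_1), orthogonal(psi_1, psi_2))   # True True

ss1 = ss_contrast(psi_1, means, S)            # 675.0
f1 = ss1 / ms_error                           # MS_psi has 1 df, so F = SS / MS_error
print(ss1, round(f1, 3))                      # 675.0 21.094

# Sidak and Bonferroni bounds for C = 4 contrasts at alpha[PC] = .05:
C, alpha_pc = 4, 0.05
sidak_bound = 1 - (1 - alpha_pc) ** C         # upper bound on alpha[PF], Eq. 12
bonferroni_bound = C * alpha_pc               # looser bound, Eq. 13
print(round(sidak_bound, 4), bonferroni_bound)   # 0.1855 0.2
```

Because MSψ has a single degree of freedom, Fψ is simply SSψ / MSerror, and the printed values match Equation 11 and Table 5.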
Table 6   Nonorthogonal Contrasts for the Replication of Smith (1979)

Contrast   Group 1   Group 2   Group 3   Group 4   Group 5   ΣCa
ψ1            2        –3         2         2        –3       0
ψ2            3         3        –2        –2        –2       0
ψ3            1        –4         1         1         1       0

Source: Adapted from Smith (1979).

Table 7   Fψ Values for the Nonorthogonal Contrasts From the Replication of Smith (1979)

Contrast   rY·ψ     r²Y·ψ    Fψ        p(Fψ)
ψ1          .9820    .9643    21.0937   < .0001
ψ2         –.1091    .0119     0.2604    .6123
ψ3          .5345    .2857     6.2500    .0161

Source: Adapted from Smith (1979).

…regression analysis as the number of degrees of freedom of the independent variable. An obvious choice for the predictors is to use a set of contrast coefficients. Doing so makes contrast analysis a particular case of multiple regression analysis.

…ψ1 and the dependent variable with the effects of ψ2 and ψ3 partialled out. To evaluate the significance of each contrast, we compute an F ratio for the corresponding semipartial coefficients of correlation. This is done with the following formula: …
…the amount of instruction experienced is held constant between the groups.

Matching procedures are similar to yoking, but the matching occurs on characteristics of the participant, not the experience of the participant during the study. In other words, in matching, participants in the control group are matched with participants in the experimental group so that the two groups have similar backgrounds. Participants are often matched on variables such as age, gender, and socioeconomic status. Both yoked and matched control groups are used so that participants are as similar as possible.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis for field settings. Boston: Houghton Mifflin.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

World Medical Association. (2002). Declaration of Helsinki: Ethical principles for medical research involving human subjects. Journal of Postgraduate Medicine, 48, 206–208.
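The individual matching procedure described above can be sketched in code. This is a hypothetical illustration: the participant records, the choice of gender and age as matching variables, and the 2-year age caliper are all invented for the example, and the greedy first-match loop is only the simplest variant of matching.

```python
# Sketch of individual matching: pair each experimental participant with an
# unused control candidate of the same gender whose age is within a caliper.
# Data and caliper width are hypothetical, for illustration only.

experimental = [
    {"id": "E1", "age": 24, "gender": "F"},
    {"id": "E2", "age": 31, "gender": "M"},
]
controls = [
    {"id": "C1", "age": 33, "gender": "M"},
    {"id": "C2", "age": 25, "gender": "F"},
    {"id": "C3", "age": 40, "gender": "F"},
]

def match(experimental, controls, caliper=2):
    """Greedy one-to-one matching on gender (exact) and age (caliper)."""
    pairs, used = [], set()
    for e in experimental:
        for c in controls:
            if (c["id"] not in used
                    and c["gender"] == e["gender"]
                    and abs(c["age"] - e["age"]) <= caliper):
                pairs.append((e["id"], c["id"]))
                used.add(c["id"])   # each control is matched at most once
                break
    return pairs

print(match(experimental, controls))   # [('E1', 'C2'), ('E2', 'C1')]
```

Real matching designs typically require a much larger control pool than experimental group (as the entry notes, feasibility is a limitation) and may match on further variables such as socioeconomic status.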
CONTROL VARIABLES
…control variables that are involved in a statistical interaction with other variables in the study are special cases that must be considered separately. This entry discusses the use of control variables during the design and analysis stages of a study.

Design Stage

There are several options for the use of control variables at the design stage. In the example about rates of reaction mentioned earlier, the intention was to draw conclusions, at the end of the series of experiments, regarding the relationship between the reaction rates and the various reagents. If the investigator did not keep the temperature constant among the series of experiments, any difference in the rate of reaction found at the conclusion of the study may have had nothing to do with the different reagents but instead be due solely to differences in temperature or to some combination of reagent and temperature. Restricting or specifying a narrow range of values for one or more potential confounders is frequently done in the design stage of the study, taking into consideration several factors, including ease of implementation, convenience, simplified analysis, and expense. A limitation on restriction may be an inability to infer the relationship between the restricted potential confounder and the outcome and exposure. In addition, residual bias may occur, owing to incomplete control (referred to as residual confounding).

Matching is a concept related to restriction. Matching is the process of making the study group and control group similar with regard to potential confounds. Several different methods can be employed, including frequency matching, category matching, individual matching, and caliper matching. As with restriction, the limitations of matching include the inability to draw inferences about the control variable(s). Feasibility can be an issue, given that a large pool of subjects may be required to find matches. In addition, the potential for residual confounding exists.

Both matching and restriction can be applied in the same study design for different control variables.

The Analysis Stage

There are several options for the use of control variables at the analysis stage. Separate analysis can be undertaken for each level of a potential confounder. Within each unique value (or homogeneous stratum) of the potential confounder, the relationship of interest may be observed in a way that is not influenced by differences between exposed and unexposed individuals attributable to the potential confounder. This technique is another example of restriction.

Estimates of the relationship of interest independent of the potential confounder can also be achieved by the use of a matched or stratified approach in the analysis. The estimate of interest is calculated at all levels (or several theoretically homogeneous or equivalent strata) of the potential confounder, and a weighted, average effect across strata is estimated. Techniques of this kind include the Mantel–Haenszel stratified analysis, as well as stratified (also called matched or conditional) regression analyses. These approaches typically assume that the stratum-specific effects are not different (i.e., no effect modification or statistical interaction is present). Limitations of this method are related to the various ways strata can be formed for the various potential confounders, and one may end up with small sample sizes in many strata, and therefore the analysis may not produce a reliable result.

The most common analytic methods for using control variables are analysis of covariance and multiple generalized linear regression modeling. Regression techniques estimate the relationship of interest conditional on a fixed value of the potential confounder, which is analogous to holding the value of the potential confounder constant at the level of the third variable. By default, model parameters (intercept and beta coefficients) are interpreted as though potential confounders were held constant at their zero values. Multivariable regression is relatively efficient at handling small numbers and easily combines variables measured on different scales.

Where the potential control variable in question is involved as part of a statistical interaction with an exposure variable of interest, holding the control variable constant at a single level through restriction (in either the design or analysis) will allow estimation of an effect of the exposure of interest on the outcome that is independent of the third variable, but the effect measured applies only to (or is conditional on) the selected level of the
potential confounder. This would also be the stratum-specific or conditional effect. For example, restriction of an experiment to one gender would give the investigator a gender-specific estimate of effect.

If the third variable in question is part of a true interaction, the other forms of control, which permit multiple levels of the third variable to remain in the study (e.g., through matching, statistical stratification, or multiple regression analysis), should be considered critically before being applied. Each of these approaches ignores the interaction and may serve to mask its presence.

Jason D. Pole and Susan J. Bondy

…that class, but also to those students enrolled in biology and all their characteristics. However, in spite of any shortcomings, convenience sampling is still an effective tool to use in pilot settings, when instruments may still be under development and interventions are yet to be fully designed and approved.

Neil J. Salkind

See also Cluster Sampling; Experience Sampling Method; Nonprobability Sampling; Probability Sampling; Proportional Sampling; Quota Sampling; Random Sampling; Sampling; Sampling Error; Stratified Sampling; Systematic Sampling

‘‘CONVERGENT AND DISCRIMINANT VALIDATION BY THE MULTITRAIT–MULTIMETHOD MATRIX’’
measuring the construct (multiple operationalism in contrast to single operationalism). These two measures should strongly correlate but differ from measures that were created to assess different traits.

Campbell and Fiske distinguished between four aspects of the validation process that can be analyzed by means of the multitrait–multimethod (MTMM) matrix. First, convergent validity is proven by the correlation of independent measurement procedures for measuring the same trait. Second, new measures of a trait should show low correlations with measures of other traits from which they should differ (discriminant validity). Third, each test is a trait–method unit. Consequently, interindividual differences in test scores can be due to measurement features, as well as to the content of the trait. Fourth, in order to separate method- from trait-specific influences, and to analyze discriminant validity, more than one trait and more than one method have to be considered in the validation process.

Convergent Validity

Convergent validity evidence is obtained if the correlations of independent measures of the same trait (monotrait–heteromethod correlations) are significantly different from 0 and sufficiently large. Convergent validity differs from reliability in the type of methods considered. Whereas reliability is proven by correlations of maximally similar methods of a trait (monotrait–monomethod correlations), the proof of convergent validity is the stronger, the more independent the methods are. For example, the reliability of a self-report extroversion questionnaire can be analyzed by the correlations of two test halves of this questionnaire (split-half reliability), whereas the convergent validity of the questionnaire can be scrutinized by its correlation with a peer report of extroversion. According to Campbell and Fiske, independence of methods is a matter of degree, and they consider reliability and validity as points on a continuum from reliability to validity. Monotrait–heteromethod correlations that do not significantly differ from 0 or are relatively low could indicate that one of the two measures, or even both measures, do not appropriately measure the trait (low convergent validity). However, a low correlation could also show that the two different measures assess different components of a trait that are functionally different. For example, a low correlation between an observational measure of anger and the self-reported feeling component of anger could indicate individuals who regulated their visible anger expression. In this case, a low correlation would not indicate that the self-report is an invalid measure of the feeling component and that the observational measure is an invalid indicator of overt anger expression. Instead, the two measures could be valid measures of the two different components of the anger episode that they are intended to measure, and different methods may be necessary to appropriately assess these different components. It is also recommended that one consider traits that are as independent as possible. If two traits are considered independent, the heterotrait–monomethod correlations should be 0. Differences from 0 indicate the degree of a common method effect.

Discriminant Validity

Discriminant validity evidence is obtained if the correlations of variables measuring different traits are low. If two traits are considered independent, the correlations of the measures of these traits should be 0. Discriminant validity requires that both the heterotrait–heteromethod correlations (e.g., the correlation of a self-report measure of extroversion and a peer report measure of neuroticism) and the heterotrait–monomethod correlations (e.g., correlations between self-report measures of extroversion and neuroticism) be small. These heterotrait correlations should also be smaller than the monotrait–heteromethod correlations (e.g., self- and peer report correlations of extroversion) that indicate convergent validity. Moreover, the patterns of correlations should be similar for the monomethod and the heteromethod correlations of different traits.

Impact

According to Robert J. Sternberg, Campbell and Fiske's article is the most often cited paper ever published in Psychological Bulletin and is one of the most influential publications in psychology. In an overview of then-available MTMM matrices, Campbell and Fiske concluded that almost none of these matrices fulfilled the criteria they described. Campbell and Fiske considered the validation process as an iterative process that leads to better methods for measuring psychological constructs, and they hoped that their criteria would contribute to the development of better methods. However, 33 years later, in 1992, they concluded that the published MTMM matrices were still unsatisfactory and that many theoretical and methodological questions remained unsolved. Nevertheless, their article has had an enormous influence on the development of more advanced statistical methods for analyzing the MTMM matrix, as well as on the refinement of the validation process in many areas of psychology.

Michael Eid

See also Construct Validity; MBESS; Multitrait–Multimethod Matrix; Triangulation; Validity of Measurement

Copula Functions

In statistics, a copula is a function that links an n-dimensional cumulative distribution function to its one-dimensional margins and is itself a continuous distribution function characterizing the dependence structure of the model.

Recently, in multivariate modeling, much attention has been paid to copulas or copula functions. It can be shown that outside the elliptical world, correlation cannot be used to characterize the dependence between two series. To say it differently, the knowledge of two marginal distributions and the correlation does not determine the bivariate distribution of the underlying series. In this context, the only dependence function able to summarize all the information about the comovements of the two series is a copula function. Indeed, a multivariate distribution is fully and uniquely characterized by its marginal distributions and its dependence structure as represented by the copula.
Definition and Properties

Conversely, if C is a copula and F and G are distribution functions, then the function H defined by Equation 1 is a joint distribution function with margins F and G. A multivariate version of this theorem exists and is presented hereafter.

Sklar's Theorem in n Dimensions

For any multivariate distribution function F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n) with continuous marginal functions F_i(x_i) = P(X_i \le x_i) for 1 \le i \le n, there exists a unique function C(u_1, u_2, \ldots, u_n), called the copula and defined on [0,1]^n \to [0,1], such that for all (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n,

F(x_1, x_2, \ldots, x_n) = C(F_1(x_1), F_2(x_2), \ldots, F_n(x_n)). \quad (2)

In contrast to the traditional modeling approach that decomposes the joint density as a product of marginal and conditional densities, Equation 3 states that, under appropriate conditions, the joint density can be written as a product of the marginal densities and the copula density. From Equation 3, it is clear that the density c(u_1, u_2, \ldots, u_n) encodes information about the dependence structure among the X_i's, and the f_i's describe the marginal behaviors. It thus shows that copulas represent a way to extract the dependence structure from the joint distribution and to extricate the dependence and marginal behaviors. Hence, copula functions offer more flexibility in modeling multivariate random variables. This flexibility contrasts with the traditional use of the multivariate normal distribution, in which the margins are assumed to be Gaussian and linked through a linear correlation matrix.
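Sklar's construction in Equation 2 can be sketched numerically. Assuming, for illustration only, two exponential margins and the independence (product) copula C(u, v) = uv, the joint CDF built through the copula factors into its margins, and each margin is recovered by letting the other argument grow large:

```python
import math

# Hypothetical margins (not from the entry): X ~ Exp(1), Y ~ Exp(2),
# linked by the product copula C(u, v) = u * v (independence).

def F(x, rate=1.0):          # marginal CDF of X
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def G(y, rate=2.0):          # marginal CDF of Y
    return 1.0 - math.exp(-rate * y) if y > 0 else 0.0

def product_copula(u, v):    # C(u, v) = u * v
    return u * v

def H(x, y):                 # Sklar: H(x, y) = C(F(x), G(y))
    return product_copula(F(x), G(y))

# Under independence the joint CDF factors into the margins ...
print(H(1.0, 0.5))                 # equals F(1.0) * G(0.5)
# ... and the margin is recovered as the other argument grows large.
print(abs(H(1.0, 50.0) - F(1.0)))  # close to 0
```

Swapping in any other copula for `product_copula` changes only the dependence structure; the margins F and G are untouched, which is exactly the separation Sklar's theorem expresses.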
bivariate normal variables with correlation coefficient ρ. With ρ = 0 we obtain a very important special case of the Gaussian copula, which takes the form C^⊥(u, v) = uv and is called the product copula. The importance of this copula is related to the fact that two variables are independent if and only if their copula is the product copula.

The Archimedean Copulas

The Archimedean copulas are characterized by their generator φ through the following equation:

C(u_1, u_2) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2)) \quad \text{for } u_1, u_2 \in [0, 1].

The following table presents three parametric Archimedean families that have gained interest in biostatistics, actuarial science, and management science, namely, the Clayton copula, the Gumbel copula, and the Frank copula.

Clayton: C(u_1, u_2) = \max\{(u_1^{-\theta} + u_2^{-\theta} - 1)^{-1/\theta},\, 0\}, with generator \varphi(t) = \frac{1}{\theta}(t^{-\theta} - 1), for \theta > 0.
Gumbel: C(u_1, u_2) = \exp\{-[(-\ln u_1)^{\theta} + (-\ln u_2)^{\theta}]^{1/\theta}\}, with generator \varphi(t) = (-\ln t)^{\theta}, for \theta \ge 1.
Frank: C(u_1, u_2) = -\frac{1}{\theta}\ln\left[1 + \frac{(e^{-\theta u_1} - 1)(e^{-\theta u_2} - 1)}{e^{-\theta} - 1}\right], with generator \varphi(t) = -\ln\frac{e^{-\theta t} - 1}{e^{-\theta} - 1}, for \theta > 0.

Each of the copulas presented in the preceding table is completely monotonic, and this allows for multivariate extension. If we assume that all pairs of random variables have the same φ and the same θ, the three copula functions can be extended by using the following relation:

C(u_1, u_2, \ldots, u_n) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2) + \cdots + \varphi(u_n)).

not jointly elliptically distributed, and using linear correlation as a measure of dependence in such situations might prove misleading. Two important measures of dependence, known as Kendall's tau and Spearman's rho, provide perhaps the best alternatives to the linear correlation coefficient as measures of dependence for nonelliptical distributions and can be expressed in terms of the underlying copula. Before presenting these measures and how they are related to copulas, we need to define the concordance concept.

Concordance

Let (x_1, y_1) and (x_2, y_2) be two observations from a vector (X, Y) of continuous random variables. Then (x_1, y_1) and (x_2, y_2) are said to be concordant if (x_1 - x_2)(y_1 - y_2) > 0 and discordant if (x_1 - x_2)(y_1 - y_2) < 0.

Kendall's tau

Kendall's tau for a pair (X, Y), distributed according to H, can be defined as the difference between the probabilities of concordance and discordance for two independent pairs (X_1, Y_1) and (X_2, Y_2), each with distribution H. This gives

\tau_{X,Y} = \Pr\{(X_1 - X_2)(Y_1 - Y_2) > 0\} - \Pr\{(X_1 - X_2)(Y_1 - Y_2) < 0\}.

The probabilities of concordance and discordance can be evaluated by integrating over the distribution of (X_2, Y_2). In terms of copulas, Kendall's tau becomes

\tau_{X,Y} = 4 \iint_{[0,1]^2} C(u, v)\, dC(u, v) - 1,
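The concordance-based definition of Kendall's tau has a direct sample analogue: count the concordant and discordant pairs of observations and divide their difference by the total number of pairs. A minimal sketch of that sample version (an illustration, not the population definition given above):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Sample Kendall's tau: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1    # pair (i, j) is concordant
        elif s < 0:
            discordant += 1    # pair (i, j) is discordant
    return (concordant - discordant) / (n * (n - 1) / 2)

# A strictly increasing relationship makes every pair concordant (tau = 1);
# reversing one variable makes every pair discordant (tau = -1).
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))   # 1.0
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))   # -1.0
```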
Since the observed scores have a variance of 1.00 (because of the condition imposed earlier) and because of Equation 6, the correlation between true scores is equal to the covariance between true scores [i.e., Cov(T_a, T_b) = r_ab]. Making this substitution into Equation 9:

r_{ab} = \frac{r_{ab}}{\sqrt{\mathrm{Var}(T_a) + \mathrm{Var}(e_a)}\,\sqrt{\mathrm{Var}(T_b) + \mathrm{Var}(e_b)}}, \quad (10)

where all terms are as defined earlier. Note that the same term, r_ab, appears on both sides of Equation 10. By definition, r_ab = r_ab; mathematically, Equation 10 can be true only if the denominator is equal to 1.0. Because it was defined earlier that the variance of observed scores equals 1.0, Equation 9 can hold true. This requirement is relaxed for Equation 11 with an additional substitution.

Suppose now that a measure was free from measurement error, as it would be at the construct level. At the construct level, true relationships among variables are being estimated. As such, Greek letters are used to denote these relationships. If the true relationship between variables A and B is to be estimated, Equation 10 becomes

\rho_{ab} = \frac{r_{ab}}{\sqrt{\mathrm{Var}(T_a) + \mathrm{Var}(e_a)}\,\sqrt{\mathrm{Var}(T_b) + \mathrm{Var}(e_b)}}, \quad (11)

where ρ_ab is the true correlation between variables A and B, as defined in Equation 1. Again, because variables are free from measurement error at the construct level (i.e., Var(e_a) = Var(e_b) = 0), Equation 11 becomes

\rho_{xy} = \frac{r_{xy}}{\sqrt{\mathrm{Var}(T_x)}\,\sqrt{\mathrm{Var}(T_y)}}. \quad (12)

Based on classical test theory, the reliability of a variable is defined to be the ratio of true score variance to observed score variance. In other words,

r_{xx} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)}. \quad (13)

Because it was defined earlier that the observed score variance was equal to 1.0, Equation 13 can be rewritten to say that r_xx = Var(T). Making this final substitution into Equation 12,

\rho_{ab} = \frac{r_{ab}}{\sqrt{r_{aa}}\,\sqrt{r_{bb}}}, \quad (14)

which is exactly equal to Equation 1 (save for the notation on variable names), or the CA. Though not provided here, a similar derivation can be used to obtain Equation 2, or the CA for d values.

Advanced Applications

There are some additional applications of the CA under relaxed assumptions. One of these primary applications is when the correlation between error terms is not assumed to be zero. The CA under this condition is as follows:

\rho_{xy} = \frac{r_{xy} - r_{e_x e_y}\sqrt{1 - r_{xx}}\,\sqrt{1 - r_{yy}}}{\sqrt{r_{xx}}\,\sqrt{r_{yy}}}, \quad (15)

where r_{e_x e_y} is the correlation between error scores for variables X and Y, and the other terms are as defined earlier.

It is also possible to correct part (or semipartial) correlations and partial correlations for the influences of measurement error. Using the standard formula for the partial correlation (in true score metric) between variables X and Y, controlling for Z, we get

\rho_{xy \cdot z} = \frac{\rho_{xy} - \rho_{xz}\,\rho_{yz}}{\sqrt{1 - \rho_{xz}^2}\,\sqrt{1 - \rho_{yz}^2}}. \quad (16)

Substituting terms from Equation 1 into Equation 16 yields a formula to estimate the true partial correlation between variables X and Y, controlling for Z:

\rho_{xy \cdot z} = \frac{\dfrac{r_{xy}}{\sqrt{r_{xx}}\sqrt{r_{yy}}} - \dfrac{r_{xz}}{\sqrt{r_{xx}}\sqrt{r_{zz}}} \cdot \dfrac{r_{yz}}{\sqrt{r_{yy}}\sqrt{r_{zz}}}}{\sqrt{1 - \left(\dfrac{r_{xz}}{\sqrt{r_{xx}}\sqrt{r_{zz}}}\right)^{2}}\,\sqrt{1 - \left(\dfrac{r_{yz}}{\sqrt{r_{yy}}\sqrt{r_{zz}}}\right)^{2}}}. \quad (17)

Finally, it is also possible to compute the partial correlation between variables X and Y, controlling for Z while allowing the error terms to correlate:
\rho_{xy \cdot z} = \frac{\dfrac{r_{xy} - r_{e_x e_y}\sqrt{1-r_{xx}}\sqrt{1-r_{yy}}}{\sqrt{r_{xx}}\sqrt{r_{yy}}} - \dfrac{r_{xz} - r_{e_x e_z}\sqrt{1-r_{xx}}\sqrt{1-r_{zz}}}{\sqrt{r_{xx}}\sqrt{r_{zz}}} \cdot \dfrac{r_{yz} - r_{e_y e_z}\sqrt{1-r_{yy}}\sqrt{1-r_{zz}}}{\sqrt{r_{yy}}\sqrt{r_{zz}}}}{\sqrt{1-\left(\dfrac{r_{xz} - r_{e_x e_z}\sqrt{1-r_{xx}}\sqrt{1-r_{zz}}}{\sqrt{r_{xx}}\sqrt{r_{zz}}}\right)^{2}}\,\sqrt{1-\left(\dfrac{r_{yz} - r_{e_y e_z}\sqrt{1-r_{yy}}\sqrt{1-r_{zz}}}{\sqrt{r_{yy}}\sqrt{r_{zz}}}\right)^{2}}} \quad (18)

Corrections for attenuation for part (or semipartial) correlations are also available under conditions similar to those of Equations 16 and 17.

Matthew J. Borneman

Correlation

to a cause–effect relationship between the variables in question. To infer cause and effect, it is necessary to conduct a controlled experiment involving an experimenter-manipulated independent variable in which subjects are randomly assigned to experimental conditions. Typically, data for which a correlation coefficient is computed are also evaluated with regression analysis. The latter is a methodology for deriving an equation that can be employed to estimate or predict a subject's score on one variable from the subject's score on another variable. This entry discusses the history of correlation and measures for assessing correlation.
as the predictor variable (represented symbolically by the letter X) and a second variable designated as the criterion variable (represented symbolically by the letter Y). The product-moment correlation is a measure of the degree to which the variables covary (i.e., vary in relation to one another). From a theoretical perspective, the product-moment correlation is the average of the products of the paired standard deviation scores of subjects on the two variables. The equation for computing the unbiased estimate of the population correlation is r = (\sum z_x z_y)/(n - 1).

The value r computed for a sample correlation coefficient is employed as an estimate of ρ (the lowercase Greek letter rho), which represents the correlation between the two variables in the underlying population. The value of r will always fall within the range -1 to +1 (i.e., -1 \le r \le +1). The absolute value of r (i.e., |r|) indicates the strength of the linear relationship between the two variables, with the strength of the relationship increasing as the absolute value of r approaches 1. When r = ±1, within the sample for which the correlation was computed, a subject's score on the criterion variable can be predicted perfectly from his or her score on the predictor variable. As the absolute value of r deviates from 1 and moves toward 0, the strength of the relationship between the variables decreases, such that when r = 0, prediction of a subject's score on the criterion variable from his or her score on the predictor variable will not be any more accurate than a prediction that is based purely on chance.

The sign of r indicates whether the linear relationship between the two variables is direct (i.e., an increase in one variable is associated with an increase in the other variable) or indirect (i.e., an increase in one variable is associated with a decrease in the other variable). The closer a positive value of r is to +1, the stronger (i.e., more consistent) the direct relationship between the variables, and the closer a negative value of r is to -1, the stronger the indirect relationship between the two variables. If the relationship between the variables is best described by a curvilinear function, it is quite possible that the value computed for r will be close to zero. Because of the latter possibility, it is always recommended that a researcher construct a scatterplot of the data. A scatterplot is a graph that summarizes the two scores of each subject with a point in two-dimensional space. By examining the configuration of the scatterplot, a researcher can ascertain whether linear correlational analysis is best suited for evaluating the data.

Regression analysis is employed with the data to derive the equation of a regression line (also known as the line of best fit), which is the straight line that best describes the relationship between the two variables. To be more specific, a regression line is the straight line for which the sum of the squared vertical distances of all the points from the line is minimal. When r = ±1, all the points will fall on the regression line, and as the value of r moves toward zero, the vertical distances of the points from the line increase.

The general equation for a regression line is Y' = a + bX, where a = the Y intercept, b = the slope of the line (with a positive correlation yielding a positively sloped line, and a negative correlation yielding a negatively sloped line), X represents a given subject's score on the predictor variable, and Y' is the score on the criterion variable predicted for the subject.

An important part of regression analysis involves the analysis of residuals. A residual is the difference between the Y' value predicted for a subject and the subject's actual score on the criterion variable. Use of the regression equation for predictive purposes assumes that subjects for whom scores are being predicted are derived from the same population as the sample for which the regression equation was computed. Although numerous hypotheses can be evaluated within the framework of the product-moment correlation and regression analysis, the most common null hypothesis evaluated is that the underlying population correlation between the variables equals zero. It is important to note that in the case of a large sample size, computation of a correlation close to zero may result in rejection of the latter null hypothesis. In such a case, it is critical that a researcher distinguish between statistical significance and practical significance, in that it is possible that a statistically significant result derived for a small correlation will be of no practical value; in other words, it will have minimal predictive utility.

A value computed for a product-moment correlation will be reliable only if certain assumptions regarding the underlying population distribution have not been violated. Among the assumptions for the product-moment correlation are the following: (a) the distribution of the two variables is bivariate normal (i.e., each of the variables, as well as the linear combination of the variables, is distributed normally); (b) there is homoscedasticity (i.e., the strength of the relationship between the two variables is equal across the whole range of both variables); and (c) the residuals are independent.

Alternative Correlation Coefficients

A common criterion for determining which correlation should be employed for measuring the degree of association between two or more variables is the levels of measurement represented by the predictor and criterion variables. The product-moment correlation is appropriate to employ when both variables represent either interval- or ratio-level data. A special case of the product-moment correlation is the point-biserial correlation, which is employed when one of the variables represents interval or ratio data and the other variable is represented on a dichotomous nominal scale (e.g., two categories, such as male and female). When the original scale of measurement for both variables is interval or ratio but scores on one of the variables have been transformed into a dichotomous nominal scale, the biserial correlation is the appropriate measure to compute. When the original scale of measurement for both variables is interval or ratio but scores on both of the variables have been transformed into a dichotomous nominal scale, the tetrachoric correlation is employed.

Multiple correlation involves a generalization of the product-moment correlation to evaluate the relationship between two or more predictor variables with a single criterion variable, with all the variables representing either interval or ratio data. Within the context of multiple correlation, partial and semipartial correlations can be computed. A partial correlation measures the relationship between two of the variables after any linear association one or more additional variables have with the two variables has been removed. A semipartial correlation measures the relationship between two of the variables after any linear association one or more additional variables have with one of the two variables has been removed. An extension of multiple correlation is canonical correlation, which involves assessing the relationship between a set of predictor variables (i.e., two or more) and a set of criterion variables.

A number of measures of association have been developed for evaluating data in which the scores of subjects have been rank ordered or the relationship between two or more variables is summarized in the format of a contingency table. Such measures may be employed when the data are presented in the latter formats or have been transformed from an interval or ratio format to one of the latter formats because one or more of the assumptions underlying the product-moment correlation are believed to have been saliently violated. Although, like the product-moment correlation, the range of values for some of the measures that will be noted is between -1 and +1, others may assume only a value between 0 and +1 or may be even more limited in range. Some alternative measures do not describe a linear relationship, and in some instances a statistic other than a correlation coefficient may be employed to express the degree of association between the variables.

Two methods of correlation that can be employed as measures of association when both variables are in the form of ordinal (i.e., rank order) data are Spearman's rank order correlation and Kendall's tau. Kendall's coefficient of concordance is a correlation that can be employed as a measure of association for evaluating three or more sets of ranks.

A number of measures of correlation or association are available for evaluating categorical data that are summarized in the format of a two-dimensional contingency table. The following measures can be computed when both the variables are dichotomous in nature: phi coefficient and Yule's Q. When both variables are dichotomous or one or both of the variables have more than two categories, the following measures can be employed: contingency coefficient, Cramer's phi, and odds ratio.

The intraclass correlation and Cohen's kappa are measures of association that can be employed for assessing interjudge reliability (i.e., degree of agreement among judges), the former being employed when judgments are expressed in the form of interval or ratio data, and the latter when the data are summarized in the format of a contingency table.
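The measures described above can be sketched numerically. The block below (illustrative data only, not from the entry) computes the product-moment correlation from standardized scores, derives the regression line Y' = a + bX from it, and obtains Spearman's rank order correlation by applying the product-moment formula to ranks (a simple version without tie handling):

```python
import math

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

def pearson_r(xs, ys):
    """Product-moment correlation: r = sum(z_x * z_y) / (n - 1)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def regression_line(xs, ys):
    """Least-squares line Y' = a + bX, with slope b = r * (s_y / s_x)."""
    r = pearson_r(xs, ys)
    b = r * sample_sd(ys) / sample_sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def ranks(v):
    """Rank order of each score, 1 = smallest (no tie handling)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    """Spearman's rank order correlation: product-moment r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# A perfect linear relationship gives r = 1 and recovers the line exactly.
a, b = regression_line([1, 2, 3, 4], [3, 5, 7, 9])      # data lie on y = 1 + 2x
print(round(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]), 3))  # 1.0
print(round(a, 3), round(b, 3))                         # 1.0 2.0

# A monotonic but curvilinear relationship: r drops below 1, rho stays 1.
xs, ys = [1, 2, 3, 4, 5], [1, 8, 27, 64, 125]           # y = x**3
print(round(pearson_r(xs, ys), 3))
print(round(spearman_rho(xs, ys), 3))                   # 1.0
```

The cubic example illustrates the point made above: a curvilinear but monotonic relationship lowers the product-moment r, while a rank-based measure still registers a perfect association.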
Correspondence Analysis
So Aloz and Zola have the same punctuation style and differ only in their prolixity. A good analysis should reveal such a similarity of style, but as Figure 1 shows, PCA fails to reveal this similarity. In this figure, we have projected Aloz (as a supplementary element) in the analysis of the authors, and Aloz is, in fact, farther away from Zola than any other author. This example shows that using PCA to analyze the style of the authors is not a good idea, because a PCA is sensitive mainly to the number of punctuation marks rather than to how punctuation is used. The "style" of the authors is, in fact, expressed by the relative frequencies of their use of the punctuation marks. This suggests that the data matrix should be transformed such that each author is described by the proportion of his usage of the punctuation marks rather than by the number of marks used. The transformed data matrix is called a row profile matrix. In order to obtain the row profiles, we divide each row by its sum. This matrix of row profiles is denoted R. It is computed as

R = \mathrm{diag}\{X 1_{J \times 1}\}^{-1} X =
\begin{bmatrix}
.2905 & .4861 & .2234 \\
.2704 & .5159 & .2137 \\
.3217 & .5135 & .1648 \\
.2865 & .6024 & .1110 \\
.2448 & .6739 & .0812 \\
.3896 & .4903 & .1201
\end{bmatrix} \quad (4)

(where diag transforms a vector into a diagonal matrix with the elements of the vector on the diagonal, and 1_{J \times 1} is a J × 1 vector of ones).

The "average writer" would be someone who uses each punctuation mark according to its proportion in the sample. The profile of this average writer would be the barycenter (also called centroid, center of mass, or center of gravity) of the matrix. Here, the barycenter of R is a vector with J = 3 elements. It is denoted c and computed as

c^T = \left(1_{1 \times I}\, X\, 1_{J \times 1}\right)^{-1} \times 1_{1 \times I}\, X = [\,.2973 \quad .5642 \quad .1385\,], \quad (5)

where the first factor is the inverse of the total of X and 1_{1 \times I} X is the vector of the column totals of X.

If all authors punctuate the same way, they all punctuate like the average writer. Therefore, in order to study the differences among authors, we need to analyze the matrix of deviations from the average writer. This matrix of deviations is denoted Y, and it is computed as

Y = R - 1_{I \times 1} \times c^T =
\begin{bmatrix}
-.0068 & -.0781 & .0849 \\
-.0269 & -.0483 & .0752 \\
.0244 & -.0507 & .0263 \\
-.0107 & .0382 & -.0275 \\
-.0525 & .1097 & -.0573 \\
.0923 & -.0739 & -.0184
\end{bmatrix}. \quad (6)

Masses (Rows) and Weights (Columns)

In CA, a mass is assigned to each row and a weight to each column. The mass of each row reflects its importance in the sample. In other words, the mass of each row is the proportion of this row in the total of the table. The masses of the rows are stored in a vector denoted m, which is computed as

m = \left(1_{1 \times I}\, X\, 1_{J \times 1}\right)^{-1} \times X\, 1_{J \times 1} = [\,.0189 \quad .1393 \quad .2522 \quad .3966 \quad .1094 \quad .0835\,]^T, \quad (7)

where the first factor is the inverse of the total of X and X 1_{J \times 1} is the vector of the row totals of X. From the vector m, we define the matrix of masses as M = diag(m).

The weight of each column reflects its importance for discriminating among the authors. So the weight of a column reflects the information this column provides to the identification of a given row. Here, the idea is that columns that are used often do not provide much information, and columns that are used rarely provide much information. A measure of how often a column is used is given by the proportion of times it is used, which is equal to the value of this column's component of the barycenter. Therefore, the weight of a column is computed as the inverse of this column's component of the barycenter.
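Equations 4 through 7 can be partially replicated from the printed values. The sketch below checks that each row of R is a profile (sums to 1), recomputes the deviations Y = R - 1c^T, and forms the column weights as inverses of the barycenter components. Because R and c are printed rounded to four decimals, recomputed entries can differ from the printed ones in the last digit:

```python
# R and c are the rounded values printed in Equations 4 and 5.
R = [
    [.2905, .4861, .2234],
    [.2704, .5159, .2137],
    [.3217, .5135, .1648],
    [.2865, .6024, .1110],
    [.2448, .6739, .0812],
    [.3896, .4903, .1201],
]
c = [.2973, .5642, .1385]

# Each row of R is a profile, so it sums to 1 (up to rounding error).
for row in R:
    assert abs(sum(row) - 1.0) < 5e-4

# Y = R - 1 * c^T: deviations of each author from the "average writer".
Y = [[round(r_ij - c_j, 4) for r_ij, c_j in zip(row, c)] for row in R]
print(Y[0])   # [-0.0068, -0.0781, 0.0849]

# Column weights: rarely used punctuation marks get large weights.
weights = [1.0 / c_j for c_j in c]
print([round(w, 4) for w in weights])
```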
where P is the matrix of the left generalized singular vectors, Q is the matrix of the right generalized singular vectors, and Δ is the diagonal matrix of the singular values. From this we get

Y = P \Delta Q^T =
\begin{bmatrix}
1.7962 & 0.9919 \\
1.4198 & 1.4340 \\
0.7739 & 0.3978 \\
0.6878 & 0.0223 \\
1.6801 & 0.8450 \\
0.3561 & 2.6275
\end{bmatrix}
\begin{bmatrix}
.1335 & 0 \\
0 & .0747
\end{bmatrix}
\begin{bmatrix}
0.1090 & 0.4114 & 0.3024 \\
0.4439 & 0.2769 & 0.1670
\end{bmatrix}, \quad (11)

where the three matrices are P, Δ, and Q^T, respectively.

The rows of the matrix X are now represented by their factor scores (which are the projections of the observations onto the singular vectors). The row factor scores are stored in an I × L = 6 × 2 matrix (where L stands for the number of nonzero singular values) denoted F. This matrix is obtained as

F = P \Delta =
\begin{bmatrix}
0.2398 & 0.0741 \\
0.1895 & 0.1071 \\
0.1033 & 0.0297 \\
0.0918 & 0.0017 \\
0.2243 & 0.0631 \\
0.0475 & 0.1963
\end{bmatrix}. \quad (12)

The variance of the factor scores for a given dimension is equal to the squared singular value of this dimension. (The variance of the observations is computed taking into account their masses.) Or, equivalently, we say that the variance of the factor scores is equal to the eigenvalue of this dimension (i.e., the eigenvalue is the square of the singular value). This can be checked as follows:
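As a numeric aside, the arithmetic of Equations 12 and 13 can be replicated from the values printed above (magnitudes only; the signs of the entries of P do not affect these magnitudes):

```python
# P and the singular values are the (rounded) magnitudes printed in Equation 11.
P = [
    [1.7962, 0.9919],
    [1.4198, 1.4340],
    [0.7739, 0.3978],
    [0.6878, 0.0223],
    [1.6801, 0.8450],
    [0.3561, 2.6275],
]
singular_values = [0.1335, 0.0747]

# Equation 12: F = P * Delta rescales each column of P by its singular value.
F = [[round(p_il * singular_values[l], 4) for l, p_il in enumerate(row)]
     for row in P]
print(F[0])   # [0.2398, 0.0741]

# Equation 13: the eigenvalue of each dimension is the squared singular value.
eigenvalues = [round(s ** 2, 4) for s in singular_values]
print(eigenvalues)   # [0.0178, 0.0056]
```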
F^T M F = \Delta^2 = \Lambda =
\begin{bmatrix}
0.1335^2 & 0 \\
0 & 0.0747^2
\end{bmatrix} =
\begin{bmatrix}
0.0178 & 0 \\
0 & 0.0056
\end{bmatrix}. \quad (13)

We can display the results by plotting the factor scores as a map on which each point represents a row of the matrix X (i.e., each point represents an author). This is done in Figure 2. On this map, the first dimension seems to be related to time (the rightmost authors are earlier authors, and the leftmost authors are more recent), with the exception of Giraudoux, who is a very recent author. The second dimension singularizes Giraudoux. These factors will be easier to understand after we have analyzed the columns. This can be done by analyzing the matrix X^T. Equivalently, it can be done by what is called dual analysis.

[Figure 3: In three dimensions, the simplex is a two-dimensional triangle whose vertices are the vectors [1,0,0], [0,1,0], and [0,0,1]. The axes correspond to the Period, Comma, and Other proportions, and the point for Rousseau, with coordinates [.2905 .4861 .2234], is plotted on the triangle.]

[Figure 4: The simplex as a triangle. Panel (a) shows the simplex with the point for Rousseau; panel (b) shows all six authors and the barycenter (*).]

Geometry of the Generalized Singular Value Decomposition

CA has a simple geometric interpretation. For example, when a row profile is interpreted as a vector, it can be represented as a point in a multidimensional space. Because the sum of a profile is equal to one, row profiles are, in fact, points in a J - 1 dimensional space. Also, because the components of a row profile take value in the interval [0, 1], the points representing these row profiles can lie only in the subspace whose "extreme points" have one component equal to one and all other components equal to zero. This subspace is called a simplex. For example, Figure 3 shows the two-dimensional simplex
corresponding to the subspace of all possible row profiles with three components. As an illustration, the point describing Rousseau (with coordinates equal to [.2905 .4861 .2234]) is also plotted. For this particular example, the simplex is an equilateral triangle, and so the three-dimensional row profiles can conveniently be represented as points on this triangle, as illustrated in Figure 4a, which shows the simplex of Figure 3 in two dimensions. Figure 4b shows all six authors and the barycenter.

The weights of the columns, which are used as constraints in the GSVD, also have a straightforward geometric interpretation. As illustrated in Figure 5, each side of the simplex is stretched by a quantity equal to the square root of the weight of the dimension it represents (we use the square root because we are interested in squared distances but not in squared weights, so using the square root of the weights ensures that the squared distances between authors will take into account the weights rather than the squared weights).

The masses of the rows are taken into account to find the dimensions. Specifically, the first factor is computed in order to obtain the maximum possible value of the sum of the masses times the squared projections of the authors' points (i.e., the projections have the largest possible variance). The second factor is constrained to be orthogonal (taking into account the masses) to the first one and to have the largest variance for the projections. The remaining factors are computed with similar constraints. Figure 6 shows the stretched simplex, the author points, and the two factors (note that the origin of the factors is the barycenter of the authors).

The stretched simplex shows the whole space of the possible profiles. Figure 6 shows that the authors occupy a small portion of the whole space: They do not vary much in the way they punctuate. Also, the stretched simplex represents the columns as the vertices of the simplex: The columns are represented as row profiles with the column component being one and all the other components being zeros. This representation is called an asymmetric representation because the rows always have a dispersion smaller than (or equal to) that of the columns.

Figure 6   Correspondence Analysis: The "Stretched Simplex" Along With the Factorial Axes
Note: The projections of the authors' points onto the factorial axes give the factor scores.

Distance, Inertia, Chi-Square, and Correspondence Analysis

Chi-Square Distances

In CA, the Euclidean distance in the stretched simplex is equivalent to a weighted distance in the
Correspondence Analysis 273
original space. For reasons that will be made more clear later, this distance is called the χ² distance. The χ² distance between two row profiles i and i′ can be computed from the factor scores as

d²_{i,i′} = Σ_ℓ^L (f_{i,ℓ} − f_{i′,ℓ})²   (14)

or from the row profiles as

d²_{i,i′} = Σ_j^J w_j (r_{i,j} − r_{i′,j})².   (15)

Inertia

The variability of the row profiles relative to their barycenter is measured by a quantity, akin to variance, called inertia and denoted I. The inertia of the rows to their barycenter is computed as the weighted sum of the squared distances of the rows to their barycenter. We denote by d²_{c,i} the (squared) distance of the ith row to the barycenter, computed as

d²_{c,i} = Σ_j^J w_j (r_{i,j} − c_j)² = Σ_ℓ^L f²_{i,ℓ},   (16)

where L is the number of factors extracted by the CA of the table [this number is smaller than or equal to min(I, J) − 1]. The inertia of the rows to their barycenter is then computed as

I = Σ_i^I m_i d²_{c,i}.   (17)

The inertia can also be expressed as the sum of the eigenvalues (see Equation 13):

I = Σ_ℓ^L λ_ℓ.   (18)

This shows that in CA, each factor extracts a portion of the inertia, with the first factor extracting the largest portion, the second factor extracting the largest portion left of the inertia, and so forth.

Inertia and the Chi-Square Test

It is interesting that the inertia in CA is closely related to the chi-square test. This test is traditionally performed on a contingency table in order to test the independence of the rows and the columns of the table. Under independence, the frequency of each cell of the table should be proportional to the product of its row and column marginal probabilities. So if we denote by x_{+,+} the grand total of matrix X, the expected frequency of the cell at the ith row and jth column, denoted E_{i,j}, is computed as

E_{i,j} = m_i c_j x_{+,+}.   (19)

The chi-square test statistic, denoted χ², is computed as the sum of the squared differences between the actual values and the expected values, weighted by the expected values:

χ² = Σ_{i,j} (x_{i,j} − E_{i,j})² / E_{i,j}.   (20)

When rows and columns are independent, χ² follows a chi-square distribution with (I − 1)(J − 1) degrees of freedom. Therefore, χ² can be used to evaluate the likelihood of the row and column independence hypothesis. The statistic χ² can be rewritten to show its close relationship with the inertia of CA, namely:

χ² = I x_{+,+}.   (21)

This shows that CA analyzes, in orthogonal components, the pattern of deviations from independence.

Dual Analysis

In a contingency table, the rows and the columns of the table play a similar role, and therefore the analysis that was performed on the rows can also be performed on the columns by exchanging the role of the rows and the columns. This is illustrated by the analysis of the columns of matrix X, or equivalently by the rows of the transposed matrix X^T. The matrix of column profiles for X^T is called O (like cOlumn) and is computed as
O = diag{X^T 1}⁻¹ X^T,   (22)

where 1 is an I × 1 vector of ones.
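Equation 22 can be transcribed into code directly. A throwaway numpy sketch (the counts here are assumed, purely for illustration):

```python
import numpy as np

# Equation 22: column profiles are the rows of X^T, each divided by its
# sum, written as diag{X^T 1}^-1 X^T. Illustrative 3x3 counts.
X = np.array([[10.0, 20.0, 5.0],
              [4.0, 8.0, 2.0],
              [6.0, 12.0, 3.0]])
ones = np.ones((X.shape[0], 1))                    # the I x 1 vector 1
O = np.linalg.inv(np.diagflat(X.T @ ones)) @ X.T   # diag{X^T 1}^-1 X^T
assert np.allclose(O.sum(axis=1), 1.0)             # each profile sums to 1
```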
Weights and masses of the columns analysis are the inverse of their equivalent for the row analysis. This implies that the punctuation marks factor scores are obtained from the GSVD with the constraints imposed by the two matrices W⁻¹ (masses for the rows) and M⁻¹ (weights for the columns; compare with Equation 10). This gives

Z = [ 0.3666   1.4932 ]   [ .1335   0     ]
    [ 0.7291   0.4907 ] × [ 0       .0747 ] × [ 0.0340  0.1977  0.1952  0.2728  0.1839  0.0298 ]
    [ 2.1830   1.2056 ]                       [ 0.0188  0.1997  0.1003  0.0089  0.0925  0.2195 ]   (24)
         U                       Δ                                  V^T

The factor scores for the punctuation marks are stored in a J = 3 × L = 2 matrix called G, which is computed in the same way F was computed (see Equation 12). So G is computed as

G = UΔ = [ 0.0489   0.1115 ]
         [ 0.0973   0.0367 ]   (25)
         [ 0.2914   0.0901 ]

The total inertia is the same for both analyses:

I = .1335² + .0747² = .0178 + .0056 = 0.0234.   (26)

Also, the generalized singular value decomposition of one set (say, the columns) can be obtained from the other one (say, the rows). For example, the generalized singular vectors of the analysis of the columns can be computed directly from the analysis of the rows as
… factor scores of the columns from their profile matrix (i.e., the matrix O), and from the factor scores of the rows. Specifically, the equation that gives the values of G from F is

G = O F Δ⁻¹,   (29)

and conversely, F could be obtained from G as

F = R G Δ⁻¹.   (30)

These equations are called transition formulas from the rows to the columns (and vice versa), or simply the transition formulas.

The factor scores for the rows (F) and the columns (G) are obtained as

F = D_m⁻¹ S   and   G = D_c⁻¹ T.   (32)

Figure 7   Correspondence Analysis of the Punctuation of Six Authors
Notes: Comma, period, and other marks are active columns; Rousseau, Chateaubriand, Hugo, Zola, Proust, and Giraudoux are active rows. Colon, semicolon, interrogation, and exclamation are supplementary columns; Abdi is a supplementary row.

Supplementary Elements

Often in CA we want to know the position in the analysis of rows or columns that were not analyzed. These rows or columns are called illustrative or supplementary rows or columns (or supplementary observations or variables). By contrast with the appellation of supplementary (i.e., not used to compute the factors), the active elements are those used to compute the factors. Table 2 shows the punctuation data with four additional columns giving the detail of the "other punctuation marks" (i.e., the exclamation point, the question mark, the semicolon, and the colon). These punctuation marks were not analyzed for two reasons: First, these marks are used rarely, and therefore they would distort the factor space, and second, the "other" marks comprise all the other marks, and therefore to analyze them with "other" would be redundant. There is also a new author in Table 2: We counted the marks used by a different author, namely, Hervé Abdi in the first chapter of his 1994
book called Les réseaux de neurones. This author was not analyzed because the data are available for only one chapter (not his complete work) and also because this author is not a literary author.

The values of the projections on the factors for the supplementary elements are computed from the transition formula. Specifically, a supplementary row is projected into the space defined using the transition formula for the active rows (cf. Equation 30) and replacing the active row profiles by the supplementary row profiles. So if we denote by R_sup the matrix of the supplementary row profiles, then F_sup, the matrix of the supplementary row factor scores, is computed as

F_sup = R_sup × G × Δ⁻¹.   (33)

Table 3 provides factor scores and descriptives for the rows. For example, the factor scores of the author Abdi are computed as

F_sup = R_sup G Δ⁻¹ = [ 0.0908   0.5852 ].   (34)

Supplementary columns are projected into the factor space using the transition formula from the active rows (cf. Equation 29) and replacing the active column profiles by the supplementary column profiles. If we denote by O_sup the supplementary column profile matrix, then G_sup, the matrix of the supplementary column factor scores, is computed as

G_sup = O_sup F Δ⁻¹.   (35)

Table 4 gives the factor scores for the supplementary elements.

Little Helpers: Contributions and Cosines

Contributions and cosines are coefficients whose goal is to facilitate the interpretation. The contributions identify the important elements for a given factor, whereas the (squared) cosines identify the factors important for a given element. These coefficients express importance as the proportion of something in a total. The contribution is the ratio of the weighted squared projection of an element on a factor to the sum of the weighted squared projections of all the elements for this factor (which happens to be the eigenvalue of this factor). The squared cosine is the ratio of the squared projection of an element on a factor to the sum of the squared projections of this element on all the factors (which happens to be the squared distance from this point to the barycenter). Contributions and squared cosines are proportions that vary between 0 and 1.
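These coefficients, and the supplementary projection of Equation 33, can be sketched in a few lines of numpy. All counts below are assumed for illustration, not taken from the entry's tables:

```python
import numpy as np

# Illustrative author-by-punctuation counts (assumed values).
X = np.array([
    [7836, 13112, 6026],
    [53655, 102383, 42413],
    [115615, 184541, 59226],
    [161926, 340479, 62754],
    [38177, 105101, 12670],
    [46371, 58367, 14299],
], dtype=float)

P = X / X.sum()                          # correspondence matrix
m, c = P.sum(axis=1), P.sum(axis=0)      # row masses, column masses
S = (P - np.outer(m, c)) / np.sqrt(np.outer(m, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
U, s, Vt = U[:, :2], s[:2], Vt[:2, :]    # L = 2 nontrivial factors
F = U * s / np.sqrt(m)[:, None]          # active row factor scores
G = Vt.T * s / np.sqrt(c)[:, None]       # active column factor scores
eig = s ** 2                             # eigenvalues

# Equation 33: a supplementary row (a hypothetical new author) is
# projected through the transition formula, F_sup = R_sup G Delta^-1.
x_sup = np.array([216.0, 139.0, 26.0])   # made-up counts
f_sup = (x_sup / x_sup.sum()) @ G / s

# Contribution of row i to factor l: weighted squared projection divided
# by the eigenvalue; the contributions to any one factor sum to 1.
ctr = m[:, None] * F ** 2 / eig
assert np.allclose(ctr.sum(axis=0), 1.0)

# Squared cosine: squared projection divided by the squared distance to
# the barycenter; for any one row they sum to 1 over the factors.
cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)
assert np.allclose(cos2.sum(axis=1), 1.0)
```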
Specifically, the squared cosines, denoted h, of row i and of column j for factor ℓ are obtained as

h_{i,ℓ} = f²_{i,ℓ} / Σ_ℓ f²_{i,ℓ} = f²_{i,ℓ} / d²_{c,i}   and   h_{j,ℓ} = g²_{j,ℓ} / Σ_ℓ g²_{j,ℓ} = g²_{j,ℓ} / d²_{r,j}.   (36)

Squared cosines help in locating the factors important for a given observation. The contributions, denoted b, of row i to factor ℓ and of column j to factor ℓ are obtained as

b_{i,ℓ} = m_i f²_{i,ℓ} / λ_ℓ   and   b_{j,ℓ} = c_j g²_{j,ℓ} / λ_ℓ.   (37)

Contributions help locate the observations important for a given factor. An often-used rule of thumb is to consider as important the contributions that are larger than the average contribution, which is equal to one divided by the number of elements (i.e., 1/I for the rows and 1/J for the columns). A dimension is then interpreted by opposing the important elements with positive factor scores to the important elements with negative factor scores.
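The main numerical identities of this entry, that the masses-weighted variance of the factor scores equals the eigenvalues (Equation 13), that the inertia equals the sum of the eigenvalues (Equations 16-18), that χ² = I x_{+,+} (Equations 19-21), and the transition formulas (Equations 29-30), can all be verified with a short numpy sketch. The counts below are illustrative, in the spirit of the entry's punctuation table but assumed rather than copied:

```python
import numpy as np

# Counts of periods, commas, and other marks for six authors
# (illustrative values).
X = np.array([
    [7836, 13112, 6026],
    [53655, 102383, 42413],
    [115615, 184541, 59226],
    [161926, 340479, 62754],
    [38177, 105101, 12670],
    [46371, 58367, 14299],
], dtype=float)

N = X.sum()                              # grand total x++
P = X / N                                # correspondence matrix
m = P.sum(axis=1)                        # row masses
c = P.sum(axis=0)                        # column masses (barycenter)
R = P / m[:, None]                       # row profiles
O = (P / c).T                            # column profiles (Equation 22)

# CA through the SVD of the standardized residuals.
S = (P - np.outer(m, c)) / np.sqrt(np.outer(m, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
keep = s > 1e-12                         # drop the trivial dimension
U, s, Vt = U[:, keep], s[keep], Vt[keep, :]

F = U * s / np.sqrt(m)[:, None]          # row factor scores
G = Vt.T * s / np.sqrt(c)[:, None]       # column factor scores
eig = s ** 2                             # eigenvalues

# Equation 13: masses-weighted variance of factor scores = eigenvalues.
assert np.allclose(F.T @ np.diag(m) @ F, np.diag(eig))

# Equations 16-18: inertia = weighted chi-square distances to the
# barycenter = sum of the eigenvalues.
d2 = (((R - c) ** 2) / c).sum(axis=1)    # squared chi-square distances
inertia = (m * d2).sum()
assert np.isclose(inertia, eig.sum())

# Equations 19-21: the chi-square statistic equals inertia times x++.
E = np.outer(m, c) * N                   # expected frequencies
chi2 = ((X - E) ** 2 / E).sum()
assert np.isclose(chi2, inertia * N)

# Equations 29-30: transition formulas between row and column scores.
assert np.allclose(G, O @ F / s)         # G = O F Delta^-1
assert np.allclose(F, R @ G / s)         # F = R G Delta^-1
```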
278 Correspondence Principle
formalism would lead to classical physics when n → ∞, where n is the quantum number. Although there were many previous uses of the concept, the important issue here is not to whom the concept can be attributed, but an understanding of the various ways that it can be used in scientific and philosophic research.

The principle is important for the continuity in science. There are two ways of thinking about such continuity. A theory T covers a set of observations S. A new observation s1 is detected. T cannot explain s1. Scientists first try to adapt T to be able to account for s1. But if T is not in principle able to explain s1, then scientists will start to look for another theory, T*, that can explain S and s1. The scientist will try to derive T* by using CP as a determining factor. In such a case, T* should lead to T at a certain limit.

Nonetheless, sometimes there may be a set of new observations, S1, for which it turns out that a direct derivation of T* from T that might in principle account for S1 is not possible or at least does not seem to be possible. Then the scientist will try to suggest T* separately from the accepted set of boundary conditions and the observed set of S and S1. But because T was able to explain the set S, it is highly probable that T has a certain limit of correct assumptions that led to its ability to explain S. Therefore, any new theory T* that would account for S and S1 should resemble T at a certain limit. This can be obtained by specifying a certain correspondence limit at which the new formalism of T* will lead to the old formalism of T.

These two ways of obtaining T* are the general forms of applying the correspondence principle. Nevertheless, the practice of science presents us with many ways of connecting T* to T or parts of it. Hence it is important to discuss the physicists' different treatments of the CP. Moreover, the interpretation of CP and the implications of using CP will determine our picture of science and the future development of science; hence, it is important to discuss the philosophical implications of CP and the different philosophical understandings of the concept.

Formal Correspondence

In the current state of the relation between modern physics and classical physics, there are four kinds of formal correspondence between modern and classical physics.

Old Correspondence Principle (Numerical Correspondence)

Planck stressed the relation between his "radical" assumption of discrete energy levels that are proportional to frequency, and the classical theory. He insisted that the terms in the new equation refer to the very same classical properties. He formulated the CP so that the numerical value of

lim_{h→0} [Quantum physics] = [Classical physics].

He demonstrated that the radiation law for the energy density at frequency ν,

u(ν) = 8πhν³ / (c³(e^{hν/kT} − 1)),   (1)

corresponds numerically in the limit h → 0 to the classical Rayleigh–Jeans law:

u(ν) = 8πkTν² / c³,   (2)

where k is Boltzmann's constant, T is the temperature, and c is the speed of light. This kind of correspondence entails that the new theory should resemble the old one not just at the mathematical level but also at the conceptual level.

Configuration Correspondence Principle (Law Correspondence)

The configuration correspondence principle claims that the laws of new theories should correspond to the laws of the old theory. In the case of quantum and classical physics, quantum laws correspond to the classical laws when the probability density of the quantum state coincides with the classical probability density. Take, for example, a harmonic oscillator that has a classical probability density

P_C(x) = 1 / (π √(x0² − x²)),   (3)

where x is the displacement. Now if we superimpose the plot of this probability onto that of the quantum probability density |ψ_n|² of the eigenstates of the system and take (the quantum number) n → ∞, we will obtain Figure 1 below. As …
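The numerical correspondence of Equations 1 and 2 can be checked directly: hold the physical constants fixed and let h shrink. A minimal sketch, with the frequency and temperature chosen arbitrarily:

```python
import math

# Equations 1 and 2: Planck's radiation law tends to the Rayleigh-Jeans
# law as h -> 0. Constants in SI units.
k = 1.380649e-23      # Boltzmann's constant (J/K)
c = 2.99792458e8      # speed of light (m/s)

def u_planck(nu, T, h):
    """Planck energy density at frequency nu (Equation 1)."""
    return 8 * math.pi * h * nu**3 / (c**3 * (math.exp(h * nu / (k * T)) - 1))

def u_rayleigh_jeans(nu, T):
    """Classical Rayleigh-Jeans energy density (Equation 2)."""
    return 8 * math.pi * k * T * nu**2 / c**3

nu, T = 1.0e14, 300.0   # arbitrary frequency and temperature

# With the physical value of h the two laws disagree badly ...
ratio_physical = u_planck(nu, T, 6.62607015e-34) / u_rayleigh_jeans(nu, T)
# ... but as h -> 0 the ratio tends to 1, the numerical correspondence.
ratio_limit = u_planck(nu, T, 1.0e-40) / u_rayleigh_jeans(nu, T)

assert ratio_physical < 0.01
assert abs(ratio_limit - 1.0) < 1e-4
```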
facing frequency correspondence. The aim of form correspondence is to prove that classical frequency and quantum frequency have the same form. So, if ν_Q denotes quantum frequency, ν_C classical frequency, and E energy, then form correspondence is satisfied if ν_C(E) has the same functional form as ν_Q(E). Then, by using a dipole approximation, Liboff showed that the quantum transition between state s + n and state s, where s >> n, gives the relation

ν_Q(E) ≈ n(E_s / 2ma²)^{1/2}.   (5)

He also noticed that if we treat the same system classically (particles of energy E in a cubical box), the calculation of the radiated power in the nth vibrational mode is given by the expression

ν_C(E) ≈ n(E / 2ma²)^{1/2}.   (6)

Both frequencies have the same form, even if one is characterizing quantum frequency and the other classical, and even if their experimental treatment differs. Hence, form CP is satisfied.

But such correspondence is not problem free; in the classical case, E denotes the average energy value of an ensemble of nth harmonic frequency, but in the quantum case, it denotes the eigenenergy of that level. Also, in the quantum case, the energy is discrete, and the only way to assert that the quantum frequency yields the classical one is by saying that when the quantum number is very big, the number of points that coincide with the classical frequency will increase, using the dipole approximation, which asserts that the distance between the points in the quantum case is assumed small. Hence the quantum case does not resemble the classical case as such, but it coincides with the average of an ensemble of classical cases.

The main thrust of form correspondence is that it can relate a branch of physics to a different branch on the basis of form resemblance, such as in the case of superconductivity. Here, a quantum formula corresponds to classical equations if we can change the quantum formula in the limit into a form where it looks similar to a classical form. The case of Josephson junctions in superconductivity, which are an important factor in building superconducting quantum interference devices, presents a perfect demonstration of such a concept. Brian David Josephson proved that the relation between the phase difference and the voltage is given by ∂δ/∂t = (2e/ℏ)V; that is, the voltage is V = (ℏ/2e)(∂δ/∂t). Now, by the assertion that the Josephson junction would behave as a classical circuit, the total current would be

I = I_c sin δ + (ℏ/2Re)(dδ/dt) + (ℏC/2e)(d²δ/dt²).   (7)

This equation relates the current with the phase difference but without any direct reference to the voltage. Furthermore, if we apply form correspondence, Equation 7 is analogous to the equation of a pendulum in classical mechanics. The total torque τ on the pendulum would be

τ = M(d²θ/dt²) + D(dθ/dt) + τ0 sin θ,   (8)

where M is the moment of inertia, D is the viscous damping, and τ is the applied torque.

Both these equations have the general mathematical form

Y = Y0 sin x + B(dx/dt) + A(d²x/dt²).   (9)

This kind of correspondence can be widely used to help in the solution of many problems in physics. Therefore, to find new horizons in physics, some might even think of relating some of the new theories that have not yet applied CP. Such is the case with form correspondence between quantum chaos and classical chaos. The argument runs as follows: Classical chaos exists. If quantum mechanics is to be counted as a complete theory in describing nature, then it ought to have a notion that corresponds to classical chaos. That notion can be called quantum chaos. But what are the possible things that resemble chaotic behavior in quantum systems? The reply gave rise to quantum chaos. However, it turns out that a direct correspondence between the notion of chaos in quantum mechanics and that in classical mechanics does not exist.

Therefore, form correspondence would be fruitful here. Instead of corresponding quantum chaos to classical chaos, we can correspond both of them to a third entity. Classical chaos goes in a certain limit to a certain form, and quantum chaos goes to the same form at the same limit: …
correct. Why? Mathematically speaking, if we have any finite set of observations, then there are many possible mathematical models that can describe this set. Hence, how can we determine that the model that was picked by the old science was the right one?

But even if we accept CP as a heuristic device, there are many ways that the concept can be applied. Each of these ways has a different set of problems for realists, and it is not possible to accept any generalized form of correspondence.

The realist position was challenged by many philosophers. Kuhn proved that during scientific revolutions the new science adopts a new paradigm in which the wordings of the old science might continue, but with different meanings. He demonstrated such a change with mass: The concept of mass in relativity is not the same as Newtonian mass. Feyerabend asserted that the changes between new science and old science make them incommensurable with each other. Hence, the realist notion of approximating new theories to old ones is going beyond the accepted limits of approximation.

The other major recent attacks on realism come from pessimistic metainduction (Larry Laudan) on one hand and new versions of empiricist arguments (Bas van Fraassen) on the other. Van Fraassen defines his position as constructive empiricism. Laudan relies on the history of science to claim that the realists' explanation of the successes of science does not hold. He argues that the success of theories cannot offer grounds for accepting that these theories are true (or even approximately true). He presents a list of theories that have been successful and yet are now acknowledged to be false. Hence, he concludes, depending on our previous experience with scientific revolutions, the only reasonable induction would be that it is highly probable that our current successful theories will turn out to be false. Van Fraassen claims that despite the success of theories in accounting for phenomena (their empirical adequacy), there can never be any grounds for believing any claims beyond those about what is observable. That is, we cannot say that such theories are real or that they represent nature; we can only claim that they can account for the observed phenomena.

Recent trends in realism tried to salvage realism from these attacks, but most of these trends depend on claiming that we do not need to save the old theory as a whole; we can save only the representative part. Structural realists, such as John Worrall and Elie Zahar, claim that only the mathematical structure need be saved and that CP is capable of assisting us in saving it. Philip Kitcher asserts that only presupposition posits can survive. Towfic Shomar claims that the dichotomy should be horizontal rather than vertical and that the only parts that would survive are the phenomenological models (phenomenological realism). Stathis Psillos claims that scientific theories can be divided into two parts, one consisting of the claims that contributed to successes in science (working postulates) and the other consisting of idle components.

Hans Radder, following Roy Bhaskar, thinks that progress in science is like a production line: There are inputs and outputs; hence our old knowledge of theories and observations is the input that dictates the output (our new theories). CP is important in the process; it is a good heuristic device, but it is not essential, and in many cases it does not work.

But is CP a necessary claim for all kinds of realism to account for developments in science? Some, including Shomar, do not think so. Nancy Cartwright accepts that theories are mere tools; she thinks that scientific theories are patchwork that helps in constructing models that represent different parts of nature. Some of these models depend on tools borrowed from quantum mechanics and account for phenomena related to the microscopic world; others use tools from classical mechanics and account for phenomena in the macroscopic world. There is no need to account for any connection between these models. Phenomenological realism, too, takes theories as merely tools to construct phenomenological models that are capable of representing nature. In that case, whether the fundamental theories correspond to each other to some extent or not is irrelevant. The correspondence of theories concerns realists who think that fundamental theories represent nature and approximate its blueprint.

Currently, theoretical physics is facing a deadlock; as Lee Smolin and Peter Woit have argued, the majority of theoretical physicists are running after the unification of all forces and laws of physics. They are after the theory of everything. They are convinced that science is converging toward a final theory that represents the truth about nature. They are in a way in agreement with the
284 Covariate
realists, who hold that successive theories of "mature science" approximate the truth more and more, so science should be in quest of the final theory of the final truth.

Theoretical representation might represent the truth about nature, but we can easily imagine that we have more than one theory to depend on. Nature is complex, and in light of the richness of nature, which is reflected in scientific practice, one may be unable to accept that Albert Einstein's request for simplicity and beauty can give the correct picture of current science when complexity and diversity appear to overshadow it. The complexity of physics forces some toward a total disagreement with Einstein's dream of finding a unified theory for everything. To some, such a dream directly contradicts the accepted theoretical representations of physics. Diversity and complexity are the main characteristics of such representations.

Nonetheless, CP is an important heuristic device that can help scientists arrive at new knowledge, but scientists and philosophers should be careful as to how much of CP they want to accept. As long as they understand and accept that there is more than one version of CP, and as long as they accept that not all new theories can, even in principle, revert to old theories at a certain point, then they might benefit from applying CP. One other remark of caution: Scientists and philosophers also need to accept that old theories might be wrong; the wrong mathematical form may have been picked, and if they continue to accept such a form, they will continue to uphold a false science.

Towfic Shomar

See also Frequency Distribution; Models; Paradigm; Positivism; Theory

Further Readings

Fadner, W. L. (1985). Theoretical support for the generalized correspondence principle. American Journal of Physics, 53, 829–838.
French, S., & Kamminga, H. (Eds.). (1993). Correspondence, invariance and heuristics: Essays in honour of Heinz Post. Dordrecht, the Netherlands: Kluwer Academic.
Hartmann, S. (2002). On correspondence. Studies in History & Philosophy of Modern Physics, 33B, 79–94.
Krajewski, W. (1977). Correspondence principle and growth in science. Dordrecht, the Netherlands: Reidel.
Liboff, R. (1975). Bohr's correspondence principle for large quantum numbers. Foundations of Physics, 5(2), 271–293.
Liboff, R. (1984). The correspondence principle revisited. Physics Today, February, 50–55.
Makowski, A. (2006). A brief survey of various formulations of the correspondence principle. European Journal of Physics, 27(5), 1133–1139.
Radder, H. (1991). Heuristics and the generalized correspondence principle. British Journal for the Philosophy of Science, 42, 195–226.
Shomar, T. (2001). Structural realism and the correspondence principle. Proceedings of the conference on Mulla Sadra and the world's contemporary philosophy. Kish, Iran: Mulla Sadra Institute.
Zahar, E. (1988). Einstein's revolution: A study in heuristics. LaSalle, IL: Open Court.

COVARIATE

Similar to an independent variable, a covariate is complementary to the dependent, or response, variable. A variable is a covariate if it is related to the dependent variable. According to this definition, any variable that is measurable and considered to have a statistical relationship with the dependent variable would qualify as a potential covariate. A covariate is thus a possible predictive or explanatory variable of the dependent variable. This may be the reason that in regression analyses, independent variables (i.e., the regressors) are sometimes called covariates. Used in this context, covariates are of primary interest. In most other circumstances, however, covariates are of no primary interest compared with the independent variables. They arise because the experimental or observational units are heterogeneous. When this occurs, their existence is mostly a nuisance because they may interact with the independent variables to obscure the true relationship between the dependent and the independent variables. It is in this circumstance that one needs to be aware of and make efforts to control the effect of covariates. Viewed in this context, covariates may be called by other names, such as concomitant variables, auxiliary variables, or secondary variables. This
entry discusses methods for controlling the effects of covariates and provides examples.

Controlling Effects of Covariates

Research Design

Although covariates are neither the design variable (i.e., the independent variable) nor the primary outcome (e.g., the dependent variable) in research, they are still explanatory variables that may be manipulated through experiment design so that their effect can be eliminated or minimized. Manipulation of covariates is particularly popular in controlled experiments. Many techniques can be used for this purpose. An example is to fix the covariates as constants across all experimental treatments so that their effects are exerted uniformly and can be canceled out. Another technique is randomization of experimental units when assigning them to the different experimental treatments. Key advantages of randomization are (a) to control for important known and unknown factors (the control for unknown factors is especially significant) so that all covariate effects are minimized and all experimental units are statistically comparable on the mean across treatments, (b) to reduce or eliminate both intentional and unintentional human biases during the experiment, and (c) to properly evaluate error effects on the experiment because of the sound probabilistic theory that underlies the randomization. Randomization can be done to all experimental units at once or done to experimental units within a block. Blocking is a technique used in experimental design to further reduce the variability in experimental conditions or experimental units. Experimental units are divided into groups called blocks, and within a group, experimental units (or conditions) are assumed to be homogeneous, although they differ between groups.

However ideal, there is no guarantee that randomization eliminates all covariate effects. Even if it could remove all covariate effects, randomization may not always be feasible due to various constraints in an experiment. In most circumstances … treatments. Under such circumstances, their value is often observed, together with the value of the dependent variables. The observation can be made either before, after, or during the experiment, depending on the nature of the covariates and their influence on the dependent variables. The value of a covariate may be measured prior to the administration of experimental treatments if the status of the covariate before entering into the experiment is important or if its value changes during the experiment. If the covariate is not affected by the experimental treatments, it may be measured after the experiment. The researcher, however, should be mindful that measuring a covariate after an experiment is done carries substantial risks unless there is strong evidence to support such an assumption. In the hypothetical nutrition study example given below, the initial height and weight of pupils are not covariates that can be measured after the experiment is carried out. The reason is that both height and weight are the response variables of the experiment, and they are influenced by the experimental treatments. In other circumstances, the value of the covariate is continuously monitored, along with the dependent variable, during an experiment. An example may be the yearly mean of ocean temperatures in a long-term study by R. J. Beamish and D. R. Bouillon of the relationship between the quotas of salmon fish harvested in the prior year and the number of salmon fish returned to the spawning grounds of the rivers the following year, as prior research has shown that ocean temperature changes bear considerable influence on the life of salmon fish.

Statistical Analysis

After the covariates are measured, a popular statistical procedure, the analysis of covariance (ANCOVA), is then used to analyze the effect of the design variables on the dependent variable by explicitly incorporating covariates into the analytical model. Assume that an experiment has n design variables and m covariates; a proper statistical model for the experiment would be
stances, covariates, by their nature, are not con-
trollable through experiment designs. They are X
m
therefore not manipulated and allowed to vary yij ¼ μ þ ti þ βk ðxkij xk:: Þ þ εij ð1Þ
naturally among experimental units across k¼1
286 Covariate
where y_ij is the jth measurement on the dependent variable (i.e., the primary outcome) in the ith treatment; μ is the overall mean; t_i is the ith design variable (often called treatment in experiment design); x_kij is the measurement on the kth covariate corresponding to y_ij; x̄_k·· is the mean of the x_kij values; β_k is a linear (partial) regression coefficient for the kth covariate, which expresses the relationship between x_kij and y_ij; and ε_ij is a random variable that follows a specified probability distribution with zero mean. An inspection of this ANCOVA model reveals that it is in fact an integration of an analysis of variance (ANOVA) model with a regression model; the regression part is on the covariates (recall that regressors are sometimes called covariates). A test of the hypothesis H0: β_k = 0 confirms or rejects the null hypothesis of no effect of the kth covariate on the response variable. If no covariate effects exist, Equation 1 reduces to an ordinary ANOVA model.

Before Equation 1 can be used, though, one needs to ensure that the assumption of homogeneity of regression slopes is met, that is, that the covariate slopes are equal across treatments (β_1 = β_2 = ⋯ = β_k). In practice, this is equivalent to finding no interaction between the covariates and the independent variables on the dependent variable. Without meeting this condition, tests on the adjusted means of the treatments are invalid. The reason is that when the slopes are different, the response of the treatments varies at the different levels of the covariates. Consequently, the adjusted means do not adequately describe the treatment effects, potentially resulting in misleading conclusions. If tests indeed confirm heterogeneity of slopes, alternative methods can be sought in lieu of ANCOVA, as described by Bradley Eugene Huitema. Research has repeatedly demonstrated that even randomized controlled trials can benefit from ANCOVA in uncovering the true relationship between the independent and the dependent variables.

It must be pointed out that regardless of how the covariates are handled in a study, their value, like that of the independent variables, is seldom analyzed separately because the covariate effect is of no primary interest. Instead, a detailed description of the covariates is often given to assist the reader in evaluating the results of a study. According to Fred Ramsey and Daniel Schafer, the incorporation of covariates in an ANCOVA statistical model serves (a) to control for potential confounding, (b) to improve comparison across treatments, (c) to assess model adequacy, and (d) to expand the scope of inference. The last point is supported by the fact that the experiment is conducted in a more realistic environment that allows a covariate to change naturally, instead of being fixed at a few artificial levels, which may or may not be representative of the true effect of the covariate on the dependent variable.

From what has been discussed so far, it is clear that there is no fixed rule on how covariates should be dealt with in a study. Experiment control can eliminate some obvious covariates, such as ethnicity, gender, and age in research on human subjects, but may not be feasible in all circumstances. Statistical control is convenient, but covariates need to be measured and built into a model. Some covariates may not be observable, and only so many covariates can be accommodated in a model. Omission of a key covariate could lead to severely biased results. Circumstances, therefore, dictate whether to control the effect of covariates through experiment design measures or to account for their effect in the data analysis step. A combination of experimental and statistical control is often required. Regardless of what approach is taken, the ultimate goal of controlling the covariate effect is to reduce the experimental error so that the treatment effect of primary interest can be elucidated without the interference of covariates.

Examples

To illustrate the difference between covariates and independent variables, consider an agricultural experiment on the productivity of two wheat varieties under two fertilization regimes in field conditions, where productivity is measured by tons of wheat grain produced per season per hectare (1 hectare = 2.47 acres). Although researchers can precisely control both the wheat varieties and the fertilization regimes as the primary independent variables of interest in this specific study, they are left to contend with the heterogeneity of soil fertility, that is, the natural micro variation in soil texture, soil structure, soil nutrition, soil water supply, soil aeration, and so on. These variables are natural phenomena that are beyond the control
of the researchers. They are the covariates. By themselves alone, these covariates can influence the productivity of the wheat varieties. Left unaccounted for, these factors will severely distort the experimental results in terms of wheat produced. The good news is that all these factors can be measured accurately with modern scientific instruments or methodologies. With their values known, these variables can be incorporated into an ANCOVA model to account for their effects on the experimental results.

In health research, suppose that investigators are interested in the effect of nutrition on the physical development of elementary school children between 6 and 12 years of age. More specifically, the researchers are interested in the effect of a particular commercial dietary regime that is highly promoted in television commercials. Here, physical development is measured by both height and weight increments, without the implication of being obese. In pursuing their interest, the researchers choose a local boarding school with a reputation for excellent, healthy nutritional programs. In this study, they want to compare this commercial dietary regime with the more traditional diets that have been offered by the school system for many years. They have done all they can to control all foreseeable potential confounding variables that might interfere with the results on the two dietary regimes. However, they are incapable of controlling the initial height and weight of the children in the study population. Randomized assignment of participating pupils to the treatment groups may help minimize the potential damage that the natural variation in initial height and weight may cause to the validity of the study, but the researchers are not entirely sure that randomization is the right answer to the problem. These initial heights and weights are the covariates, which must be measured before (but not after) the study and accounted for in an ANCOVA model in order for the results on dietary effects to be properly interpreted.

Final Thoughts

Covariates are explanatory variables that exist naturally within research units. What differentiates them from independent variables is that they are of no primary interest in an investigation but are nuisances that must be dealt with. Various control measures are placed on them at either the experiment design or the data analysis step to minimize the experimental error so that the treatment effects on the major outcome can be better understood. Without these measures, misleading conclusions may result, particularly when major covariates are not properly dealt with.

Regardless of how one decides to deal with the covariate effects, either by experimental control or by data analysis techniques, one must be careful not to allow the covariate to be affected by the treatments in a study. Otherwise, the covariate may interact with the treatments, making a full accounting of the covariate effect difficult or impossible.

Shihe Fan

See also Analysis of Covariance (ANCOVA); Independent Variable; Randomization Tests

Further Readings

Beach, M. L., & Meier, P. (1989). Choosing covariates in the analysis of clinical trials. Controlled Clinical Trials, 10, S161–S175.
Beamish, R. J., & Bouillon, D. R. (1993). Pacific salmon production trends in relation to climate. Canadian Journal of Fisheries and Aquatic Sciences, 50, 1002–1016.
Cox, D. R., & McCullagh, P. (1982). Some aspects of analysis of covariance. Biometrics, 38, 1–17.
Huitema, B. E. (1980). The analysis of covariance and alternatives. Toronto, Canada: Wiley.
Montgomery, D. C. (2001). Design and analysis of experiments (5th ed.). Toronto, Canada: Wiley.
Ramsey, F. L., & Schafer, D. W. (2002). The statistical sleuth: A course in methods of data analysis (2nd ed.). Pacific Grove, CA: Duxbury.
Zhang, M., Tsiatis, A. A., & Davidian, M. (2008). Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics, 64, 707–715.

C PARAMETER

See Guessing Parameter
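The covariate adjustment that Equation 1 of the Covariate entry performs can be sketched numerically for the simplest case of one covariate: a pooled within-treatment slope β is estimated, and each treatment mean of the outcome is shifted to the overall covariate mean. The following Python sketch uses invented data (two treatments, one covariate) and shows only the adjustment step, not the full ANCOVA inference.

```python
# Minimal one-covariate ANCOVA adjustment (invented data, illustration only).
# groups maps treatment -> list of (x, y) pairs: x = covariate, y = outcome.
groups = {
    "t1": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
    "t2": [(3.0, 8.0), (4.0, 10.0), (5.0, 12.0)],
}

def mean(values):
    return sum(values) / len(values)

# Pooled within-treatment slope: beta = sum_g Sxy_g / sum_g Sxx_g.
sxy = sxx = 0.0
for pairs in groups.values():
    xbar = mean([x for x, _ in pairs])
    ybar = mean([y for _, y in pairs])
    sxy += sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx += sum((x - xbar) ** 2 for x, _ in pairs)
beta = sxy / sxx

grand_x = mean([x for pairs in groups.values() for x, _ in pairs])

# Adjusted treatment means: ybar_g - beta * (xbar_g - grand_x).
adjusted = {}
for g, pairs in groups.items():
    xbar = mean([x for x, _ in pairs])
    ybar = mean([y for _, y in pairs])
    adjusted[g] = ybar - beta * (xbar - grand_x)
```

With these numbers the raw treatment means differ by 5.0, but the groups also differ on the covariate; after adjustment (β = 2, adjusted means 7.0 and 8.0) the treatment difference shrinks to 1.0, which is the kind of correction the entry describes.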
288 Criterion Problem
collection of criterion data, the researcher must then gather the appropriate performance data from the most appropriate source (e.g., archival data, observation, interviews, survey or questionnaire responses). Following these broad steps enables the researcher to address the criterion problem and develop a criterion variable much more closely related to the actual behaviors of interest.

Implications for Selection Research

The criterion problem has serious implications for research into the selection of students or employees. If predictor variables are chosen on the basis of their relations with the easily obtainable criterion measure, then relying solely on that easily available criterion variable means that the selection system will select students or employees on the basis of how well they will perform on the behaviors captured in the easily measured criterion variable, but not on the other important behaviors. In other words, if the criterion is deficient, then there is a very good chance that the set of predictors in the selection system will also be deficient. Leaetta Hough and Frederick Oswald have outlined this problem in the employment domain with respect to personality variables. They showed that if careful consideration is not paid to assessing the performance domains of interest (i.e., the criterion ends up being deficient), then important predictors can be omitted from the selection system.

The criterion problem has even more severe implications for selection research in academic settings. The vast majority of the validation work has been done using GPA as the ultimate criterion of interest. Because GPA does not capture many of the inter- and intrapersonal nonintellectual performance factors that are considered important for college students, the selection systems for college admissions are also likely deficient. Although it is certainly true that admissions committees use the available information (e.g., letters of recommendation, personal statements) to try to predict these other performance factors that GPA does not assess, the extent to which the predictors relate to these other performance factors is relatively unknown.

What is clear is that unless attention is paid to the criterion problem, the resulting selection system will likely be problematic as well. When the measured criterion is deficient, the selection system will miss important predictors unless the unmeasured performance factors have the exact same determinants as the performance factors captured in the measured criterion variable. When the measured criterion is contaminated, relationships between the predictors and the criterion variable will be attenuated, as the predictors are unrelated to the contaminating factor; when a researcher must choose a small number of predictors to be included in the selection system, criterion contamination can lead to useful predictors erroneously being discarded from the final set of predictors. Finally, when a criterion variable is simply available without a clear understanding of what was measured, it is extremely difficult to choose a set of predictors.

Matthew J. Borneman

See also Construct Validity; Criterion Validity; Criterion Variable; Dependent Variable; Selection; Threats to Validity; Validity of Research Conclusions

Further Readings

Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variability. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 258–299). San Francisco: Jossey-Bass.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection (pp. 35–70). San Francisco: Jossey-Bass.
Hough, L. M., & Oswald, F. L. (2008). Personality testing and industrial-organizational psychology: Reflections, progress, and prospects. Industrial & Organizational Psychology: Perspectives on Science and Practice, 1, 272–290.
Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89, 187–207.
Criterion Validity 291
Viswesvaran, C., & Ones, D. S. (2000). Perspectives on models of job performance. International Journal of Selection & Assessment, 8, 216–226.
Taber, T. D., & Hackman, J. D. (1976). Dimensions of undergraduate college performance. Journal of Applied Psychology, 61, 546–558.

CRITERION VALIDITY

Also known as criterion-related validity, or sometimes predictive or concurrent validity, criterion validity is the general term to describe how well scores on one measure (i.e., a predictor) predict scores on another measure of interest (i.e., the criterion). In other words, a particular criterion or outcome measure is of interest to the researcher; examples could include (but are not limited to) ratings of job performance, grade point average (GPA) in school, a voting outcome, or a medical diagnosis. Criterion validity, then, refers to the strength of the relationship between measures intended to predict the ultimate criterion of interest and the criterion measure itself. In academic settings, for example, the criterion of interest may be GPA, and the predictor being studied is the score on a standardized math test. Criterion validity, in this context, would be the strength of the relationship (e.g., the correlation coefficient) between the scores on the standardized math test and GPA.

Some care regarding the use of the term criterion validity needs to be employed. Typically, the term is applied to predictors, rather than criteria; researchers often refer to the "criterion validity" of a specific predictor. However, this is not meant to imply that there is only one "criterion validity" estimate for each predictor. Rather, each predictor can have different "criterion validity" estimates for many different criteria. Extending the above example, the standardized math test may have one criterion validity estimate for overall GPA, a higher criterion validity estimate for science ability, and a lower criterion validity estimate for artistic appreciation; all three are valid criteria of interest. Additionally, each of these estimates may be moderated by (i.e., have different criterion validity estimates for) situational, sample, or research design characteristics. In this entry the criterion, research designs that assess criterion validity, effect sizes, and concerns that may arise in applied selection are discussed.

Nature of the Criterion

Again, the term criterion validity typically refers to a specific predictor measure, often with the criterion measure assumed. Unfortunately, this introduces substantial confusion into the procedure of criterion validation. Certainly, a single predictor measure can predict an extremely wide range of criteria, as Christopher Brand has shown with general intelligence, for example. Using the same example, the criterion validity estimates for general intelligence vary quite a bit; general intelligence predicts some criteria better than others. This fact further illustrates that there is no single criterion validity estimate for a single predictor. Additionally, the relationship between one predictor measure and one criterion variable can vary depending on other variables (i.e., moderator variables), such as situational characteristics, attributes of the sample, and particularities of the research design. Issues here are highly related to the criterion problem in predictive validation studies.

Research Design

There are four broad research designs to assess the criterion validity of a specific predictor: predictive validation, quasi-predictive validation, concurrent validation, and postdictive validation. Each of these is discussed in turn.

Predictive Validation

When examining the criterion validity of a specific predictor, the researcher is often interested in selecting persons based on their scores on a predictor (or set of predictor measures) that will predict how well the people will perform on the criterion measure. In a true predictive validation design, the predictor measure or measures are administered to a set of applicants, and the researchers select applicants completely randomly (i.e., without regard to their scores on the predictor measure or measures). The correlation between the predictor measure(s) and the criterion of interest is the index of criterion validity. This design has the advantage of being free
from the effects of range restriction; however, it is an expensive design, and unfeasible in many situations, as stakeholders are often unwilling to forgo selecting on potentially useful predictor variables.

Quasi-Predictive Validation

Like a true predictive validation design, in a quasi-predictive design the researcher is interested in administering a predictor (or set of predictors) to the applicants in order to predict their scores on a criterion variable of interest. Unlike a true predictive design, in a quasi-predictive validation design the researcher will select applicants based on their scores on the predictor(s). As before, the correlation between the predictor(s) and the criterion of interest is the index of criterion validity. However, in a quasi-predictive design, the correlation between the predictor and criterion will likely be smaller because of range restriction due to selection on the predictor variables. Certainly, if the researcher has a choice between a predictive and a quasi-predictive design, the predictive design would be preferred because it provides a more accurate estimate of the criterion validity of the predictor(s); however, quasi-predictive designs are far more common. Although quasi-predictive designs typically suffer from range restriction problems, they have the advantage of allowing the predictors to be used for selection purposes while researchers obtain criterion validity estimates.

Concurrent Validation

In a concurrent validation design, the predictor(s) of interest to the researcher are not administered to a set of applicants; rather, they are administered only to the incumbents, or people who have already been selected. The correlation between the scores on the predictors and the criterion measures for the incumbents serves as the criterion validity estimate for that predictor or set of predictors. This design has several advantages, including cost savings due to administering the predictors to fewer people and reduced time to collection of the criterion data. However, there are also some disadvantages, including the fact that criterion validity estimates are likely to be smaller as a result of range restriction (except in the rare situation when the manner in which the incumbents were selected is completely unrelated to scores on the predictor or predictors).

Another potential concern regarding concurrent validation designs is the motivation of test takers. This is a major concern for noncognitive assessments, such as personality tests, survey data, and background information. Collecting data on these types of assessments in a concurrent validation design provides an estimate of the maximum criterion validity for a given assessment. This is because incumbents, who are not motivated to alter their scores in order to be selected, are assumed to be answering honestly. However, there is some concern about intentional distortion in motivated testing sessions (i.e., when applying for a job or admittance to school), which can affect criterion validity estimates. As such, one must take care when interpreting criterion validity estimates in this type of design. If estimates under operational selection settings are of interest (i.e., when there is some motivation for distortion), then criterion validity estimates from a predictive or quasi-predictive design are of interest; however, if estimates of maximal criterion validity for the predictor(s) are of interest, then a concurrent design is appropriate.

Postdictive Validation

Postdictive validation is an infrequently used design to assess criterion validity. At its most basic, postdictive validation assesses the criterion variable first and then subsequently assesses the predictor variable(s). Typically, this validation design is not employed because the predictor variable(s), by definition, come temporally before the criterion variable is assessed. However, a postdictive validation design can be especially useful, if not the only alternative, when the criterion variable is rare or unethical to obtain. Examples might include criminal activity, abuse, or medical outcomes. In rare criterion instances, it is nearly impossible to know when the outcome will occur; as such, the predictors are collected after the fact to help predict who is at risk for the particular criterion variable. In other instances, when it is extremely unethical to collect data on the criterion of interest (e.g., abuse), predictor variables are collected after the fact in order to determine who might be at risk for those criterion variables. Regardless of the
reason for the postdictive design, people who met or were assessed on the criterion variable are matched with other people who were not, typically on demographic and/or other variables. The relationship between the predictor measures and the criterion variable assessed for the two groups serves as the estimate of criterion validity.

Effect Sizes

Any discussion of criterion validity necessarily involves a discussion of effect sizes; the results of a statistical significance test are inappropriate to establish criterion validity. The question of interest in criterion validity is, To what degree are the predictor and criterion related? or How well does the measure predict scores on the criterion variable? instead of, Are the predictor and criterion related? Effect sizes address the former questions, while significance testing addresses the latter. As such, effect sizes are necessary to quantify how well the predictor and criterion are related and to provide a way to compare the criterion validity of several different predictors.

The specific effect size to be used is dependent on the research context and types of data being collected. These can include (but are not limited to) odds ratios, correlations, and standardized mean differences. For the purposes of explanation, it is assumed that there is a continuous predictor and a continuous criterion variable, making the correlation coefficient the appropriate measure of effect size. In this case, the correlation between a given predictor and a specific criterion serves as the estimate of criterion validity. Working in the effect size metric has the added benefit of permitting comparisons of criterion validity estimates for several predictors. Assuming that two predictors were collected under similar research designs and conditions and are correlated with the same criterion variable, then the predictor with the higher correlation with the criterion can be said to have greater criterion validity than the other predictor (for that particular criterion and research context). If a criterion variable measures different behaviors or was collected under different research contexts (e.g., a testing situation prone to motivated distortion vs. one without such motivation), then criterion validity estimates are not directly comparable.

Statistical Artifacts

Unfortunately, several statistical artifacts can have dramatic effects on criterion validity estimates, with two of the most common being measurement error and range restriction. Both of these (in most applications) serve to lower the observed relationships from their true values. These effects are increasingly important when one is comparing the criterion validity of multiple predictors.

Range Restriction

Range restriction occurs when there is some mechanism that makes it more likely for people with higher scores on a variable to be selected than people with lower scores. This is common in academic or employee selection, as the scores on the administered predictors (or variables related to those predictors) form the basis of who is admitted or hired. Range restriction is common in quasi-predictive designs (because predictor scores are used to select or admit people) and concurrent designs (because people are selected in a way that is related to the predictor variables of interest in the study). For example, suppose people are hired into an organization on the basis of their interview scores. The researcher administers another potential predictor of the focal criterion in a concurrent validation design. If the scores on this new predictor are correlated with scores on the interview, then range restriction will occur. True predictive validation designs are free from range restriction because either no selection occurs or selection occurs in a way uncorrelated with the predictors. In postdictive validation designs, any potential range restriction is typically controlled for in the matching scenario.

Range restriction becomes particularly problematic when the researcher is interested in comparing criterion validity estimates. This is because observed criterion validity estimates for different predictors can be differentially decreased because of range restriction. Suppose that two predictors that truly have equal criterion validity were administered to a set of applicants for a position. Because of the nature of the way they were selected, suppose that for Predictor A, 90% of the variability in predictor scores remained after people were selected, but only 50% of the variability remained for Predictor B after selection. Because
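The differential attenuation just described can be reproduced in a small simulation. In the Python sketch below (simulated scores, not data from any real study), a predictor truly correlates about .5 with the criterion; keeping only the top scorers on the predictor shrinks the observed correlation, and harsher selection shrinks it more.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def restricted_r(xs, ys, keep_frac):
    """Observed correlation after direct selection on x (top keep_frac kept)."""
    cutoff = int(len(xs) * (1 - keep_frac))
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    kept = order[cutoff:]
    return pearson([xs[i] for i in kept], [ys[i] for i in kept])

rng = random.Random(7)
n = 20000
rho = 0.5  # true predictor-criterion correlation
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]

r_full = pearson(xs, ys)              # close to .5 in the full applicant pool
r_mild = restricted_r(xs, ys, 0.9)    # mild selection: slight attenuation
r_severe = restricted_r(xs, ys, 0.5)  # severe selection: strong attenuation
```

Two predictors with identical true validity would therefore show different observed validities if, as in the Predictor A and Predictor B example above, selection preserved 90% of the variability of one but only 50% of the variability of the other.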
a criterion variable given scores on a set of p predictor variables.

Predictive Bias

A unique situation arises in applied selection situations because of federal guidelines requiring criterion validity evidence for predictors that show adverse impact between protected groups. Protected groups include (but are not limited to) ethnicity, gender, and age. Adverse impact arises when applicants from one protected group (e.g., males) are selected at a higher rate than members of another protected group (e.g., females). Oftentimes, adverse impact arises because of substantial group differences on the predictor on which applicants are being selected. In these instances, the focal predictor must be shown to exhibit criterion validity across all people being selected. However, it is also useful to examine predictive bias.

For the sake of simplicity, predictive bias will be explicated here only in the case of a single predictor, though the concepts can certainly be extended to the case of multiple predictors. In order to examine the predictive bias of a criterion validity estimate for a specific predictor, it is assumed that the variable on which bias is assessed is categorical; examples would include gender or ethnicity. The appropriate equation would be

    y_i = b_0 + b_1 x_1i + b_2 x_2i + b_3 (x_1i × x_2i),        (3)

where x_1i and x_2i are the scores on the continuous predictor variable and the categorical demographic variable, respectively, for person i; b_1 is the regression coefficient for the continuous predictor; b_2 is the regression coefficient for the categorical predictor; b_3 is the regression coefficient for the interaction term; and other terms are defined as earlier. Equation 3 has substantial implications for bias in criterion validity estimates. Assuming the reference group for the categorical variable (e.g., males) is coded as 0 and the focal group (e.g., females) is coded as 1, the b_0 coefficient gives the intercept for the reference group, and the b_1 coefficient gives the regression slope for the reference group. These two coefficients form the baseline of criterion validity evidence for a given predictor. The b_2 and b_3 coefficients give estimates of how the intercept and slope estimates, respectively, change for the focal group.

The b_2 and b_3 coefficients have strong implications for bias in criterion validity estimates. If the b_3 coefficient is large and positive (negative), then the slope differences (and criterion validity estimates) are substantially larger (smaller) for the focal group. However, if the b_3 coefficient is near zero, then the criterion validity estimates are approximately equal for the focal and reference groups. The magnitude of the b_2 coefficient determines (along with the magnitude of the b_3 coefficient) whether the criterion scores are over- or underestimated for the focal or reference groups, depending on their scores on the predictor variable. It is generally accepted that for predictor variables with similar levels of criterion validity, those exhibiting less predictive bias should be preferred over those exhibiting more predictive bias. However, there is some room for tradeoffs between criterion validity and predictive bias.

Matthew J. Borneman

See also Concurrent Validity; Correction for Attenuation; Criterion Problem; Predictive Validity; Restriction of Range; Selection; Validity of Measurement

Further Readings

Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
Brand, C. (1987). The importance of general intelligence. In S. Modgil & C. Modgil (Eds.), Arthur Jensen: Consensus and controversy (pp. 251–265). Philadelphia: Falmer Press.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman.
Kuncel, N. R., & Hezlett, S. A. (2007). Standardized tests predict graduate students' success. Science, 315, 1080–1081.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education: Prospects in
296 Criterion Variable
a post-affirmative action world. American effectively as the research design more readily per-
Psychologist, 56, 302–318. mits the adjustment of certain explanatory vari-
Schmidt, F. L., & Hunter, J. E. (1998). The validity and ables in isolation from the others, allowing
utility of selection methods in personnel psychology: a clearer judgment to be made about the nature of
Practical and theoretical implications of 85 years of
the relationship between the response and predic-
research findings. Psychological Bulletin, 124,
262–274.
tors. This entry’s focus is on types of criterion vari-
ables and analysis involving criterion variables.
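The moderated regression in Equation 3 of the Criterion Validity entry above lends itself to a short sketch. Because the model contains the group indicator and its interaction with the predictor, pooled least squares is equivalent to fitting a simple regression separately within each group, so b2 and b3 can be read off as the focal group's differences in intercept and slope. All data and function names below are invented for illustration:

```python
# Hedged sketch of Equation 3 (names and data are invented):
#   y_i = b0 + b1*x1_i + b2*x2_i + b3*(x1_i * x2_i)
# with x2 a 0/1 group indicator. Because the indicator and its
# interaction are both in the model, pooled least squares equals
# fitting a simple regression separately within each group.

def simple_ols(x, y):
    """Least-squares intercept and slope for one group."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def equation3_coefficients(x1, x2, y):
    """Return (b0, b1, b2, b3): reference-group intercept and slope,
    plus the focal group's differences in intercept and slope."""
    ref = [(a, c) for a, g, c in zip(x1, x2, y) if g == 0]
    foc = [(a, c) for a, g, c in zip(x1, x2, y) if g == 1]
    b0, b1 = simple_ols([a for a, _ in ref], [c for _, c in ref])
    a0, a1 = simple_ols([a for a, _ in foc], [c for _, c in foc])
    return b0, b1, a0 - b0, a1 - b1

# Invented data: reference group (coded 0) follows y = 2*x1,
# focal group (coded 1) follows y = 1*x1.
x1 = [1, 2, 3, 4, 1, 2, 3, 4]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2, 4, 6, 8, 1, 2, 3, 4]
print(equation3_coefficients(x1, x2, y))  # (0.0, 2.0, 0.0, -1.0)
```

With these hypothetical data, b3 = −1.0 signals slope bias: criterion scores rise more slowly with the predictor in the focal group, so a single pooled regression line would over- or underpredict depending on group membership, exactly the pattern the entry describes.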
The criterion variable is assumed to arise from an exponential family distribution, the type of which leads to a canonical link function, the function for which XᵀY is a sufficient statistic for β, the vector of regression coefficients. Common examples of exponential family distributions with their corresponding canonical link function include the normal (identity link), Poisson (log link), binomial and multinomial (logit link), and exponential and gamma (inverse link) distributions. Generalized linear models allow for very flexible modeling of the criterion variable while retaining most of the advantages of simpler parametric models (compact models, easy prediction). Nonparametric models for the criterion variable include methods such as regression trees, projection pursuit, and neural nets. These methods allow for flexible models that do not rely on strict parametric assumptions, although using such models for prediction can prove challenging.

Outside the regression context, in which the goal is to model the value of a criterion variable given the values of the explanatory variables, other types of analysis in which criterion variables play a key role include discriminant analysis, wherein the values of the predictor (input) variables are used to assign realizations of the criterion variable into a set of predefined classes based on the values predicted for a set of linear functions of the predictors called discriminant functions, through a model fit via data (called the training set) for which the correct classes are known.

In canonical correlation analysis, there may be several criterion variables and several independent variables, and the goal of the analysis is to reduce the effective dimension of the data while retaining as much of the dependence structure in the data as possible. To this end, linear combinations of the criterion variables and of the independent variables are chosen to maximize the correlation between the two linear combinations. This process is then repeated with new linear combinations as long as there remains significant correlation between the respective linear combinations of criterion and independent variables. This process resembles principal components analysis, the difference being that correlation between sets of independent variables and sets of criterion variables is used as the means of choosing relevant linear combinations rather than the breakdown of variation used in principal components analysis.

Finally, in all the modeling contexts in which criterion variables are used, there exists an asymmetry in the way in which criterion variables are considered compared with the independent variables, even in observational studies, in which both sets of variables are observed or measured as opposed to fixed, as in a designed experiment. Fitting methods and measures of fit used in these contexts are therefore designed with this asymmetry in mind. For example, least squares and misclassification rates are based on deviations of realizations of the criterion variable from the predicted values from the fitted model.

Michael A. Martin and Steven Roberts

See also Canonical Correlation Analysis; Covariate; Dependent Variable; Discriminant Analysis

Further Readings

Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Friedman, J. H., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817–823.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. New York: Chapman & Hall.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.

CRITICAL DIFFERENCE

Critical differences can be thought of as critical regions for a priori and post hoc comparisons of pairs of means and of linear combinations of means. Critical differences can be transformed into confidence intervals. First, this entry discusses critical differences in the context of multiple comparison tests for means. Second, this entry addresses confusion surrounding applying critical differences for statistical significance and for the special case of consequential or practical significance.
[Table 1 — critical difference formulas; only the sample-size terms ni, nj, and MIN(ni, nj) survive extraction]
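Critical differences of this kind generally pair a critical value with a standard error built from an error variance and the group sample sizes ni and nj. A minimal sketch of one standard member of the family, Fisher's least significant difference (offered as an illustration, not a reproduction of Table 1; the t critical value is supplied by the caller rather than computed):

```python
import math

# Hedged sketch (not taken from Table 1): Fisher's least significant
# difference (LSD), one common critical difference:
#   CD = t_crit * sqrt(MSE * (1/n_i + 1/n_j))
# Two means are declared different when |mean_i - mean_j| > CD.
# t_crit is supplied by the caller (e.g., from a t table).

def lsd_critical_difference(t_crit, mse, n_i, n_j):
    return t_crit * math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))

def consequential_critical_difference(t_crit, mse, n_i, n_j, c):
    # For the one-sided consequential hypotheses H0: mu1 - mu2 - c = 0
    # versus H1: mu1 - mu2 - c > 0, shifting the null by c simply adds
    # the constant c to the ordinary critical difference.
    return c + lsd_critical_difference(t_crit, mse, n_i, n_j)

cd = lsd_critical_difference(t_crit=2.0, mse=4.0, n_i=10, n_j=10)
print(round(cd, 3))  # 1.789
```

A difference in sample means exceeding 1.789 would be declared statistically discernible here; requiring it to exceed c + 1.789 instead operationalizes the adjustment by constants for consequential significance discussed in this entry.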
2. Neyman–Pearson: This school insists on an alternative hypothesis:

H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0

Ronald Fisher was the first to address the matter of hypothesis testing. He defined statistical significance to address the need for discerning between two treatments. His school emphasizes determining whether one treatment mean is greater than the other, without stressing the magnitude of the difference. This approach is in keeping with the usual critical differences as illustrated in Table 1. They pronounce whether any difference between the means is discernible and whether the ordering is statistically reliable.

J. Neyman would say that the alternative hypothesis, H1, should represent the consequential scenario. For example, suppose that if the mean of Treatment 1 exceeds the mean of Treatment 2 by at least some amount c, then changing from Treatment 2 to Treatment 1 is consequential. The hypotheses might take the following form:

H0: μ1 − μ2 − c = 0
H1: μ1 − μ2 − c > 0,

where c is the consequential difference. In the application and instruction of statistics, the two classical schools are blended, which sows even more confusion.

To detect consequential significance, the final step consists of adjusting the multiple comparison test by adding or multiplying by constants, which might be derived from economic or scientific calculations.

CRITICAL THEORY

postmodernism, and poststructuralism and also to show, through the evidence bases of literature and reflective experiences, how critical theories can be used by the researcher within different parts of research design.
consequences of these changes for social organization and consumer culture?

A very contemporary application of Marxism has been provided by Mike Cole, who applies Marxist ideas to education by using the example of Venezuela and Hugo Chavez, who opposes capitalism and imperialism. In education, Cole highlights, Chavez has hoped to open 38 new state universities with 190 satellite classrooms throughout Venezuela by 2009. Social projects such as housing are linked to this policy. Communal councils have been created whereby the local population meets to decide on local policies and how to implement them, rather than relying on bourgeois administrative machinery. Chavez is not only talking about democratic socialism but applying it to government policy. Therefore, one can apply Marxist critique politically in different parts of the world. Chavez's policies are a reaction to capitalism and the colonial legacy and have the objective of moving Venezuela in a more socialist direction. Application and interpretation are the keys when applying critical theory within research design. Cole applies Marxist ideas to the example of Venezuela and provides evidence to interpret the events that are taking place. Critical theory can be effective when it is applied in contemporary contexts.

Critical Race Theory

The critical race theory (CRT) movement is a collection of activists and scholars interested in studying and transforming the relationship among race, racism, and power. Although CRT began as a movement in the United States within the subject of law, it has rapidly spread beyond that discipline. Today, academics in the social and behavioral sciences, including the field of education, consider themselves critical race theorists who use CRT ideas to, according to Richard Delgardo and Jean Stefancic, understand issues of school discipline, controversies, tracking, and IQ and achievement testing. CRT tries not only to understand our social situation but to change it. The focus of CRT is racism and how it is socially and culturally constructed. CRT goes beyond a conceptual focus of multiculturalism, which examines equal opportunities, equity, and cultural diversity. Barry Troyna carried out education research in the United Kingdom during the 1970s and 1980s with an antiracist conceptual analysis, arguing that multiculturalism did not focus on antiracist practice enough. CRT has that central focus and goes further in relation to how minority groups are racialized and colored voices silenced. CRT offers everyone a voice to explore, examine, debate, and increase understandings of racism. David Gillborn is an advocate of applying CRT within research, explaining that the focus of CRT is an understanding that the status of Black and other minority groups is always conditional on Whites. As Gillborn highlights, to many Whites, such an analysis might seem outrageous, but its perceptiveness was revealed in dramatic fashion in July 2005, with the terrorist attacks on London and Madrid. Gillborn underlines that this was the clearest demonstration of the conditional status of people of color in contemporary England. So power conditions are highlighted by CRT. CRT can be applied within research design in many subject areas and disciplines. That notion of power and how it is created, reinforced, and controlled is an important theme within critical theory.

Postmodernism

Postmodernism also examines power relations and how power is made and reinforced. To understand what postmodernism is, we have to understand what modernism was. Peter Barry explains that modernism is the name given to the movement that dominated arts and culture in the first half of the 20th century. Practice in music, literature, and architecture was challenged. This movement of redefining modernism as postmodernist can be seen in the changing characteristics of literary modernism. Barry provides a position that concerns literary modernism and the move to postmodernist forms of literature. It can be applied more broadly to research design and the application of critical theory. The move away from grand-narrative social and cultural theories can be seen in the above as philosophers and cultural commentators moved away from "objective" positions and began to examine multiple points of view and diverse moral positions. Postmodernists are skeptical about overall answers to questions that allow no space for debate. Reflexivity allows the researcher to reflect on his or her own identity or identities within a given profession. The idea of moving beyond a simplistic mirror image or "Dear Diary"
approach to a reflective method with the application of different evidence bases of literature reviews to personal experiences shows how status, roles, and power can be critically analyzed. The plurality of roles that occurs is also a postmodern development with the critical questioning of the world and the multiple identities that globalization gives the individual. In relation to research design, a postmodern approach gives the researcher more possibilities in attempting to increase understandings of a research area, question, or hypothesis. Fragmented forms, discontinuous narrative, and the random nature of material can give the researcher more areas or issues to examine. This can be problematic as research focus is an important issue in the research process and it is vital for the researcher to stay focused. That last line would immediately be questioned by a postmodernist because the position is one of constant questioning and potential change. The very nature of research and research design could be questioned by the postmodernist. The issue here is the creation and development of ideas, as it is for all researchers. Jean-François Lyotard believed that the researcher and intellectual should resist the grand ideas and narratives that had become, in his opinion, outdated. Applying that directly to research design, a modernist argument would be that the research process should consist of an introduction with a research question or hypothesis; literature review; method and methodology; data collection, presentation, and analysis; and a conclusion. If one were being critical of that provisional research design structure, one could suggest that research questions (plural) should be asked; literature reviews (subject specific, general, theoretical, conceptual, method, data) should be carried out, and all literature should be criticized; positivist research paradigms should be dropped in favor of more reflective, action research projects; and other questions should be answered rather than the focal question or hypothesis posed at the beginning of the research project. Research design itself would be questioned because that is the very nature of postmodernist thought: the continuing critique of the subject under examination.

Poststructuralism

The issue of power and knowledge creation is also examined within poststructuralism in the sense that this form of critical theory aims to deconstruct the grand narratives and structural theoretical frameworks. Poststructuralism also attempts to increase understandings of language and the ways knowledge and power are used and evolve to shape how we view structures (e.g., the institution in which we work and how it works) and why we and others accept how these structures work. Poststructuralism is critical of these processes and attempts to analyze new and alternative meanings. In relation to research design, it is not only how we apply critical theory to poststructuralist contexts but also how we attempt to read the theory and theorists. Michel Foucault is a poststructuralist, and his works are useful to read in association with issues of research design. Poststructural ideas can be used to examine the meanings of different words and how different people hold different views or meanings of those words. The plurality of poststructuralist debate has been criticized because considering different or all arguments is only relative when an absolute decision has to be taken. However, it is the question of language and meaning in relation to power and knowledge that offers the researcher a different angle within research design.

Application Within the Social and Behavioral Sciences

This final section highlights how critical theory, be it Marxism, CRT, postmodernism, or poststructuralism, can be applied within the social and behavioral sciences. It is not only the word application that needs to be focused on but interpretation and one's interpretations in relation to one's research question or hypothesis. Researchers reading the primary sources of Marx, Foucault, or Lyotard and applying them to a research design is all very well, but interpreting one's own contextual meanings to methodology from the literature reviews and then applying and interpreting again to data analysis seems to be more difficult. Two different meanings could materialize here, which is due to the fact that space and time need to be given within research design for reading and rereading critical theories to increase understandings of what is being researched. Critical theories can be described as windows of opportunity in exploring research processes in the social and behavioral sciences. They
can be used as a tool within research design to inform, examine, and ultimately test a research question and hypothesis. Critical theoretical frameworks can be used within research introductions, literature reviews, and method and methodology processes. They can have a role to play in data analysis and research conclusions or recommendations at the end of a research project. It is how the researcher uses, applies, and interprets critical theory within research design that is the key issue.

Richard Race

See also Literature Review; Methods Section; Research Design Principles; Research Question; Theory

Further Readings

Delgardo, R., & Stefancic, J. (2001). Critical race theory: An introduction. New York: New York University Press.
Foucault, M. (1991). Discipline and punish: The birth of the prison. London: Penguin Books.
Foucault, M. (2002). Archaeology of knowledge. London: Routledge.
Lyotard, J.-F. (1984). The postmodern condition: A report on knowledge. Minneapolis: University of Minnesota Press.
Malpas, S., & Wake, P. (Eds.). (2006). The Routledge companion to critical theory. London: Routledge.
Troyna, B. (1993). Racism and education. Buckingham, UK: Open University Press.

CRITICAL THINKING

Critical thinking evaluates the validity of propositions. It is the hallmark and the cornerstone of science because science is a community that aims to generate true statements about reality. The goals of science can be achieved only by engaging in an evaluation of statements purporting to be true, weeding out the false ones, and limiting the true ones to their proper contexts. Its centrality to the scientific enterprise can be observed in the privileges accorded to critical thinking in scientific discourse. It usually trumps all other considerations, including tact, when it appears in a venue that considers itself to be scientific.

A proposition is a statement that claims to be true, a statement that claims to be a good guide to reality. Not all statements that sound as if they may be true or false function as propositions, so the first step in critical thinking is often to consider whether a proposition is really being advanced. For example, "I knew this was going to happen" is often an effort to save face or to feel some control over an unfortunate event rather than an assertion of foreknowledge, even though it sounds like one. Conversely, a statement may not sound as if it has a truth element, but on inspection, one may be discovered. "Read Shakespeare" may sometimes be translated as the proposition, "Private events are hard to observe directly, so one way to learn more about humans is to observe public representations of private thoughts as described in context by celebrated writers." Critical thinking must evaluate statements properly stated as propositions; many disagreements are settled simply by ascertaining what, if anything, is being proposed. In research, it is useful to state hypotheses explicitly and to define the terms of the hypotheses in a way that allows all parties to the conversation to understand exactly what is being claimed.

Critical thinking contextualizes propositions; it helps the thinker consider when a proposition is true or false, not just whether it is true or false. If a proposition is always true, then it is either a tautology or a natural law. A tautology is a statement that is true by definition: "All ermines are white" is a tautology in places where nonwhite ermines are called weasels. A natural law is a proposition that is true in all situations, such as the impossibility of traveling faster than light in a vacuum. The validity of all other propositions depends on the situation. Critical thinking qualifies this validity by specifying the conditions under which they are good guides to reality.

Logic is a method of deriving true statements from other true statements. A fallacy occurs when a false statement is derived from a true statement. This entry discusses methods for examining propositions and describes the obstacles to critical thinking.

Seven Questions

Critical thinking takes forms that have proven effective in evaluating the validity of propositions.
Generally, critical thinkers ask, in one form or another, the following seven questions:

1. What does the statement assert? What is asserted by implication?
2. What constitutes evidence for or against the proposition?
3. What is the evidence for the proposition? What is the evidence against it?
4. What other explanations might there be for the evidence?
5. To which circumstances does the proposition apply?
6. Are the circumstances currently of interest like the circumstances to which the proposition applies?
7. What motives might the proponent of the proposition have besides validity?

What Does the Statement Assert? What Is Asserted by Implication?

The proposition Small schools produce better citizens than large schools do can be examined as an illustrative example. The first step requires the critical thinker to define the terms of the proposition. In this example, the word better needs elaboration, but it is also unclear what is meant by citizen. Thus, the proponent may mean that better citizens are those who commit fewer crimes or perhaps those who are on friendly terms with a larger proportion of their communities than most citizens.

Critical thinkers are alert to hidden tautologies, or to avoiding the fallacy of begging the question, in which begging is a synonym for pleading (as in pleading the facts in a legal argument) and question means the proposition at stake. It is fallacious to prove something by assuming it. In this example, students at smaller schools are bound to be on speaking terms with a higher proportion of members of the school community than are students at larger schools, so if that is the definition of better citizenship, the proposition can be discarded as trivial. Some questions at stake are so thoroughly embedded in their premises that only very deep critical thinking, called deconstruction, can reveal them. Deconstruction asks about implied assumptions of the proposition, especially about unspoken dualities. Thus, a critical thinker would want to examine whether it makes sense to consider one citizen better than another or whether the proposition is implying that schools are responsible for social conduct rather than for academics.

Critical thinkers are also alert to artificial categories. When categories are implied by a proposition, they need to be examined as to whether they really exist. Most people would accept the reality of the category school in the contemporary United States, but not all societies have clearly demarcated mandatory institutions where children are sent during the day. It is far from clear that the categories of smaller schools and larger schools stand up to scrutiny, because school populations, though not falling on a smooth curve, are more linear than categorical. The proponent might switch the proposition to School size predicts later criminal activity.

What Constitutes Evidence for or Against the Proposition?

Before evidence is evaluated for its effect on validity, it must be challenged by questions that ask whether it is good evidence of anything. This is generally what is meant by reliability. If a study examines a random sample of graduates' criminal records, critical thinkers will ask whether the sample is truly random, whether the available criminal records are accurate and comprehensive, whether the same results would be obtained if the same records were examined on different days by different researchers, and whether the results were correctly transcribed to the research protocols.

It is often said that science relies on evidence rather than on ipse dixits, which are propositions accepted solely on the authority of the speaker. This is a mistaken view, because all propositions ultimately rest on ipse dixit. In tracking down criminal records, for example, researchers will eventually take someone's—or a computer's—word for something.

What Is the Evidence for the Proposition? What Is the Evidence Against It?

These questions are useful only if they are asked, but frequently people ask only about the evidence on the side they are predisposed to believe. When they do remember to ask, people
have a natural tendency, called confirmation bias, to value the confirming evidence and to dismiss the contradictory evidence.

Once evidence is adduced for a proposition, one must consider whether the very same evidence may stand against it. For example, once someone has argued that distress in a child at seeing a parent during a stay in foster care is a sign of a bad relationship, it is difficult to use the same distress as a sign that it is a good relationship. But critical thinking requires questioning what the evidence is evidence of.

What Other Explanations Might There Be for the Evidence?

To ensure that an assertion of causality is correct, either one must be able to change all but the causal variable and produce the same result, or one must be able to change only the proposed cause and produce a different result. In practice, especially in the social areas of science, this never happens, because it is extremely difficult to change only one variable and impossible to change all variables except one. Critical thinkers identify variables that changed along with the one under consideration. For example, it is hard to find schools of different sizes that also do not involve communities with different amounts of wealth, social upheaval, or employment opportunities. Smaller schools may more likely be private schools, which implies greater responsiveness to the families paying the salaries, and it may be that responsiveness and accountability are more important than size per se.

To Which Circumstances Does the Proposition Apply?

If a proposition is accepted as valid, then it is either a law of nature—always true—or else it is true only under certain circumstances. Critical thinkers are careful to specify these circumstances so that the proposition does not become overly generalized. Thus, even if a causal relationship were accepted between school size and future criminality, the applicability of the proposition might have to be constricted to inner cities or to suburbs, or to poor or rich schools, or to schools where entering test scores were above or below a certain range.

Are the Circumstances Currently of Interest Like the Circumstances to Which the Proposition Applies?

Once a proposition has been validated for a particular set of circumstances, the critical thinker examines the current situation to determine whether it is sufficiently like the validating circumstances to apply the proposition to it. In the social sciences, there are always aspects of the present case that make it different from the validating circumstances. Whether these aspects are different enough to invalidate the proposition is a matter of judgment. For example, the proposition relating school size to future criminality could have been validated in California, but it is unclear whether it can be applied to Texas. Critical thinkers form opinions about the similarity or differences between situations after considering reasons to think the current case is different from or similar to the typical case.

What Motives Might the Proponent of the Proposition Have Besides Validity?

The scientific community often prides itself on considering the content of an argument rather than its source, purporting to disdain ad hominem arguments (those being arguments against the proponent rather than against the proposition). However, once it is understood that all evidence ultimately rests on ipse dixits, it becomes relevant to understand the motivations of the proponent. Also, critical thinkers budget their time to examine relevant propositions, so a shocking idea from a novice or an amateur or, especially, an interested party is not always worth examining. Thus, if a superintendent asserts that large schools lead to greater criminality, one would want to know what this official's budgetary stake was in the argument. Also, this step leads the critical thinker full circle, back to the question of what is actually being asserted. If a high school student says that large schools increase criminality, he may really be asking to transfer to a small school, a request that may not depend on the validity of the proposition. Thus, critical thinkers ask who is making the assertion, what is at stake for the speaker, from what position the speaker is speaking, and under what conditions or constraints, to what audience, and with what kind of language. There may be no
clear or definitive answers to these questions; however, the process of asking is one that actively engages the thinking person in the evaluation of any proposition as a communication.

The Role of Theory

When critical thinkers question what constitutes good evidence, or whether the current situation is like or unlike the validating circumstances, or what motives the proponent may have, how do they know which factors to consider? When they ask about other explanations, where do other explanations come from? Theory, in the sense of a narrative that describes reality, provides these factors and explanations. To use any theorist's theory to address any of these issues is to ask what the theorist would say about them.

Multiculturalism

Multiculturalism provides another set of questions to examine the validity of and especially to constrict the application of a proposition. Someone thinking of large suburban high schools and small suburban parochial schools may think that the proposition relating school size to criminality stands apart from race, sex, and ethnicity. Multicultural awareness reminds us to ask whether these factors matter.

Obstacles to Critical Thinking

If critical thinking is such a useful process for getting at the truth, for producing knowledge that may more efficiently and productively guide our behavior—if it is superior to unquestioning acceptance of popular precepts, common sense, gut instinct, religious faith, or folk wisdom—then why is it not more widespread? Why do some people resist or reject critical thinking? There are several reasons that it can be upsetting. Critical thinkers confront at least six obstacles: losing face, losing faith, losing friends, thinking the unthinkable, challenging beliefs, and challenging believing.

Losing Face

Critical thinking can cause people to lose face. In nonscientific communities, that is, in communities not devoted to generating true statements about reality, propositions are typically met with a certain amount of tact. It can be awkward to question someone's assertions about reality, and downright rude to challenge their assertions about themselves. When people say something is true and it turns out not to be true, or not to be always true, they lose face. Scientists try to overcome this loss of face by providing a method of saving face, namely, by making a virtue of self-correction and putting truth ahead of pride. But without a commitment to science's values, critical thinking can lead to hurt feelings.

Losing Faith

Religious faith is often expressed in spiritual and moral terms, but sometimes it is also expressed in factual terms—faith that certain events happened at a certain time or that the laws of nature are sometimes transcended. When religion takes a factual turn, critical thinking can oppose it, and people can feel torn between religion and science. Galileo said that faith should concern itself with how to go to heaven, and not with how the heavens go. When people have faith in how reality works, critical thinking can become the adversary of faith.

Losing Friends

Human beings are social; we live together according to unspoken and spoken agreements, and our social networks frequently become communities of practice wherein we express our experiences and observations in terms that are agreeable to and accepted by our friends and immediate communities. Among our intimates and within our social hierarchies, we validate each other's views of reality and find such validation comforting. We embed ourselves in like-minded communities, where views that challenge our own are rarely advanced, and when they are, they and their proponents are marginalized—labeled silly, dangerous, crazy, or unreasonable—but not seriously considered. This response to unfamiliar or disquieting propositions strengthens our status as members of the ingroup and differentiates us from the outgroup. Like-minded communities provide reassurance, and critical thinking—which looks with a curious, analytical, and fearless eye on propositions—threatens not only one's worldview but one's social ties.
Critical Value 307
When using a one-tailed test (either left-tailed or right-tailed), the CV will be on either the left or the right side of the mean. Whether the CV is on the left or right side of the mean depends on the conditions of the alternative hypothesis. For example, a scientist might be interested in increasing the average life span of a fruit fly; therefore, the alternative hypothesis might be H1: μ > 40 days. Subsequently, the CV is on the right side of the mean. Likewise, the null hypothesis would be rejected only if the sample mean is greater than 40 days. This example would be referred to as a one-tailed right test.

To use the CV to determine the significance of a statistic, the researcher must state the null and alternative hypotheses; set the level of significance, or alpha level, at which the null hypothesis will be rejected; and compute the test value (and the corresponding degrees of freedom, or df, if necessary). The investigator can then use that information to select the CV from a table (or calculation) for the appropriate test and compare it to the statistic. The statistical test the researcher chooses to use (e.g., z-score test, z test, single sample t test, independent samples t test, dependent samples t test, one-way analysis of variance, Pearson product-moment correlation coefficient, chi-square) determines which table he or she will reference to obtain the appropriate CV (e.g., z-distribution table, t-distribution table, F-distribution table, Pearson's table, chi-square distribution table). These tables are often included in the appendixes of introductory statistics textbooks.

For the following examples, the Pearson's table, which gives the CVs for determining whether a Pearson product-moment correlation (r) is statistically significant, is used. Using an alpha level of .05 for a two-tailed test, with a sample size of 12 (df = 10), the CV is .576. In other words, for a correlation to be statistically significant at the .05 significance level using a two-tailed test for a sample size of 12, the absolute value of Pearson's r must be greater than or equal to .576. Using a significance level of .05 for a one-tailed test, with a sample size of 12 (df = 10), the CV is .497. Thus, for a correlation to be statistically significant at the .05 level using a one-tailed test for a sample size of 12, the absolute value of Pearson's r must be greater than or equal to .497.

When using a statistical table to reference a CV, it is sometimes necessary to interpolate, or estimate values, between CVs in a table because such tables are not exhaustive lists of CVs. For the following example, the t-distribution table is used. Assume that we want to find the critical t value that corresponds to 42 df using a significance level, or alpha, of .05 for a two-tailed test. The table has CVs only for 40 df (CV = 2.021) and 50 df (CV = 2.009). In order to calculate the desired CV, we must first find the distance between the two known dfs (50 − 40 = 10). Then we find the distance between the desired df and the lower known df (42 − 40 = 2). Next, we calculate the proportion of the distance that the desired df falls from the lower known df (2/10 = .20). Then we find the distance between the CVs for 40 df and 50 df (2.021 − 2.009 = .012). The desired CV is .20 of the distance between 2.021 and 2.009 (.20 × .012 = .0024). Since the CVs decrease as the dfs increase, we subtract .0024 from the CV for 40 df (2.021 − .0024 = 2.0186); therefore, the CV for 42 df with an alpha of .05 for a two-tailed test is t = 2.0186.

Typically, individuals do not need to reference statistical tables because statistical software packages, such as SPSS (an IBM company, formerly called PASW Statistics), indicate in the output whether a test value is significant and the level of significance. Furthermore, the computer calculations are more accurate and precise than the information presented in statistical tables.

Michelle J. Boyd

See also Alternative Hypotheses; Degrees of Freedom; Null Hypothesis; One-Tailed Test; Significance, Statistical; Significance Level, Concept of; Two-Tailed Test

Further Readings

Heiman, G. W. (2003). Basic statistics for the behavioral sciences. Boston: Houghton Mifflin.

CRONBACH'S ALPHA

See Coefficient Alpha
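The interpolation arithmetic above is easy to check with a short script (an illustrative sketch using the same tabled values; a statistics package would give the exact critical value directly):

```python
# Linear interpolation of a critical t value between tabled df entries,
# following the worked example above.
df_lo, cv_lo = 40, 2.021   # tabled CV for 40 df, alpha = .05, two-tailed
df_hi, cv_hi = 50, 2.009   # tabled CV for 50 df
df_target = 42

# Proportion of the df gap covered: (42 - 40) / (50 - 40) = .20
frac = (df_target - df_lo) / (df_hi - df_lo)

# CVs shrink as df grows, so move .20 of the CV gap down from cv_lo:
# 2.021 - .20 * (2.021 - 2.009) = 2.021 - .0024 = 2.0186
cv_target = cv_lo - frac * (cv_lo - cv_hi)

print(round(cv_target, 4))  # 2.0186
```

The same two-point scheme applies to any statistical table that lists critical values at coarse df intervals.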
Crossover Design 309
design should not be used in clinical trials in which a treatment cures a disease and no underlying condition remains for the next treatment period. Crossover designs are typically used for persistent conditions that are unlikely to change over the course of the study. The carryover effect can cause problems with data analysis and interpretation of results in a crossover design. The carryover prevents the investigators from determining whether the significant effect is truly due to a direct treatment effect or whether it is a residual effect of other treatments. In multiple regressions, the carryover effect often leads to multicollinearity, which leads to erroneous interpretation. If the crossover design is used and a carryover effect exists, a design should be used in which the carryover effect will not be confounded with the period and treatment effects. The carryover effect makes the design less efficient and more time-consuming.

The period effect occurs in crossover design because of the conditions that are present at the time the observed values are taken. These conditions systematically affect all responses that are taken during that time, regardless of the treatment or the subject. For example, the subjects may need to spend several hours to complete all the treatments in sequence. The subjects may become fatigued over time. The fatigue factor may tend to systematically impact all the treatment effects of all the subjects, and the researcher cannot control this situation. If the subject is diseased during the first period, regardless of treatment, and the subject is disease free by the time the second period starts, this situation is also a period effect. In many crossover designs, the timing and spacing of periods is relatively loose. In clinical trials, the gap between periods may vary and may depend on when the patient can come in. Large numbers of periods are not suitable for animal feeding trials and clinical trials with humans; they may be used in psychological experiments with up to 128 periods.

Sequence effect is another issue in crossover design. Sequence refers to the order in which the treatments are applied. The possible sets of sequences that might be used in a design depend on the number of the treatments, the length of the sequences, and the aims of the experiment or trial. For instance, with t treatments there are t! possible sequences. The measurement of sequence effect is the average response over the sequences. The simplest design is the two-treatment–two-period, or 2 × 2, design. Different treatment sequences are used to eliminate sequence effects. The crossover design cannot accommodate a separate comparison group. Because each experimental unit receives all treatments, the covariates are balanced. The goal of the crossover design is to compare the effects of individual treatments, not the sequences themselves.

Variance Balance and Unbalance

In the feedstuffs example above, each treatment occurs one time in each experimental unit (or subject) and each of the six three-treatment sequences occurs two times, which confers the property of balance known as variance balance. In general, a crossover design is balanced for carryover effect if all possible sequences are used an equal number of times in the experiment, and each treatment occurs an equal number of times in each period and occurs once with each experimental unit. However, this is not always possible; deaths and dropouts may occur in a trial, which can lead to unequal numbers in sequences.

In the example above and Table 1, A → B occurs twice, once each in Sequences 1 and 3; and A → C occurs once each in Sequences 4 and 5. Similarly, B → A, B → C, C → A, and C → B each occur twice. For Experimental Units 1 and 2, A → B brings the first-order carryover effect in this experiment and changes the response in the first period following the application of the treatment; similarly, the second-order carryover effect changes the response in the second period following the application of the treatment. A 21-day rest period is used to wash out the effect of treatment before the next treatment is applied. In the absence of carryover effects, the response measurements reflect only the current treatment effect.

In a variance-balanced crossover design, all treatment contrasts are equally precise; for instance, in the example above, we have

    var(τA − τB) = var(τB − τC) = var(τA − τC),

where var = variance and τ = a treatment group mean.

The contrasts of the carryover effects are also equally precise. The treatment and carryover
λ1 and λ2, respectively, in Sequence 1. Likewise, the first-order carryover effects of Treatments B and C are λ2 and λ3, respectively, in Sequence 2.

Actually, crossover designs are specific repeated measures designs with observed values on each experimental unit that are repeated under different treatment conditions at different time points. The design provides a multivariate observation for each experimental unit.

The univariate analysis of variance can be used for crossover designs if any of the assumptions of independence, compound symmetry, or the Huynh–Feldt condition are appropriate for the experimental errors. Whereas independence and compound symmetry are sufficient conditions to justify ordinary least squares, the Huynh–Feldt condition (Type H structure) is both a sufficient and a necessary condition for the use of ordinary least squares. There are two cases of ANOVA for crossover design: The first is the analysis of variance without carryover effect, which applies if the crossover design is a balanced row–column design. The experimental units and periods are the rows and columns of the design, and the direct treatment effects are orthogonal to the columns. The analysis approach is the same as that of a Latin square experiment. The second is the analysis of variance with carryover effect, which is treated as a repeated measures split-plot design with the subjects as whole plots and the repeated measures over the p periods as the subplots. The total sum of squares is calculated from between- and within-subjects parts. The significance of the carryover effects must be determined before the inference is made on the comparison of the direct effects of treatments.

For normal data without missing values, a least squares analysis to obtain treatment, period, and subject effects is efficient. Whenever there are missing data, within-subject treatment comparisons are not available for every subject. Therefore, additional between-subject information must be used.

In crossover design, the dependent variable may not be normally distributed; it may be, for example, an ordinal or a categorical variable. The crossover analysis becomes much more difficult if a period effect exists. Two basic approaches can be considered in this case.

Approaches for Nonnormal Data

If the dependent variables are continuous but not normal, the Wilcoxon rank sum test can be applied to compare the treatment effects and period effects. The crossover should not be used if there is a carryover effect, which cannot be separated from treatment effects or period effects. For this case, bootstrapping, permutation, or randomization tests provide alternatives to normal theory analyses. Gail Tudor and Gary G. Koch described the nonparametric method and its limitations for statistical models with baseline measurements and carryover effects. Actually, this method is an extension of the Mann–Whitney, Wilcoxon, or Quade statistics. Other analytical methods are available if there is no carryover effect. For most designs, nonparametric analysis is much more limited than a parametric analysis.

Approaches for Ordinal and Binary Data

New statistical techniques have been developed to deal with longitudinal data of this type over recent decades. These new techniques can also be applied in the analysis of crossover design. A marginal approach using a weighted least squares method was proposed by J. R. Landis and others. A generalized estimating equation approach was developed by Kung-Yee Liang and colleagues in 1986. For binary data, subject effect models and marginal effect models can be applied. Bootstrap and permutation tests are also useful for this type of data. Variance and covariance structures need not necessarily be considered. Different estimates of the treatment effect may be obtained on the basis of different assumptions of the models. Although researchers have developed different approaches for different scenarios, each approach has its own problems or limitations. For example, marginal models can be used to deal with missing data, such as dropout values, but the designs lose efficiency, and their estimates are calculated with less precision. The conditional approach loses information about patient behavior, and it is restricted to the logit link function. M. G. Kenward and B. Jones in 1994 gave a more comprehensive discussion of different approaches. Usually, large sample sizes are needed when the data are ordinal or dichotomous, which is a limitation of crossover designs.

Many researchers have tried to find an "optimal" method. J. Kiefer in the 1970s proposed the concept of universal optimality and named D-, A-,
and E-optimality. The design is universally optimal and satisfies other optimal conditions, but it is extremely complicated to find a universally optimal crossover design for any given scenario.

Ying Liu

See also Block Design; Repeated Measures Design

Further Readings

Davis, A. W., & Hall, W. B. (1969). Cyclic change-over designs. Biometrika, 56, 283–293.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lasserre, V. (1991). Determination of optimal design using linear models in crossover trials. Statistics in Medicine, 10, 909–924.
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

CROSS-SECTIONAL DESIGN

The methods used to study development are as varied as the theoretical viewpoints on the process itself. In fact, often (but surely not always) the researcher's theoretical viewpoint determines the method used, and the method used usually reflects the question of interest. Age correlates with all developmental changes but poorly explains them. Nonetheless, it is often a primary variable of concern in developmental studies. Hence the two traditional research designs: longitudinal methods, which examine one group of people (such as people born in a given year), following and reexamining them at several points in time (such as in 2000, 2005, and 2010), and cross-sectional designs, which examine more than one group of people (of different ages) at one point in time. For example, a study of depression might examine adults of varying ages (say 40, 50, and 60 years old) in 2009.

Cross-sectional studies are relatively inexpensive and quick to conduct (researchers can test many people of different ages at the same time), and they are the best way to study age differences (not age changes). On the other hand, a cross-sectional study cannot provide a very rich picture of development; by definition, such a study examines one small group of individuals at only one point in time. Finally, it is difficult to compare groups with one another because, unlike in a longitudinal design, participants do not act as their own controls. Cross-sectional studies are quick and relatively simple, but they do not provide much information about the ways individuals change over time.

As with longitudinal designs, cross-sectional designs result in another problem: the confounding of age with another variable—the cohort (usually thought of as year of birth). Confounding is the term used to describe a lack of clarity about whether one or another variable is responsible for observed results. In this case, we cannot tell whether the obtained results are due to age (reflecting changes in development) or some other variable.

Confounding refers to a situation in which the effects of two or more variables on some outcome cannot be separated. Cross-sectional studies confound the time of measurement (year of testing) and age. For example, suppose you are studying the effects of an early intervention program on later social skills. If you use a new testing tool that is very sensitive to the effects of early experience, you might find considerable differences among differently aged groups, but you will not know whether the differences are attributable to the year of birth (when some cultural influence might have been active) or to age. These two variables are confounded.

What can be done about the problem of confounding age with other variables? K. Warner Schaie first identified cohort and time of testing as factors that can help explain developmental outcomes, and he also devised methodological tools to account for and help separate the effects of age, time of testing, and cohort. According to Schaie, age differences among groups represent maturational factors, differences caused by when a group was tested (time of testing) represent environmental effects, and cohort differences represent environmental or hereditary effects or an interaction between the two. For example, Paul B. Baltes and John R. Nesselroade found that differences in the performance of adolescents of the same age on a set of personality tests were related to the year in
which the adolescents were born (cohort) as well as when these characteristics were measured (time of testing).

Sequential development designs help to overcome the shortcomings of both cross-sectional and longitudinal developmental designs, and Schaie proposed two alternative models for developmental research—the longitudinal sequential design and the cross-sectional sequential design—that avoid the confounding that results when age and other variables compete for attention. Cross-sectional sequential designs are similar to longitudinal sequential designs except that they do not repeat observations on the same people from the cohort; rather, different groups are examined from one testing time to the next. For example, participants tested in 2000, 2005, and 2010 would all come from different sets of participants born in 1965. Both of these designs allow researchers to keep certain variables (such as time of testing or cohort) constant while they test the effects of others.

Neil J. Salkind

See also Control Variables; Crossover Design; Independent Variable; Longitudinal Design; Research Hypothesis; Research Question; Sequential Design

Further Readings

Birren, J. E., & Schaie, K. W. (Eds.). (2006). Handbook of the psychology of aging (6th ed.). San Diego, CA: Elsevier.
Schaie, K. W. (1992). The impact of methodological changes in gerontology. International Journal of Aging and Human Development, 35, 19–29.
Sneve, M., & Jorde, R. (2008). Cross-sectional study on the relationship between body mass index and smoking, and longitudinal changes in body mass index in relation to change in smoking status: The Tromsø study. Scandinavian Journal of Public Health, 36(4), 397–407.

CROSS-VALIDATION

Cross-validation is a data-dependent method for estimating the prediction error of a fitted model or a trained algorithm. The basic idea is to divide the available data into two parts, called training data and testing data, respectively. The training data are used for fitting the model or training the algorithm, while the testing data are used for validating the performance of the fitted model or the trained algorithm for prediction purposes.

A typical proportion of training data might be roughly 1/2 or 1/3 when the data size is large enough. The division of the data into training and testing parts can be done naturally or randomly. In some applications, a large enough subgroup of the available data is collected independently of the other parts of the data by different people or institutes, or through different procedures but for similar purposes. Naturally, that part of the data can be extracted and used for testing purposes only. If such a subgroup does not exist, one can randomly draw a predetermined proportion of data for training purposes and leave the rest for testing.

K-Fold Cross-Validation

In many applications, the amount of available data is not large enough for a simple cross-validation. Instead, K-fold cross-validation is commonly used to extract more information from the data. Unlike the simple training-and-testing division, the available data are randomly divided into K roughly equal parts. Each part is chosen in turn for testing purposes, and each time the remaining (K − 1) parts are used for training purposes. The prediction errors from all the K validations are collected, and the sum is used for cross-validation purposes.

In order to formalize the K-fold cross-validation using common statistical notations, suppose the available data set consists of N observations or data points. The ith observation includes predictor(s) x_i in scalar (or vector) form and response y_i, also known as input and output, respectively. Suppose a random partition of the data divides the original index set {1, 2, . . . , N} into K subsets I(1), I(2), . . . , I(K) with roughly equal sizes. For the kth subset, let f̂_k(·) be the fitted prediction function based on the rest of the data after removing the kth subset. Then the K-fold cross-validation targets the average prediction error defined as
    CV = (1/N) Σ_{k=1}^{K} Σ_{i ∈ I(k)} L(y_i, f̂_k(x_i)).

In the above expression, CV is the average prediction error and L(·, ·) is a predetermined function, known as the loss function, which measures the difference between the observed response y_i and the predicted value f̂_k(x_i). Commonly used loss functions L(y, f̂) include the squared loss function (y − f̂)²; the absolute loss function |y − f̂|; the 0–1 loss function, which is 0 if y = f̂ and 1 otherwise; and the cross-entropy loss function −2 log P̂(Y = y_i | x_i).

The result CV of K-fold cross-validation depends on the value of K used. Theoretically, K can be any integer between 2 and the data size N. Typical values of K include 5 and 10. Generally speaking, when K is small, CV tends to overestimate the prediction error because each training part is only a fraction of the full data set. As K gets close to N, the expected bias decreases.

Cross-Validation Applications

The idea of cross-validation can be traced back to the 1930s. It was further developed and refined in the 1960s. Nowadays, it is widely used, especially when the data are unstructured or few model assumptions can be made. The cross-validation procedure does not require distribution assumptions, which makes it flexible and robust.

Choosing Parameter

One of the most successful applications of cross-validation is to choose a smoothing parameter or penalty coefficient. For example, a researcher wants to find the best function f for prediction purposes but is reluctant to add many restrictions. To avoid the overfitting problem, the researcher tries to minimize a penalized residual sum of squares defined as follows:

    RSS(f, λ) = Σ_{i=1}^{N} [y_i − f(x_i)]² + λ ∫ [f″(x)]² dx.
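The K-fold procedure formalized above (random partition, fit on the K − 1 retained parts, test on the held-out part, average the losses) can be sketched in a few lines of Python. This is an illustrative sketch only, not from the encyclopedia: the function names and the mean-only toy model are invented, and the squared loss is assumed.

```python
import random

def k_fold_cv(xs, ys, fit, predict, k=5, seed=0):
    """Average prediction error CV = (1/N) * sum over folds of the
    squared loss L(y_i, f_hat_k(x_i)), as in the definition above."""
    n = len(ys)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # random partition of the index set
    folds = [idx[j::k] for j in range(k)]   # K roughly equal parts I(1), ..., I(K)
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:                      # evaluate on the held-out part only
            total += (ys[i] - predict(model, xs[i])) ** 2
    return total / n

# Toy model for illustration: ignore x and predict the training-mean response.
mean_fit = lambda xs, ys: sum(ys) / len(ys)
mean_predict = lambda model, x: model

cv_err = k_fold_cv(list(range(6)), [3.0] * 6, mean_fit, mean_predict, k=3)
print(cv_err)  # 0.0 (a constant response is predicted exactly)
```

Swapping in a real estimator for `fit`/`predict`, or looping this over candidate values of a penalty λ, gives the smoothing-parameter selection described above.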
Two major concerns need to be addressed for the procedure. One of them is that the estimated prediction error may strongly depend on how the data are divided. The value of CV itself is random if it is calculated on the basis of a random partition. In order to compare two random CVs, one may need to repeat the whole procedure many times and compare the CV values on average. Another concern is that the selected model may succeed for one data set but fail for another because cross-validation is a data-driven method. In practice, people tend to test the selected model again when additional data are available via other sources before they accept it as the best one.

Cross-validation may also be needed during fitting a model as part of the model selection procedure. In the previous example, cross-validation is used for choosing the most appropriate smoothing parameter λ. In this case, the data may be divided into three parts: training data, validation data, and testing data. One may use the validation part to choose the best λ for fitting the model and use the testing part for model selection.

Jie Yang

See also Bootstrapping; Jackknife

Further Readings

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.

CUMULATIVE FREQUENCY DISTRIBUTION

Cumulative frequency distributions report the frequency, proportion, or percentage of cases at a particular score or less. Thus, the cumulative frequency of a score is calculated as the frequency of occurrence of that score plus the sum of the frequencies of all scores with a lower value. Cumulative frequency distributions are usually displayed with the aid of tables and graphs and may be put together for both ungrouped and grouped scores.

Cumulative Frequency Tables for Distributions With Ungrouped Scores

A cumulative frequency table for distributions with ungrouped scores typically includes the scores a variable takes in a particular sample, their frequencies, and the cumulative frequency. In addition, the table may include the cumulative relative frequency, or proportion, and the cumulative percentage frequency. Table 1 illustrates the frequency, cumulative frequency, cumulative relative frequency, and cumulative percentage frequency for a set of data showing the number of credits a sample of students at a college have registered for in the autumn quarter.

The cumulative frequency is obtained by adding the frequency of each observation to the sum of the frequencies of all previous observations (which is, actually, the cumulative frequency on the previous row). For example, the cumulative frequency for the first row in Table 1 is 1 because there are no previous observations. The cumulative frequency for the second row is 1 + 0 = 1. The cumulative frequency for the third row is 1 + 2 = 3. The cumulative frequency for the fourth row is 3 + 1 = 4, and so on. This means that four students have registered for 13 credits or fewer in the autumn quarter. The cumulative frequency for the last observation must equal the number of observations included in the sample.

Cumulative relative frequencies, or cumulative proportions, are obtained by dividing each cumulative frequency by the number of observations. Cumulative proportions show the proportion of observations that fulfill a particular criterion or less. For example, the proportion of students who have registered for 14 credits or fewer in the autumn quarter is 0.60. The cumulative proportion for the last observation (last row) is always 1.

Cumulative percentages are obtained by multiplying the cumulative proportions by 100. Cumulative percentages show the percentage of observations that fulfill a certain criterion or less. For example, 40% of students have registered for 13 credits or fewer in the autumn quarter. The cumulative percentage of the last observation (the last row) is always 100.
Table 1   Cumulative Frequency Distribution of the Number of Credits Students Have Registered for in the Autumn Quarter

Number of     Frequency   Cumulative frequency   Cumulative relative frequency   Cumulative percentage frequency
credits (y)   (f)         cum(f)                 cumr(f) = cum(f)/n              cump(f) = 100 × cumr(f)
10            1           1                      0.10                            10.00
11            0           1                      0.10                            10.00
12            2           3                      0.30                            30.00
13            1           4                      0.40                            40.00
14            2           6                      0.60                            60.00
15            4           10                     1.00                            100.00
n = 10
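The column arithmetic in Table 1 can be reproduced with a few lines of Python (an illustrative sketch; the score and frequency lists are copied from the table, and the variable names mirror its headers):

```python
# Rebuilding the cumulative columns of Table 1 from the raw frequencies.
scores = [10, 11, 12, 13, 14, 15]
freqs = [1, 0, 2, 1, 2, 4]
n = sum(freqs)                                  # 10 students

cum = []                                        # cumulative frequency cum(f)
running = 0
for f in freqs:
    running += f                                # this row plus all previous rows
    cum.append(running)

cum_rel = [c / n for c in cum]                  # cumulative relative frequency
cum_pct = [100 * r for r in cum_rel]            # cumulative percentage frequency

print(cum)          # [1, 1, 3, 4, 6, 10]
print(cum_rel)      # [0.1, 0.1, 0.3, 0.4, 0.6, 1.0]
print(cum_pct[-1])  # 100.0
```

As the text notes, the last cumulative frequency must equal n, the last cumulative proportion must equal 1, and the last cumulative percentage must equal 100.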
[Two cumulative frequency graphs accompany the table: one plots cumulative frequency against the number of credits (9 to 16); the other plots cumulative frequency against workers' salaries, in thousands of dollars (10 to 40).]
Further Readings

Hamilton, L. (1996). Data analysis for social sciences: A first course in applied statistics. Belmont, CA: Wadsworth.
Kiess, H. O. (2002). Statistical concepts for the behavioral sciences. Boston: Allyn and Bacon.
Kolstoe, R. H. (1973). Introduction to statistics for the behavioral sciences. Homewood, IL: Dorsey Press.
Lindquist, E. F. (1942). A first course in statistics: Their use and interpretation in education and psychology. Cambridge, MA: Riverside Press.
D

DATABASES

One of the most efficient and increasingly common methods of investigating phenomena in the education and social sciences is the use of databases. Large-scale databases generally comprise information collected as part of a research project. Information included in databases ranges from survey data from clinical trials to psychoeducational data from early childhood projects. Research projects from which databases are derived can be longitudinal or cross-sectional in nature, use multiple or individual informants, be nationally representative or specific to a state or community, and be primary data for the original researcher or secondary data for individuals conducting analysis at a later time. This entry explores the benefits and limitations of using databases in research, describes how to locate databases, and discusses the types of databases and the future of the use of databases in research.

Benefits

The primary advantage of using databases for research purposes is related to economics. Specifically, since databases consist of information that has already been collected, they save researchers time and money because the data are readily available. For many investigators, the primary hindrance to conducting original field research is limited monetary resources. Collecting data from large samples is time-consuming, and many direct and indirect costs are associated with obtaining access to specific populations for collection of specific data. This limitation is eliminated by using large-scale databases. Depending on the topic of interest, the use of databases provides researchers access to randomly sampled and nationally representative populations.

Databases also provide researchers with access to populations they may not have had access to individually. Specifically, the recruitment of individuals from diverse backgrounds (e.g., Black, Latino) has generally been a problem in the social and medical sciences due to historical issues centering on mistrust of researchers (e.g., the Tuskegee Experiment). While this is the case, databases such as the National Institute of Mental Health–funded Collaborative Psychiatric Epidemiology Surveys (CPES) provide access to diverse subjects. Specifically, CPES joins together three nationally representative surveys: the National Comorbidity Survey Replication (NCS-R), the National Survey of American Life (NSAL), and the National Latino and Asian American Study (NLAAS). These studies collectively provide the first national data with sufficient power to investigate cultural and ethnic influences on mental disorders. Although existing databases offer numerous benefits, they have limitations as well.

Limitations

The key limitation of using databases is that the questions and the theoretical orientation of the original
these endeavors, the use of databases will remain a staple in research activities.

Scott Graves

Further Readings

Anderson, C., Fletcher, P., & Park, J. (2007). Early Childhood Longitudinal Study, Birth Cohort (ECLS-B) psychometric report for the 2-year data collection (NCES 2007–084). Washington, DC: National Center for Education Statistics.
Landrigan, P., Trasande, L., Thorpe, L., Gwynn, C., Lioy, P., D'Alton, M., et al. (2006). The national children's study: A 21-year prospective study of 100,000 American children. Pediatrics, 118, 2173–2186.
Markowitz, J., Carlson, E., Frey, W., Riley, J., Shimshak, A., Heinzen, H., et al. (2006). Preschoolers' characteristics, services, and results: Wave 1 overview report from the Pre-Elementary Education Longitudinal Study (PEELS). Rockville, MD: Westat.
NICHD Early Child Care Research Network. (1993). Child care debate: Transformed or distorted? American Psychologist, 48, 692–693.
Pennell, B., Bowers, A., Carr, D., Chardoul, S., Cheung, G., Dinkelmann, K., et al. (2004). The development and implementation of the National Comorbidity Survey Replication, the National Survey of American Life, and the National Latino and Asian American Survey. International Journal of Methods in Psychiatric Research, 13, 241–269.
Rock, D., & Pollack, J. (2002). Early Childhood Longitudinal Study—Kindergarten Class of 1998–99 (ECLS-K) psychometric report for kindergarten through first grade (NCES 2002–05). Washington, DC: National Center for Education Statistics.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W., & Shavelson, R. (2007). Estimating causal effects: Using experimental and observational designs. Washington, DC: American Educational Research Association.
U.S. Department of Health and Human Services. (2002). A descriptive study of Head Start families: FACES technical report I. Washington, DC: U.S. Department of Health and Human Services, Administration for Children and Families.
Wagner, M., Kutash, K., Duchnowski, A., & Epstein, M. (2005). The special education elementary longitudinal study and the national longitudinal transition study: Study designs and implications for children and youth with emotional disturbance. Journal of Emotional and Behavioral Disorders, 13, 25–41.
Wagner, M., Marder, C., Levine, P., Cameto, R., Cadwallader, T., & Blackorby, J. (2003). The individual and household characteristics of youth with disabilities: A report from the National Longitudinal Transition Study-2 (NLTS2). Menlo Park, CA: SRI International.

DATA CLEANING

Data cleaning, or data cleansing, is an important part of the process involved in preparing data for analysis. Data cleaning is a subset of data preparation, which also includes scoring tests, matching data files, selecting cases, and other tasks that are required to prepare data for analysis.

Missing and erroneous data can pose a significant problem to the reliability and validity of study outcomes. Many problems can be avoided through careful survey and study design. During the study, watchful monitoring and data cleaning can catch problems while they can still be fixed. At the end of the study, multiple imputation procedures may be used for data that are truly irretrievable.

The opportunities for data cleaning are dependent on the study design and data collection methods. At one extreme is the anonymous Web survey, with limited recourse in the case of errors and missing data. At the other extreme are longitudinal studies with multiple treatment visits and outcome evaluations. Conducting data cleaning during the course of a study allows the research team to obtain otherwise missing data and can prevent costly data cleaning at the end of the study. This entry discusses problems associated with data cleaning and their solutions.

Types of "Dirty Data"

Two types of problems are encountered in data cleaning: missing data and errors. The latter may be the result of respondent mistakes or data entry errors. The presence of "dirty data" reduces the reliability and validity of the measures. If responses are missing or erroneous, they will not be reliable over time. Because reliability sets the upper bound for validity, unreliable items reduce validity.
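The two kinds of "dirty data" can be screened for mechanically; a minimal sketch in Python (the record, field names, and validity rules are invented for illustration and are not part of the original entry):

```python
# Screen one survey record for the two kinds of "dirty data":
# missing values versus erroneous values. Fields and rules are illustrative.
record = {"name": "Maria Smith", "sex": "X", "income": None, "age": 230}

rules = {
    "sex": lambda v: v in ("M", "F"),                 # allowed codes only
    "age": lambda v: isinstance(v, int) and 0 <= v < 100,
    "income": lambda v: v >= 0,
}

missing = [k for k, v in record.items() if v in (None, "")]
errors = [k for k, check in rules.items()
          if record.get(k) is not None and not check(record[k])]

print("missing:", missing)   # fields left blank
print("errors:", errors)     # fields that fail a validity check
```

A screen like this separates items that must be retrieved from the respondent (missing) from items that must be corrected (errors).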
Table 1  Data Errors and Missing Data

Variable         "True" Data                Incorrect, Incomplete, or Missing Data
Name             Maria Margaret Smith       Maria Smith
Date of birth    2/19/1981                  1981
Sex              F                          M
Ethnicity        Hispanic and Caucasian     (missing)
Education        B.A., Economics            College
Place of birth   Nogales, Sonora, Mexico    Nogales
Annual income    $50,000                    (missing)

Missing Data

Missing data reduce the sample size available for the analyses. An investigator's research design may require 100 respondents in order to have sufficient power to test the study hypotheses. Substantial effort may be required to recruit and treat 100 respondents. At the end of the study, if there are 10 important variables, with each variable missing only 5% of the time, the investigator may be reduced to 75 respondents with complete data for the analyses. Missing data effectively reduce the power of the study. Missing data can also introduce bias because questions that may be embarrassing or reveal anything illegal may be left blank. For example, if some respondents do not answer items about income, place of birth (for immigrants without documents), or drug use, the remaining cases with complete data are a biased sample that is no longer representative of the population.

Data Errors

Data errors are also costly to the study because lowered reliability attenuates the results. Respondents may make mistakes, and errors can be introduced during data entry. Data errors are more difficult to detect than missing data. Table 1 shows examples of missing data (ethnicity, income), incomplete data (date and place of birth), and erroneous data (sex).

Causes

All measuring instruments are flawed, regardless of whether they are in the physical or social sciences. Even with the best intentions, everyone makes errors. In the social sciences, most measures are self-report. Potentially embarrassing items can result in biased responses. Lack of motivation is also an important source of error. For example, respondents will be highly motivated in high-stakes testing such as the College Board exams but probably do not bring the same keen interest to a research study.

Solutions and Approaches

Data problems can be prevented by careful study design and by pretesting of the entire research protocol. After the forms have been collected, the task of data cleaning begins. The following discussion of data cleaning is for a single paper-and-pencil survey collected in person. Data cleaning for longitudinal studies, institutional data sets, and anonymous surveys is addressed in a later section.

Missing Data

The best approach is to fill in missing data as soon as possible. If a data collection form is skimmed when it is collected, the investigator may be able to ask questions at that time about any missing items. After the data are entered, the files can be examined for remaining missing values. In many studies, the team may be able to contact the respondent or fill in basic data from memory. Even if the study team consists of only the principal investigator, it is much easier to fill in missing data after an interview than to do so a year later. At that point, the missing data may no longer be retrievable.

Data Entry Errors

A number of helpful computer procedures can be used to reduce or detect data entry errors, such as double entry or proactive database design. Double entry refers to entering the data twice, in order to ensure accuracy. Careful database design includes structured data entry screens that are limited to specified formats (dates, numbers, or text) or ranges (e.g., sex can only be M or F, and age must be a number less than 100).
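Double entry can be checked mechanically by comparing the two keyings field by field; a minimal sketch in Python (the file contents and field names are invented for illustration):

```python
# Compare two independent keyings of the same records (double entry).
# Any disagreement marks a likely data entry error to be resolved
# against the paper form. Records and field names are illustrative.
entry1 = [{"id": 1, "age": 34, "sex": "F"}, {"id": 2, "age": 57, "sex": "M"}]
entry2 = [{"id": 1, "age": 34, "sex": "F"}, {"id": 2, "age": 75, "sex": "M"}]

discrepancies = []
for rec1, rec2 in zip(entry1, entry2):
    for field in rec1:
        if rec1[field] != rec2[field]:
            discrepancies.append((rec1["id"], field, rec1[field], rec2[field]))

print(discrepancies)  # each tuple: (record id, field, first keying, second keying)
```

Here the transposed age (57 vs. 75) surfaces immediately, which is exactly the class of keying error double entry is meant to catch.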
other instruments. For example, one form from a testing battery may be missing a date that can be inferred from the rest of the packet if the error is caught during data entry.

Longitudinal studies provide a wealth of opportunity to correct errors, provided early attention to data cleaning and data entry has been built into the study. Identification of missing data and errors while the respondent is still enrolled in the study allows investigators to "fill in the blanks" at the next study visit. Longitudinal studies also provide the opportunity to check for consistency across time. For example, if a study collects physical measurements, children should not become shorter over time, and measurements should not move back and forth between feet and meters.

Along with opportunities to catch errors, studies with multiple forms and/or multiple assessment waves also pose problems. The first of these is the file-matching problem. Multiple forms must be matched and merged via computer routines. Different data files or forms may have identification numbers and sorting variables that do not exactly match, and these must be identified and changed before matching is possible.

Documentation

Researchers are well advised to keep a log of data corrections in order to track changes. For example, if a paper data collection form was used, the changes can be recorded on the form, along with the date the item was corrected. Keeping a log of data corrections can save the research team from trying to clean the same error more than once.

Data Integrity

Researcher inexperience and the ubiquitous lack of resources are the main reasons for poor data hygiene. First, experience is the best teacher, and few researchers have been directly responsible for data cleaning or data stewardship. Second, every study has limited resources. At the beginning of the study, the focus will invariably be on developing data collection forms and study recruitment. If data are not needed for annual reports, they may not be entered until the end of the study. The data analyst may be the first person to actually see the data. At this point, the costs required to clean the data are often greater than those required for the actual analyses.

Data Imputation

At the end of the data-cleaning process, there may still be missing values that cannot be recovered. These data can be replaced using data imputation techniques. Imputation refers to the process of replacing a missing value with a reasonable value. Imputation methods range from mean imputation (replacing a missing data point with the average for the entire study) to hot deck imputation (making estimates based on a similar but complete data set), single imputation if the proportion of missing values is small, and multiple imputation. Multiple imputation remains the best technique and will be necessary if the missing data are extensive or the values are not missing at random. All methods of imputation are considered preferable to case deletion, which can result in a biased sample.

Melinda Fritchoff Davis

See also Bias; Error; Random Error; Reliability; Systematic Error; True Score; Validity of Measurement

Further Readings

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Davis, M. F. (2010). Avoiding data disasters and other pitfalls. In S. Sidani & D. L. Streiner (Eds.), When research goes off the rails (pp. 320–326). New York: Guilford.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

DATA MINING

Modern researchers in various fields are confronted by an unprecedented wealth and complexity of data. However, the results available to these researchers through traditional data analysis techniques provide only limited solutions to complex situations. The approach to the huge demand for the analysis and interpretation of these complex data is managed under the name of data mining, or knowledge discovery.
Data mining is defined as the process of extracting useful information from large data sets through the use of any relevant data analysis techniques developed to help people make better decisions. These data mining techniques themselves are defined and categorized according to their underlying statistical theories and computing algorithms. This entry discusses these various data mining methods and their applications.

Types of Data Mining

In general, data mining methods can be separated into three categories: unsupervised learning, supervised learning, and semisupervised learning methods. Unsupervised methods rely solely on the input variables (predictors) and do not take into account output (response) information. In unsupervised learning, the goal is to facilitate the extraction of implicit patterns and elicit the natural groupings within the data set without using any information from the output variable. On the other hand, supervised learning methods use information from both the input and output variables to generate models that classify or predict the output values of future observations. The semisupervised method mixes the unsupervised and supervised methods to generate an appropriate classification or prediction model.

Unsupervised Learning Methods

Unsupervised learning methods attempt to extract important patterns from a data set without using any information from the output variable. Clustering analysis, which is one of the unsupervised learning methods, systematically partitions the data set by minimizing within-group variation and maximizing between-group variation. These variations can be measured on the basis of a variety of distance metrics between observations in the data set. Clustering analysis includes hierarchical and nonhierarchical methods.

Hierarchical clustering algorithms provide a dendrogram that represents the hierarchical structure of clusters. At the highest level of this hierarchy is a single cluster that contains all the observations, while at the lowest level are clusters containing a single observation. Examples of hierarchical clustering algorithms are single linkage, complete linkage, average linkage, and Ward's method.

Nonhierarchical clustering algorithms achieve the purpose of clustering analysis without building a hierarchical structure. The k-means clustering algorithm is one of the most popular nonhierarchical clustering methods. A brief summary of the k-means clustering algorithm is as follows: Given k seed (or starting) points, each observation is assigned to the one of the k seed points closest to the observation, which creates k clusters. Then seed points are replaced with the mean of the currently assigned clusters. This procedure is repeated with updated seed points until the assignments do not change. The results of the k-means clustering algorithm depend on the distance metrics, the number of clusters (k), and the location of seed points. Other nonhierarchical clustering algorithms include k-medoids and self-organizing maps.

Principal components analysis (PCA) is another unsupervised technique and is widely used, primarily for dimensional reduction and visualization. PCA is concerned with the covariance matrix of the original variables, and the eigenvalues and eigenvectors are obtained from the covariance matrix. The product of the eigenvector corresponding to the largest eigenvalue and the original data matrix leads to the first principal component (PC), which expresses the maximum variance of the data set. The second PC is then obtained via the eigenvector corresponding to the second largest eigenvalue, and this process is repeated N times to obtain N PCs, where N is the number of variables in the data set. The PCs are uncorrelated with each other, and generally the first few PCs are sufficient to account for most of the variation. Thus, the PCA plot of observations using these first few PC axes facilitates visualization of high-dimensional data sets.

Supervised Learning Methods

Supervised learning methods use both the input and output variables to provide the model or rule that characterizes the relationships between the input and output variables. Based on the characteristics of the output variable, supervised learning methods can be categorized as either regression or classification. In regression problems, the output variable is continuous, so the main goal is to predict the outcome values of an unknown future observation.
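The k-means steps summarized above (assign each observation to its nearest seed, replace each seed with its cluster mean, repeat until assignments stop changing) can be sketched in a few lines; the one-dimensional data and starting seeds here are invented for illustration:

```python
# One-dimensional k-means, following the steps described in the text.
data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]   # illustrative observations
seeds = [1.0, 9.0]                       # k = 2 starting (seed) points

assignments = None
while True:
    # Assign each observation to the index of its closest seed point.
    new_assign = [min(range(len(seeds)), key=lambda j: abs(x - seeds[j]))
                  for x in data]
    if new_assign == assignments:        # converged: no assignment changed
        break
    assignments = new_assign
    # Replace each seed with the mean of its currently assigned cluster.
    for j in range(len(seeds)):
        members = [x for x, a in zip(data, assignments) if a == j]
        if members:
            seeds[j] = sum(members) / len(members)

print(assignments, seeds)
```

As the entry notes, the result depends on the distance metric, k, and the seed locations; different seeds can converge to different partitions.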
In classification problems, the output variable is categorical, and the goal is to assign existing labels to an unknown future observation.

Linear regression models have been widely used in regression problems because of their simplicity. Linear regression is a parametric approach that provides a linear equation to examine relationships of the mean response to one or to multiple input variables. Linear regression models are simple to derive, and the final model is easy to interpret. However, the parametric assumption of an error term in linear regression analysis often restricts its applicability to complicated multivariate data. Further, linear regression methods cannot be employed when the number of variables exceeds the number of observations. Multivariate adaptive regression splines (MARS) is a nonparametric regression method that compensates for the limitations of ordinary regression models. MARS is one of the few tractable methods for high-dimensional problems with interactions, and it estimates a completely unknown relationship between a continuous output variable and a number of input variables. MARS is a data-driven statistical linear model in which a forward stepwise algorithm is first used to select the model terms and is then followed by a backward procedure to prune the model. The approximation bends at "knot" locations to model curvature, and one of the objectives of the forward stepwise algorithm is to select the appropriate knots. Smoothing at the knots is an option that may be used if derivatives are desired.

Classification methods provide models to classify unknown observations according to the existing labels of the output variable. Traditional classification methods include linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), based on Bayesian theory. Both LDA and QDA assume that the data set follows a normal distribution. LDA generates a linear decision boundary by assuming that populations of different classes have the same covariance. QDA, on the other hand, does not have any restrictions on the equality of covariance between two populations and provides a quadratic equation that may be efficient for linearly nonseparable data sets.

Many supervised learning methods can handle both regression and classification problems, including decision trees, support vector machines, k-nearest neighbors, and artificial neural networks. Decision tree models have gained huge popularity in various areas because of their flexibility and interpretability. Decision tree models are flexible in that the models can efficiently handle both continuous and categorical variables in the model construction. The output of decision tree models is a hierarchical structure that consists of a series of if–then rules to predict the outcome of the response variable, thus facilitating the interpretation of the final model. From an algorithmic point of view, the decision tree model has a forward stepwise procedure that adds model terms and a backward procedure for pruning, and it conducts variable selection by including only useful variables in the model. Support vector machine (SVM) is another supervised learning model popularly used for both regression and classification problems. SVMs use geometric properties to obtain a separating hyperplane by solving a convex optimization problem that simultaneously minimizes the generalization error and maximizes the geometric margin between the classes. Nonlinear SVM models can be constructed from kernel functions that include linear, polynomial, and radial basis functions. Another useful supervised learning method is k-nearest neighbors (kNN). A type of lazy-learning (instance-based learning) technique, kNN does not require a trained model. Given a query point, the k closest points are determined. A variety of distance measures can be applied to calculate how close each point is to the query point. Then the k nearest points are examined to find which category is most common among them, and this category is assigned to the query point being examined. This procedure is repeated for all the points that require classification. Finally, artificial neural networks (ANNs), inspired by the way biological nervous systems learn, are widely used for prediction modeling in many applications. ANN models are typically represented by a network diagram containing several layers (e.g., input, hidden, and output layers) that consist of nodes. These nodes are interconnected with weighted connection lines whose weights are adjusted when training data are presented to the ANN during the training process. The neural network training process is an iterative adjustment of the internal weights to bring the network's output closer to the desired values through minimizing the mean squared error.
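The kNN procedure just described (rank the labeled points by distance to the query, then assign the most common label among the k closest) can be sketched briefly; the toy points, labels, and query are invented for illustration:

```python
from collections import Counter

# k-nearest neighbor classification of a query point, as described:
# find the k closest labeled points, then assign the majority label.
# Points and labels are illustrative.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 4.5)]
labels = ["a", "a", "b", "b", "b"]

def knn_classify(query, k=3):
    # Squared Euclidean distance (any distance measure could be used).
    dist = lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    ranked = sorted(zip(points, labels), key=lambda pl: dist(pl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((3.0, 4.2)))
```

Because kNN stores the data and defers all work to query time, it is called a lazy or instance-based learner, as the entry notes.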
Semisupervised Learning Methods

Semisupervised learning approaches have received increasing attention in recent years. Olivier Chapelle and his coauthors described semisupervised learning as "halfway between supervised and unsupervised learning" (p. 4). Semisupervised learning methods create a classification model by using partial information from the labeled data. One-class classification is an example of a semisupervised learning method that can distinguish between the class of interest (target) and all other classes (outlier). In the construction of the classifiers, one-class classification techniques require only the information from the target class. The applications of one-class classification include novelty detection, outlier detection, and imbalanced classification.

Support vector data description (SVDD) is a one-class classification method that combines a traditional SVM algorithm with a density approach. SVDD produces a classifier to separate the target from the outliers. The decision boundary of SVDD is constructed from an optimization problem that minimizes the volume of the hypersphere from the boundary and maximizes the target data being captured by the boundary.

The main difference between the supervised and semisupervised classification methods is that the former generates a classifier to classify an unknown observation into the predefined classes, whereas the latter gives a closed boundary around the target data in order to separate them from all other types of data.

Applications

Interest in data mining has increased greatly because of the availability of new analytical techniques with the potential to retrieve useful information or knowledge from vast amounts of complex data that were heretofore unmanageable. Data mining has a range of applications, including manufacturing, marketing, telecommunication, health care, biomedicine, e-commerce, and sports. In manufacturing, data mining methods have been applied to predict the number of product defects in a process and identify their causes. In marketing, market basket analysis provides a way to understand the behavior of profitable customers by analyzing their purchasing patterns. Further, unsupervised clustering analyses can be used to segment customers by market potential. In the telecommunication industries, data mining methods help sales and marketing people establish loyalty programs, develop fraud detection modules, and segment markets to reduce revenue loss. Data mining has received tremendous attention in the field of bioinformatics, which deals with large amounts of high-dimensional biological data. Data mining methods combined with microarray technology allow monitoring of thousands of genes simultaneously, leading to a greater understanding of molecular patterns. Clustering algorithms use microarray gene expression data to group the genes based on their level of expression, and classification algorithms use the labels of experimental conditions (e.g., disease status) to build models to classify different experimental conditions.

Data Mining Software

A variety of data mining software is available. SAS Enterprise Miner (www.sas.com), SPSS (an IBM company, formerly called PASW® Statistics) Clementine (www.spss.com), and S-PLUS Insightful Miner (www.insightful.com) are examples of widely used commercial data mining software. In addition, commercial software developed by Salford Systems (www.salford-systems.com) provides CART, MARS, TreeNet, and Random Forests for specialized uses of tree-based models. Free data mining software packages also are available. These include RapidMiner (rapid-i.com), Weka (www.cs.waikato.ac.nz/ml/weka), and R (www.r-project.org).

Seoung Bum Kim and Thuntee Sukchotrat

See also Exploratory Data Analysis; Exploratory Factor Analysis; Ex Post Facto Study

Further Readings

Chapelle, O., Zien, A., & Schölkopf, B. (Eds.). (2006). Semi-supervised learning. Cambridge: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.
Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45–66.

DATA SNOOPING

The term data snooping, sometimes also referred to as data dredging or data fishing, is used to describe the situation in which a particular data set is analyzed repeatedly without an a priori hypothesis of interest. The practice of data snooping, although common, is problematic because it can result in a significant finding (e.g., rejection of a null hypothesis) that is nothing more than a chance artifact of the repeated analyses of the data. The biases introduced by data snooping increase the more a data set is analyzed in the hope of a significant finding. Empirical research that is based on experimentation and observation has the potential to be impacted by data snooping.

Data Snooping and Multiple Hypothesis Testing

A hypothesis test is conducted at a significance level, denoted α, corresponding to the probability of incorrectly rejecting a true null hypothesis (the so-called Type I error). Data snooping essentially involves performing a large number of hypothesis tests on a particular data set with the hope that one of the tests will be significant. This data-snooping process of performing a large number of hypothesis tests results in the actual significance level being increased, or the burden of proof for finding a significant result being substantially reduced, resulting in potentially misleading results. For example, if 100 independent hypothesis tests are conducted on a data set at a significance level of 5%, it would be expected that about 5 out of the 100 tests would yield significant results simply by chance alone, even if the null hypothesis were, in fact, true. Any conclusions of statistical significance at the 5% level based on an analysis such as this are misleading because the data-snooping process has essentially ensured that something significant will be found. This means that if new data are obtained, it is unlikely that the "significant" results found via the data-snooping process would be replicated.

Data-Snooping Examples

Example 1

An investigator obtains data to investigate the impact of a treatment on the mean of a response variable of interest without a predefined view (alternative hypothesis) of the direction (positive or negative) of the possible effect of the treatment. Data snooping would occur in this situation if, after analyzing the data, the investigator observes that the treatment appears to have a negative effect on the response variable and then uses a one-sided alternative hypothesis corresponding to the treatment having a negative effect. In this situation, a two-sided alternative hypothesis, corresponding to the investigator's a priori ignorance of the effect of the treatment, would be appropriate. Data snooping in this example results in the p value for the hypothesis test being halved, resulting in a greater chance of assessing a significant effect of the treatment. To avoid problems of this nature, many journals require that two-sided alternatives be used for hypothesis tests.

Example 2

A data set containing information on a response variable and six explanatory variables is analyzed, without any a priori hypotheses of interest, by fitting each of the 64 multiple linear regression models obtained by means of different combinations of the six explanatory variables, and then only statistically significant associations are reported. The effect of data snooping in this example would be more severe than in Example 1 because the data are being analyzed many more times (more hypothesis tests are performed), meaning that one would expect to see a number of significant associations simply due to chance.
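The 100-tests example can be simulated directly; a minimal sketch in Python (the seeded simulation and variable names are illustrative, not part of the original entry):

```python
import random

random.seed(1)

# Simulate 100 independent tests of true null hypotheses at alpha = .05.
# Under the null, each p value is uniform on [0, 1], so roughly 5 of the
# 100 tests come out "significant" by chance alone.
alpha, n_tests = 0.05, 100
p_values = [random.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in p_values)
print(false_positives)

# A Bonferroni correction tests each hypothesis at alpha / n_tests,
# holding the chance of any false rejection across all 100 tests
# at about alpha overall.
bonferroni_alpha = alpha / n_tests
print(sum(p < bonferroni_alpha for p in p_values))
```

Running the first block repeatedly with different seeds gives counts scattered around 5, which is the chance-artifact phenomenon the entry describes; the corrected threshold in the second block is one of the adjustments discussed in the next section.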
Correcting for Data Snooping

The ideal way to avoid data snooping is for an investigator to verify any significant results found via a data-snooping process by using an independent data set. Significant results not replicated on the independent data set would then be viewed as spurious results that were likely an artifact of the data-snooping process. If an independent data set is obtainable, then the initial data-snooping process may be viewed as an initial exploratory analysis used to inform the investigator of hypotheses of interest. In cases in which an independent data set is not possible or very expensive, the role of an independent data set can be mimicked by randomly dividing the original data into two smaller data sets: one half for an initial exploratory analysis (the training set) and the other half for validation (the validation set). Due to prohibitive cost and/or time, obtaining an independent data set or a large enough data set for dividing into training and validation sets may not be feasible. In such situations, the investigator should describe exactly how the data were analyzed, including the number of hypothesis tests that were performed in finding statistically significant results, and then report results that are adjusted for multiple hypothesis-testing effects. Methods for adjusting for multiple hypothesis testing include the Bonferroni correction, Scheffé's method, Tukey's test, and, more recently, the false discovery rate. The relatively simple Bonferroni correction works by conducting individual hypothesis tests at level of significance α/g, where g is the number of hypothesis tests carried out. Performing the individual hypothesis tests at level of significance α/g provides a crude means of maintaining an overall level of significance of at most α. Model averaging methods that combine information from every analysis of a data set are another alternative for alleviating the problems of data snooping.

Data mining, a term used to describe the process of exploratory analysis and extraction of useful information from data, is sometimes confused with data snooping. Data snooping is sometimes the result of the misuse of data-mining methods, such as the framing of specific alternative hypotheses in response to an observation arising out of data mining.

Michael A. Martin and Steven Roberts

See also Bonferroni Procedure; Data Mining; Hypothesis; Multiple Comparison Tests; p Value; Significance Level, Interpretation and Construction; Type I Error

Further Readings

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: W. W. Norton.
Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73, 1237–1282.
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27.
White, H. (2000). A reality check for data snooping. Econometrica, 68, 1097–1126.

DEBRIEFING

Debriefing is the process of giving participants further information about a study in which they participated, at the conclusion of their participation. Debriefing continues the informational process that began at the participant recruitment or informed consent stage. If the true purpose of the study was revealed to participants at the informed consent stage, debriefing is fairly straightforward. Participants are reminded of the purposes of the study, given further information about expected results, and thanked for their participation. The debriefing session also provides an opportunity for participants to ask any questions they may have about the study. In some research situations, participants might be called on to discuss negative emotions or reveal sensitive information (e.g., studies on relationship violence or eating disorders). In such studies, the researcher may include in the debriefing information about ways in which participants might obtain help in dealing with these issues, such as a referral to a campus mental health center. A debriefing script should be included in research proposals submitted to an institutional review board.

If a study includes deception, debriefing is more complex. In such instances, a researcher has concluded that informing participants of the nature of the study at the stage of obtaining consent would interfere with the collection of valid and
generalizable data. In such instances, the researcher may give participants incomplete or misleading information about the nature of the study at the recruitment and consent stages. Other examples of deception in social science research include deceptive instructions, false feedback, or the use of confederates (members of the research team who misrepresent their identities as part of the study procedure).

In a deception study, the debriefing session is the time when a complete explanation of the study is given and the deception is revealed. Participants should be informed of the deception that took place and of the true purpose of the research. The reasons the researcher believed that deception was necessary for the research should also be explained to participants. As in a nondeception study, participants should be thanked for their participation and provided with an opportunity to ask questions of the researcher. Participants should also be reminded of their right to withdraw from the study at any time. This reminder may take a number of forms, ranging from a statement in the debriefing script indicating participants' ability to withdraw, to a second informed consent form for participants to sign after being debriefed.

The Debriefing Process

David Holmes has argued that debriefing should include processes of dehoaxing (if necessary) and desensitizing. Dehoaxing involves informing participants about any deception that was used in the study and explaining the researcher's rationale for the use of deception. Desensitizing involves discussing and attempting to diminish any negative feelings (such as stress or anxiety) that may have arisen as a result of the research process.

Negative feelings may result from the research process for a number of reasons. The purpose of the research may have been to study these feelings, and thus researchers may have deliberately instigated them in participants. For example, researchers interested in the effects of mood on test performance might ask participants to read an upsetting passage before completing a test. Negative feelings may also arise as a consequence of engaging in the behavior that researchers were interested in studying. For example, researchers interested in conformity and compliance to authority may create situations in which participants are expected to engage in behavior with which they are uncomfortable (such as administering supposed electric shocks to a confederate). In such a situation, a researcher might address possible negative feelings by stating that the participant's behavior was not unusual or extreme (by, for example, stating that most other participants have acted the same way). Another approach is to emphasize that the behavior was due to situational factors rather than personal characteristics. Desensitizing may encourage participants to make an external (situational) rather than an internal (personal) attribution for their behavior. Participants may feel angry, foolish, or embarrassed about having been deceived by the researcher. One desensitizing technique applicable to such situations is to point out that negative feelings are a natural and expected outcome of the study situation.

Joan Sieber states that participation in research and postresearch debriefing should provide participants with new insight into the topic of research and a feeling of satisfaction in having made a contribution to society and to scientific understanding. In a deceptive study, Sieber states, participants should receive a number of additional benefits from the debriefing: dehoaxing, desensitizing, an opportunity to ask questions of the researcher, an opportunity to end participation in the study, restoration of confidence in scientific research, and information on the ways in which possible harm has been anticipated and avoided. Sieber also states that the dehoaxing process should include a convincing demonstration of the deception (for example, showing participants two identical completed tasks, one with positive feedback and one with negative feedback).

Types of Debriefing

Several types of debriefing are associated with deception studies. In each type, the researcher describes the deceptive research processes, explains the reasons research is conducted on this topic and why deception was felt necessary to conduct the research, and thanks the participant for his or her assistance in conducting the research.

An explicit or outcome debriefing focuses on revealing the deception included in the study. Explicit debriefing would include a statement
about the deceptive processes. Explicit debriefing might also include a concrete demonstration of the deception, such as demonstrating how feedback was manipulated or introducing the participant to the confederate.

A process debriefing is typically more involved than an explicit debriefing and allows for more opportunities for participants to discuss their feelings about participation and reach their own conclusions regarding the study. A process debriefing might include a discussion of whether the participant found anything unusual about the research situation. The researcher might then introduce information about deceptive elements of the research study, such as false feedback or the use of confederates. Some process debriefings attempt to lead the participant to a realization of the deception on his or her own, before it is explicitly explained by the researcher.

A somewhat less common type of debriefing is an action debriefing, which includes an explicit debriefing along with a reenactment of the study procedure or task.

Ethical Considerations

Ethical considerations with any research project typically include an examination of the predicted costs (e.g., potential harms) and benefits of the study, with the condition that research should not be conducted unless predicted benefits significantly outweigh potential harms. One concern expressed by ethicists is that the individuals who bear the risks of research participation (study participants) are often not the recipients of the study's benefits. Debriefing has the potential to ameliorate costs (by decreasing discomfort and negative emotional reactions) and increase benefits (by giving participants a fuller understanding of the importance of the research question being examined and thus increasing the educational value of participation).

Some experts believe that debriefing cannot be conducted in such a way as to make deception research ethical, because deceptive research practices eliminate the possibility for truly informed consent. Diana Baumrind has argued that debriefing is insufficient to remediate the potential harm caused by deception and that research involving intentional deception is unethical and should not be conducted. Other arguments against debriefing after deception include the potential for debriefing to exacerbate harm by emphasizing the deceptiveness of researchers or for participants not to believe the debriefing, inferring that it is still part of the experimental manipulation.

Other experts have argued that it is possible to conduct deception research ethically but have expressed concerns regarding possible negative outcomes of such research. One such concern regarding deception and debriefing is the perseverance phenomenon, in which participants continue even after debriefing to believe or be affected by false information presented in a study. The most prominent study of the perseverance phenomenon was conducted by Lee Ross, Mark Lepper, and Michael Hubbard, who were interested in adolescents' responses to randomly assigned feedback regarding their performance on a decision-making task. At the end of the study session, participants participated in a debriefing session in which they learned that the feedback they had received was unrelated to their actual performance. Ross and colleagues found that participants' self-views were affected by the feedback even after the debriefing. When, as part of the debriefing process, participants were explicitly told about the perseverance phenomenon, their self-views did not continue to be affected after the debriefing.

Debriefing in Particular Research Contexts

Most of the preceding discussion of debriefing has assumed a study of adult participants in a laboratory setting. Debriefing may also be used in other research contexts, such as Internet research, or with special research populations, such as children or members of stigmatized groups. In Internet research, debriefing is typically presented in the form of a debriefing statement as the final page of the study or as an e-mail sent to participants.

In research with children, informed consent prior to participation is obtained from children's parents or guardians; child participants give their assent as well. In studies involving deception of child participants, parents are typically informed of the true nature of the research at the informed consent stage but are asked not to reveal the nature of the research project to their children prior to participation in the study. After study participation, children participate in a debriefing
session with the researcher (and sometimes with a parent or guardian as well). In this session, the researcher explains the nature of and reasons for the deception in age-appropriate language.

Marion Underwood has advocated for the use of a process debriefing with children. Underwood has also argued that it is important for the deception and debriefing to take place within a larger context of positive interactions. For example, children might engage in an enjoyable play session with a child confederate after being debriefed about the confederate's role in an earlier interaction.

Statements by Professional Organizations

The American Psychological Association's Ethical Principles of Psychologists and Code of Conduct states that debriefing should be an opportunity for participants to receive appropriate information about a study's aims and conclusions and should include correction of any participant misperceptions of which the researchers are aware. The APA's ethics code also states that if information must be withheld for scientific or humanitarian reasons, researchers should take adequate measures to reduce the risk of harm. If researchers become aware of harm to a participant, they should take necessary steps to minimize the harm. The ethics code is available on the APA's Web site.

The Society for Research in Child Development's Ethical Standards for Research with Children state that the researcher should clarify all misconceptions that may have arisen over the course of the study immediately after the data are collected. This ethics code is available on the Society's Web site.

Meagan M. Patterson

See also Ethics in the Research Process; Informed Consent

Further Readings

Baumrind, D. (1985). Research using intentional deception: Ethical issues revisited. American Psychologist, 40, 165–174.
Elms, A. (1982). Keeping deception honest: Justifying conditions for social scientific research stratagems. In T. L. Beauchamp, R. R. Faden, R. J. Wallace Jr., 245). Baltimore: Johns Hopkins University Press.
Fisher, C. B. (2005). Deception research involving children: Ethical practices and paradoxes. Ethics 287.
Holmes, D. S. (1976). Debriefing after psychological experiments: I. Effectiveness of postdeception dehoaxing. American Psychologist, 31, 868–875.
Hurley, J. C., & Underwood, M. K. Children's understanding of their research rights before and after debriefing: Informed assent, confidentiality, and stopping participation. Child Development, 73, 132–143.
Mills, J. (1976). A procedure for explaining experiments involving deception. Personality 13.
Sieber, J. E. (1983). Deception in social research III: The nature and limits of debriefing. IRB: A Review of Human Subjects Research, 5, 1–4.

Websites

American Psychological Association: http://www.apa.org
Society for Research in Child Development's Ethical Standards for Research with Children: http://www.srcd.org

DECISION RULE

In the context of statistical hypothesis testing, decision rule refers to the rule that specifies how to choose between two (or more) competing hypotheses about the observed data. A decision rule specifies the statistical parameter of interest, the test statistic to calculate, and how to use the test statistic to choose among the various hypotheses about the data. More broadly, in the context of statistical decision theory, a decision rule can be thought of as a procedure for making rational choices given uncertain information.

The choice of a decision rule depends, among other things, on the nature of the data, what one needs to decide about the data, and at what level of significance. For instance, decision rules used for normally distributed (or Gaussian) data are generally not appropriate for non-Gaussian data. Similarly, decision rules used for determining the 95% confidence interval of the sample mean will be different from the rules appropriate for binary decisions, such as determining whether the sample mean is greater than a prespecified mean value at
a given significance level. As a practical matter, even for a given decision about a given data set, there is no unique, universally acceptable decision rule but rather many possible principled rules.

There are two main statistical approaches to picking the most appropriate decision rule for a given decision. The classical, or frequentist, approach is the one encountered in most textbooks on statistics and the one used by most researchers in their data analyses. This approach is generally quite adequate for most types of data analysis. The Bayesian approach is still widely considered esoteric, but one that an advanced researcher should become familiar with, as this approach is becoming increasingly common in advanced data analysis and complex decision making.

Decision Rules in Classical Hypothesis Testing

Suppose one needs to decide whether a new brand of bovine growth hormone increases the body weight of cattle beyond the known average value of μ kilograms. The observed data consist of body weight measurements from a sample of cattle treated with the hormone. The default explanation for the data, or the null hypothesis, is that there is no effect: the mean weight of the treated sample is no greater than the nominal mean μ. The alternative hypothesis is that the mean weight of the treated sample is greater than μ.

The decision rule specifies how to decide which of the two hypotheses to accept, given the data. In the present case, one may calculate the t statistic, determine the critical value of t at the desired level of significance (such as .05), and accept the alternative hypothesis if the t value based on the data exceeds the critical value and reject it otherwise. If the sample is sufficiently large and Gaussian, one might use a similar decision rule with a different test statistic, the z score. Alternatively, one may choose between the hypotheses based on the p value rather than the critical value.

Such case-specific variations notwithstanding, what all frequentist decision rules have in common is that they arrive at a decision ultimately by comparing some statistic of the observed data against a theoretical standard, such as the sampling distribution of the statistic, and determine how likely the observed data are under the various competing hypotheses. Conceptual quibbles about this view of probability aside, this approach is entirely adequate for a vast majority of practical purposes in research. But for more complex decisions in which a variety of factors and their attendant uncertainties have to be considered, frequentist decision rules are often too limiting.

Bayesian Decision Rules

Suppose, in the aforementioned example, that the effectiveness of the hormone for various breeds of cattle in the sample, and the relative frequencies of the breeds, is known. How should one use this prior distribution of hormone effectiveness to choose between the two hypotheses? Frequentist decision rules are not well suited to handle such decisions; Bayesian decision rules are.

Essentially, Bayesian decision rules use Bayes's law of conditional probability to compute a posterior distribution based on the observed data and the appropriate prior distribution. In the case of the above example, this amounts to revising one's belief about the body weight of the treated cattle based on the observed data and the prior distribution. The null hypothesis is rejected if the posterior probability is less than the user-defined significance level.

One of the more obvious advantages of Bayesian decision making, in addition to the many subtler ones, is that Bayesian decision rules can be readily elaborated to allow any number of additional considerations underlying a complex decision. For instance, if the larger decision at hand in the above example is whether to market the hormone, one must consider additional factors, such as the projected profits, possible lawsuits, and costs of manufacturing and distribution. Complex decisions of this sort are becoming increasingly common in behavioral, economic, and social research. Bayesian decision rules offer a statistically optimal method for making such decisions.

It should be noted that when only the sample data are considered and all other factors, including prior distributions, are left out, Bayesian decision rules can lead to decisions equivalent to and even identical to the corresponding frequentist rules. This superficial similarity between the two approaches notwithstanding, Bayesian decision rules are not simply a more elaborate version of frequentist rules.
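The two styles of decision rule discussed in this entry can be sketched for the cattle example. In the fragment below, all numbers are invented for illustration; the frequentist branch uses the large-sample z approximation mentioned above, and the Bayesian branch uses a simple normal-normal conjugate update as one possible way of forming the posterior.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical body weights (kg) of hormone-treated cattle; mu0 is the
# known untreated average. All values are made up for illustration.
weights = [510, 512, 498, 515, 520, 505, 509, 507, 511, 503]
mu0 = 500.0
alpha = 0.05

n = len(weights)
xbar = mean(weights)
s = stdev(weights)

# Frequentist rule: one-sided z test (treating n as large enough
# for the normal approximation, purely for illustration).
z = (xbar - mu0) / (s / sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha)     # about 1.645
reject_frequentist = z > z_crit

# Bayesian rule: normal-normal conjugate update with a diffuse prior
# centered on the untreated mean (an assumed prior, not from the entry).
prior_mean, prior_sd = mu0, 20.0
like_var = (s / sqrt(n)) ** 2                # sampling variance of xbar
post_var = 1 / (1 / prior_sd**2 + 1 / like_var)
post_mean = post_var * (prior_mean / prior_sd**2 + xbar / like_var)
# Posterior probability of the null (treated mean <= mu0); reject if small.
p_null = NormalDist(post_mean, sqrt(post_var)).cdf(mu0)
reject_bayesian = p_null < alpha

print(z, reject_frequentist, p_null, reject_bayesian)
```

With these made-up data both rules reject the null; a sharp prior concentrated near or below μ could make the Bayesian rule disagree with the frequentist one, which illustrates the entry's point that the two approaches are not interchangeable.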
The differences between the two approaches are profound and reflect longstanding debates about the nature of probability. For the researcher, on the other hand, the choice between the two approaches should be less a matter of adherence to any given orthodoxy and more about the nature of the decision at hand.

Jay Hegdé

See also Criterion Problem; Critical Difference; Error Rates; Expected Value; Inference: Inductive and Deductive; Mean Comparisons; Parametric Statistics

Further Readings

Bolstad, W. M. (2007). Introduction to Bayesian statistics. Hoboken, NJ: Wiley.
Press, S. J. (2005). Applied multivariate analysis: Using Bayesian and frequentist methods of inference. New York: Dover.
Resnik, M. D. (1987). Choices: An introduction to decision theory. Minneapolis: University of Minnesota Press.

DECLARATION OF HELSINKI

The Declaration of Helsinki is a formal statement of ethical principles published by the World Medical Association (WMA) to guide the protection of human participants in medical research. The Declaration is not a legally binding document but has served as a foundation for national and regional laws governing medical research across the world. Although not without its controversies, the Declaration has served as the standard in medical research ethics since its establishment in 1964.

History and Current Status

Before World War II, no formal international statement of ethical principles to guide research with human participants existed, leaving researchers to rely on organizational, regional, or national policies or their own personal ethical guidelines. After atrocities were found to have been committed by Nazi medical researchers using involuntary, unprotected participants drawn from concentration camps, the 1947 Nuremberg Code was established. This was followed in 1948 by the WMA's Declaration of Geneva, a statement of ethical duties for physicians. Both documents influenced the development of the Declaration of Helsinki, adopted in 1964 by the WMA. The initial Declaration, 11 paragraphs in length, focused on clinical research trials. Notably, it relaxed conditions for consent for participation, changing the Nuremberg requirement that consent is "absolutely essential" to instead urge consent "if at all possible" but to allow for proxy consent, such as from a legal guardian, in some instances.

The Declaration has been revised six times. The first revision, conducted in 1975, expanded the Declaration considerably, nearly doubling its length, increasing its depth, updating its terminology, and adding concepts such as oversight by an independent committee. The second (1983) and third (1989) revisions were comparatively minor, primarily involving clarifications and updates in terminology. The fourth (1996) revision also was minor in scope but notably added a phrase that effectively precluded the use of inert placebos when a particular standard of care exists.

The fifth (2000) revision was extensive and controversial. In the years leading up to the revision, concerns were raised about the apparent use of relaxed ethical standards for clinical trials in developing countries, including the use of placebos in HIV trials conducted in sub-Saharan Africa. Debate ensued about revisions to the Declaration, with some arguing for stronger language and commentary addressing clinical trials and others proposing to limit the document to basic guiding principles. Although consensus was not reached, the WMA approved a revision that restructured the document and expanded its scope. Among the more controversial aspects of the revision was the implication that standards of medical care in developed countries should apply to any research with humans, including that conducted in developing countries. The opposing view held that when risk of harm is low and there are no local standards of care (as is often the case in developing countries), placebo-controlled trials are ethically acceptable, especially given their potential benefits for future patients. Debate has continued on these issues, and cross-national divisions have emerged. The
U.S. Food and Drug Administration rejected the fifth revision because of its restrictions on the use of placebo conditions and has eliminated all references to the Declaration, replacing it with the Good Clinical Practice guidelines, an alternative internationally sanctioned ethics guide. The National Institutes of Health training in research with human participants no longer refers to the Declaration, and the European Commission refers only to the fourth revision.

The sixth revision of the Declaration, approved by the WMA in 2008, introduced relatively minor clarifications. The revision reinforces the Declaration's long-held emphasis on prioritizing the rights of individual research participants above all other interests. Public debate following the revision was not nearly as contentious as had been the case with previous revisions.

Synopsis of the Sixth Revision

The Declaration of Helsinki's sixth revision comprises several sections: the Introduction, Principles for All Medical Research, and Additional Principles for Medical Research Combined With Medical Care. It is 35 paragraphs long.

Introduction

The introduction states that the Declaration is intended for physicians and others who conduct medical research on humans (including human materials or identifiable information). It asserts that the Declaration should be considered as a whole and that its paragraphs should not be considered in isolation but with reference to all pertinent paragraphs. It then outlines general ethical principles that guide research on human participants. These include a reminder of the words from the WMA's Declaration of Geneva that the physician is bound to: "The health of my patient will be my first consideration." This idea is expanded with a statement asserting that when research is being conducted, the welfare of the participants takes precedence over the more general welfare of science, research, and the general population.

The introduction also describes the goals of medical research as improving the prevention, diagnosis, and treatment of disease and increasing the understanding of the etiology of disease. It adds that research also must encourage the protection of the health and rights of people. The introduction then specifically mentions vulnerable populations and calls for extra consideration when these populations are participating in research. The final statement in the Declaration's Introduction asserts that medical researchers are bound by the legal and ethical guidelines of their own nations but that adherence to these laws does not liberate researchers from the edicts of the Declaration of Helsinki.

Principles for All Medical Research

The Principles for All Medical Research include considerations that must be made by researchers who work with human participants. The first assertion in the principles states that a physician's duty is to "protect the life, health, dignity, integrity, right to self-determination, privacy, and confidentiality" of research participants. Consideration of the environment and of the welfare of research animals is also mentioned. Also, the basic principles declare that any research conducted on human participants must be in accordance with generally held scientific principles and be based on as thorough a knowledge of the participant as is possible.

Paragraph 14 of the Declaration states that any study using human participants must be thoroughly outlined in a detailed protocol, and it provides specific guidelines about what should be included in the protocol. The protocol should include numerous types of information, including funding sources, potential conflicts of interest, plans for providing study participants access to interventions that the study identifies as beneficial, and more.

Paragraph 15 states that the above-mentioned protocol must be reviewed by an independent research ethics committee before the study begins. This committee has the right and responsibility to request changes, provide comments and guidance, and monitor ongoing trials. The committee members also have the right and responsibility to consider all information provided in the protocol and to request additional information as deemed appropriate. This principle of the Declaration is what has led to the development of institutional review boards in the United States.
The principles also state that research must be conducted by qualified professionals and that the responsibility for protecting research subjects always falls on the professionals conducting the study and not on the study participants, even though they have consented to participate.

The principles also require an assessment of predictable risks and benefits for both research participants and the scientific community. Risk management must be carefully considered, and the objective of the study must be of enough importance that the potential risks are outweighed. Another statement in the basic principles, paragraph 17, states that research with disadvantaged or vulnerable populations is justified only if it relates to the needs and priorities of the vulnerable community and can be reasonably expected to benefit the population in which the research is conducted. This statement was included, in part, as a response to testing of new prescription drugs in Africa, where the availability of cutting-edge prescription drugs is highly unlikely.

The remainder of the principles section discusses issues of privacy, confidentiality, and informed consent. These discussions stipulate that research should be conducted only with participants who are capable of providing informed consent, unless it is absolutely necessary to do research with participants who cannot give consent. If this is the case, the specific reasons for this necessity must be outlined in the protocol, informed consent must be provided by a legal guardian, and the research participant's assent must be obtained if possible. Participants must be informed of their right to refuse to participate in the study, and special care must be taken when potential participants are under the care of a physician involved in the study in order to avoid dynamics of dependence on the physician or duress to affect decision-making processes.

the ethics of accurate publication of research results. Researchers are responsible for accurate and complete reporting of results and for making their results publicly available, even if the results are negative or inconclusive. The publication should also include funding sources, institutional affiliation, and any conflicts of interest. A final assertion states that research reports that do not meet these standards should not be accepted for publication.

Additional Principles for Medical Research Combined With Medical Care

This section of the Declaration, which was new to the fifth revision in 2000, has created the most controversy. It begins with a statement that extra care must be taken to safeguard the health and rights of patients who are both receiving medical care and participating in research. Paragraph 32 then states that when a new treatment method is being tested, it should be compared with the generally accepted best standard of care, with two exceptions. First, placebo treatment can be used in studies where no scientifically proven intervention exists. This statement was adopted as a response to drug testing that was being conducted in which the control group was given placebos when a scientifically proven drug was available.

The second exception states that placebos or no treatment can be used when "compelling and scientifically sound methodological reasons" exist for using a placebo to determine the efficacy and/or safety of a treatment, and if the recipients of the placebo or no treatment will not suffer irreversible harm. The Declaration then states that "Extreme care must be taken to avoid abuse of this option." This exception was most likely added as a response to the intense criticism of the fifth revision.
Paragraphs 27 through 29 outline guidelines for The adoption of the principle described in para-
research with participants who are deemed graph 32 aimed to prevent research participants’
incompetent to give consent and state that these illnesses from progressing or being transmitted to
subjects can be included in research only if the others because of a lack of drug treatment when
subject can be expected to benefit or if the fol- a scientifically proven treatment existed. Critics of
lowing conditions apply: A population that the this assertion stated that placebo treatment was
participant represents is likely to benefit, the consistent with the standard of care in the regions
research cannot be performed on competent per- where the drug testing was taking place and that
sons, and potential risk and burden are minimal. administration of placebos to control groups is
The final paragraph of the principles addresses often necessary to determine the efficacy of
Future

The Declaration of Helsinki remains the world's best-known statement of ethical principles to guide medical research with human participants. Its influence is far-reaching in that it has been codified into the laws that govern medical research in countries across the world and has served as a basis for the development of other international guidelines governing medical research with human participants. As the Declaration has expanded and become more prescriptive, it has become more controversial, and concerns have been raised regarding the future of the Declaration and its authority. Future revisions to the Declaration may reconsider the utility of prescriptive guidelines rather than limiting its focus to basic principles. Another challenge will be to harmonize the Declaration with other ethical research guidelines, because there often is apparent conflict between aspects of current codes and directives documents.

Bryan J. Dik and Timothy J. Doenges

See also Ethics in the Research Process

DEGREES OF FREEDOM

In statistics, the degrees of freedom is a measure of the level of precision required to estimate a parameter (i.e., a quantity representing some aspect of the population). It expresses the number of independent factors on which the parameter estimation is based and is often a function of sample size. In general, the number of degrees of freedom increases with increasing sample size and with decreasing number of estimated parameters. The quantity is commonly abbreviated df or denoted by the lowercase Greek letter nu, ν.

For a set of observations, the degrees of freedom is the minimum number of independent values required to resolve the entire data set. It is equal to the number of independent observations being used to determine the estimate (n) minus the number of parameters being estimated in the approximation of the parameter itself, as determined by the statistical procedure under
consideration. In other words, a mathematical restraint is used to compensate for estimating one parameter from other estimated parameters. For a single sample, one parameter is estimated. Often the population mean (μ), a frequently unknown value, is based on the sample mean (x̄), thereby resulting in n − 1 degrees of freedom for estimating population variability. For two samples, two parameters are estimated from two independent samples (n1 and n2), thus producing n1 + n2 − 2 degrees of freedom. In simple linear regression, the relationship between two variables, x and y, is described by the equation y = bx + a, where b is the slope of the line and a is the y-intercept (i.e., where the line crosses the y-axis). In estimating a and b to determine the relationship between the independent variable x and dependent variable y, 2 degrees of freedom are then lost. For multiple sample groups (n1 + ... + nk), the number of parameters estimated increases by k, and subsequently, the degrees of freedom is equal to n1 + ... + nk − k. The denominator in the analysis of variance (ANOVA) F test statistic, for example, accounts for estimating multiple population means for each group under comparison.

The concept of degrees of freedom is fundamental to understanding the estimation of population parameters (e.g., the mean) based on information obtained from a sample. The amount of information used to make a population estimate can vary considerably as a function of sample size. For instance, the standard deviation (a measure of variability) of a population estimated on a sample size of 100 is based on 10 times more information than is one based on a sample size of 10. The use of large amounts of independent information (i.e., a large sample size) to make an estimate of the population usually means that the likelihood that the sample estimates are truly representative of the entire population is greater. This is the meaning behind the number of degrees of freedom. The larger the degrees of freedom, the greater the confidence the researcher can have that the statistics gained from the sample accurately describe the population.

To demonstrate this concept, consider a sample data set of the following observations (n = 5): 1, 2, 3, 4, and 5. The sample mean (the sum of the observations divided by the number of observations) equals 3, and the deviations about the mean are −2, −1, 0, +1, and +2, respectively. Since the sum of the deviations about the mean is equal to zero, at least four deviations are needed to determine the fifth; hence, one deviation is fixed and cannot vary. The number of values that are free to vary is the degrees of freedom. In this example, the number of degrees of freedom is equal to 4; this is based on five data observations (n) minus one estimated parameter (i.e., using the sample mean to estimate the population mean). Generally stated, the degrees of freedom for a single sample are equal to n − 1, given that if n − 1 observations and the sample mean are known, the remaining nth observation can be determined.

Degrees of freedom are also often used to describe assorted data distributions in comparison with a normal distribution. Used as the basis for statistical inference and sampling theory, the normal distribution describes a data set characterized by a bell-shaped probability density function that is symmetric about the mean. The chi-square distribution, applied usually to test differences among proportions, is positively skewed with a mean defined by a single parameter, the degrees of freedom. The larger the degrees of freedom, the more the chi-square distribution approximates a normal distribution. Also based on the degrees of freedom parameter, the Student's t distribution is similar to the normal distribution, but with more probability allocated to the tails of the curve and less to the peak. The largest difference between the t distribution and the normal occurs for degrees of freedom less than about 30. For tests that compare the variance of two or more populations (e.g., ANOVA), the positively skewed F distribution is defined by the number of degrees of freedom for the various samples under comparison.

Additionally, George Ferguson and Yoshio Takane have offered a geometric interpretation of degrees of freedom whereby restrictions placed on the statistical calculations are related to a point–space configuration. Each point within a space of d dimensions has a freedom of movement, or variability within those d dimensions, that is equal to d; hence, d is the number of degrees of freedom. For instance, a data point on a one-dimensional line has one degree of movement (and one degree of freedom), whereas a data point in three-dimensional space has three.

Jill S. M. Coleman
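The fixed-deviation argument in the worked example can be verified numerically. A minimal sketch in Python, using the five observations from the example (the code itself is illustrative, not part of the entry):

```python
data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n                      # 3.0, estimated from the sample itself

deviations = [x - mean for x in data]     # [-2.0, -1.0, 0.0, 1.0, 2.0]
assert sum(deviations) == 0               # deviations about the mean sum to zero

# Because of that constraint, any four deviations determine the fifth:
fifth = -sum(deviations[:4])
assert fifth == deviations[4]

df = n - 1                                # one parameter (the mean) was estimated
print(df)                                 # prints 4
```

The same bookkeeping generalizes: estimating one parameter per sample yields n1 + n2 − 2 degrees of freedom for two samples, and n1 + ... + nk − k for k groups.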
See also Analysis of Variance (ANOVA); Chi-Square Test; Distribution; F Test; Normal Distribution; Parameters; Population; Sample Size; Student's t Test; Variance

Further Readings

Ferguson, G. A., & Takane, Y. (1989). Statistical analysis in psychology and education (6th ed.). New York: McGraw-Hill.
Good, I. J. (1973). What are degrees of freedom? American Statistician, 27, 227–228.
Lomax, R. G. (2001). An introduction to statistical concepts for education and the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum.

DELPHI TECHNIQUE

exclusive tool of investigation in a research or an evaluation project is not uncommon. This entry examines the Delphi process, including subject selection and analysis of data. It also discusses the advantages and disadvantages of the Delphi technique, along with the use of electronic technologies in facilitating implementation.

The Delphi Process

The Delphi technique is characterized by multiple iterations, or "rounds," of inquiry. The iterations constitute a series of feedback processes. Because of the iterative character of the Delphi technique, instrument development, data collection, and questionnaire administration are interconnected between rounds. As such, following the more or less linear steps of the Delphi process is important to success with this technique.
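The iterative, feedback-driven structure just described can be sketched as a simple loop. This is a toy illustration only: the function names, the mean-rating summary, and the max-minus-min stopping rule are invented for the sketch, not prescriptions from the Delphi literature.

```python
def summarize(responses):
    # Toy feedback for the panel: the mean rating of each questionnaire item.
    items = responses[0].keys()
    return {i: sum(r[i] for r in responses) / len(responses) for i in items}

def has_converged(responses, max_spread=1):
    # Toy stopping rule: every item's ratings fall within max_spread points.
    return all(
        max(r[i] for r in responses) - min(r[i] for r in responses) <= max_spread
        for i in responses[0]
    )

def run_delphi(rounds_of_responses):
    """Each round's responses are summarized and fed back before the next
    round; iteration stops once the panel's ratings converge."""
    feedback = None
    for round_no, responses in enumerate(rounds_of_responses, start=1):
        feedback = summarize(responses)
        if has_converged(responses):
            return round_no, feedback
    return round_no, feedback

# Hypothetical panel of three experts converging over three rounds:
rounds = [
    [{"item": 2}, {"item": 6}, {"item": 4}],   # Round 1: wide disagreement
    [{"item": 4}, {"item": 5}, {"item": 4}],   # Round 2: ratings converge
    [{"item": 4}, {"item": 4}, {"item": 4}],
]
stopped_at, final_feedback = run_delphi(rounds)
print(stopped_at)                              # prints 2
```

The point of the sketch is the data flow between rounds: each summary becomes the feedback that shapes the next questionnaire.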
and fair disclosure of what each participant thinks or believes is important concerning the issue being investigated, as well as providing participants an opportunity to share their expertise, which is a principal reason for their selection to participate in the study.

Round 3

In Round 3, Delphi participants receive a third questionnaire that consists of the statements and ratings summarized by the investigators after the preceding round. Participants are again asked to revise their judgments and to express the rationale for their priorities. This round provides participants an opportunity to make further clarifications and to review previous judgments and inputs from the prior round. Researchers have indicated that three rounds are often sufficient to gather the needed information and that further iterations would merely generate slight differences.

Round 4

When necessary, in the fourth and often final round, participants are again asked to review the summary statements from the preceding round and to provide inputs and justifications. It is imperative to note that the number of Delphi iterations depends largely on the degree of consensus sought by the investigators and thereby can vary from three to five. In other words, a general consensus about a noncritical topic may require only three iterations, whereas a serious issue of critical importance with a need for a high level of agreement among the participants may require additional iterations. Regardless of the number of iterations, it must be remembered that the purpose of the Delphi is to sort through the ideas, impressions, opinions, and expertise of the participants to arrive at the core or salient information that best describes, informs, or predicts the topic of concern.

Subject Selection

The proper use of the Delphi technique and the subsequent dependability of the generated data rely in large part on eliciting expert opinions. Therefore, the selection of appropriate participants is considered the most important step in Delphi. The quality of the results links directly to the quality of the participants involved.

Delphi participants should be highly trained and possess expertise associated with the target issues. Investigators must rigorously consider and examine the qualifications of Delphi subjects. In general, possible Delphi subjects are likely to be positional leaders, authors discovered from a review of professional publications concerning the topic, and people who have firsthand relationships with the target issue. The latter group often consists of individuals whose opinions are sought because their direct experience makes them a reliable source of information.

In Delphi, the number of participants is generally between 15 and 20. However, what constitutes an ideal number of participants in a Delphi study has never achieved a consensus in the literature. Andre Delbecq, Andrew Van de Ven, and David Gustafson suggest that 10 to 15 participants should be adequate if their backgrounds are similar. In contrast, if a wide variety of people or groups or a wide divergence of opinions on the topic is deemed necessary, more participants need to be involved. The number of participants in Delphi is variable, but if the number is too small, the participants may be unable to reliably provide a representative pooling of judgments concerning the target issue. Conversely, if the number is too large, the shortcomings inherent in the Delphi technique (difficulty dedicating large blocks of time, low response rates) may take effect.

Analysis of Data

In Delphi, decision rules must be established to assemble, analyze, and summarize the judgments and insights offered by the participants. Consensus on a topic can be determined if the returned responses on that specific topic reach a prescribed or a priori range. In situations in which rating or rank ordering is used to codify and classify data, the definition of consensus has been at the discretion of the investigator(s). One example of consensus from the literature is having 80% of subjects' votes fall within two categories on a 7-point scale.

The Delphi technique can employ and collect both qualitative and quantitative information.
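The 80%-within-two-categories rule mentioned above is straightforward to operationalize. A sketch in Python (the function name and the sample votes are invented for illustration):

```python
from collections import Counter

def reaches_consensus(votes, threshold=0.80, categories=2):
    # Example decision rule from the Delphi literature: at least `threshold`
    # of the votes must fall within `categories` categories of the scale.
    counts = Counter(votes)
    in_top = sum(n for _, n in counts.most_common(categories))
    return in_top / len(votes) >= threshold

ratings = [6, 7, 7, 6, 7, 6, 6, 5, 7, 6]   # hypothetical 7-point ratings, 10 experts
print(reaches_consensus(ratings))          # prints True: 9 of 10 votes sit in 6 and 7
```

Any other a priori rule (e.g., an interquartile-range cutoff) slots into the same place; the essential point is that the rule is fixed before the data are examined.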
Investigators must analyze qualitative data if, as with many conventional Delphi studies, open-ended questions are used to solicit participants' opinions in the first round. It is recommended that a team of researchers and/or experts with knowledge of both the target issues and instrument development analyze the written comments. Statistical analysis is performed in the further iterations to identify statements that achieve the desired level of consensus. Measures of central tendency (mean, mode, and median) and of dispersion (standard deviation and interquartile range) are the major statistics used to report findings in the Delphi technique. The specific statistics used depend on the definition of consensus set by the investigators.

Advantages of Using the Delphi Technique

Several components of the Delphi technique make it suitable for evaluation and research problems. First, the technique allows investigators to gather subjective judgments from experts on problems or issues for which no previously researched or documented information is available. Second, the multiple iterations allow participants time to reflect and an opportunity to modify their responses in subsequent iterations. Third, Delphi encourages innovative thinking, particularly when a study attempts to forecast future possibilities. Last, participant anonymity minimizes the disadvantages often associated with group processes (e.g., the bandwagon effect) and frees subjects from pressure to conform. As a group communication process, the technique can serve as a means of gaining insightful inputs from experts without the requirement of face-to-face interactions. Additionally, confidentiality is enhanced by the geographic dispersion of the participants, as well as by the use of electronic devices such as e-mail to solicit and exchange information.

Limitations of the Delphi Technique

Several limitations are associated with Delphi. First, a Delphi study can be time-consuming. Investigators need to ensure that participants respond in a timely fashion because each round rests on the results of the preceding round.

Second, low response rates can jeopardize robust feedback. Delphi investigators need both a high response rate in the first iteration and a desirable response rate in the following rounds. Investigators need to play an active role in helping to motivate participants, thus ensuring as high a response rate as possible.

Third, the process of editing and summarizing participants' feedback allows investigators to impose their own views, which may affect participants' responses in later rounds. Therefore, Delphi investigators must exercise caution and implement appropriate safeguards to prevent the introduction of bias.

Fourth, an assumption regarding Delphi participants is that their knowledge, expertise, and experience are equivalent. This assumption can hardly be justified. It is likely that the knowledge bases of Delphi participants are unevenly distributed. Although some panelists may have much more in-depth knowledge of a specific, narrowly defined topic, other panelists may be more knowledgeable about a wide range of topics. A consequence of this disparity may be that participants who do not possess in-depth information may be unable to interpret or evaluate the most important statements identified by Delphi participants who have in-depth knowledge. The outcome of such a Delphi study could be a series of general statements rather than an in-depth exposition of the topic.

Computer-Assisted Delphi Process

The prevalence and application of electronic technologies can facilitate the implementation of the Delphi process. The advantages of computer-assisted Delphi include participant anonymity, reduced time required for questionnaire and feedback delivery, readability of participant responses, and the easy accessibility provided by Internet connections.

If an e-mail version of the questionnaires is to be used, investigators must ensure that e-mail addresses are correct, contact invited participants beforehand, ask their permission to send materials via e-mail, and inform the recipients of the nature of the research so that they will not delete future e-mail contacts. With regard to the purchase of a survey service, the degree of flexibility in questionnaire templates and software and service costs
may be the primary considerations. Also, Delphi participants need timely instructions for accessing the designated link and any other pertinent information.

Chia-Chien Hsu and Brian A. Sandford

DEMOGRAPHICS

The term derives from the Greek words for people (demos) and picture (graphy). Examples of demographic characteristics include age, race, gender, ethnicity, religion, income, education, home ownership, sexual orientation, marital status, family size, health and disability status, and psychiatric diagnosis.

It is generally agreed and advisable that demographic information should be collected on the basis of participant report and not as an observation of the researcher. In the case of race, for example, it is not uncommon for someone whom a researcher may classify as Black to self-identify as White or biracial.

Selection of Demographic Information to Be Collected

Researchers should collect only the demographic information that is necessary for the specific purposes of the research. To do so, in the planning stage researchers will need to identify demographic information that is vital in the description of participants as well as in data analysis, and also information that will enhance interpretation of the results. For example, in a study of maternal employment and children's achievement, Wendy Goldberg and colleagues found that the demographic variables of children's age and family structure were significant moderators of the results. Thus, the inclusion of particular demographic information can be critical for an accurate understanding of the data.

See also Dependent Variable; Independent Variable

Further Readings

Goldberg, W. A., Prause, J., Lucas-Thompson, R., & Himsel, A. (2008). Maternal employment and children's achievement in context: A meta-analysis of four decades of research. Psychological Bulletin, 134, 77–108.
Hart, D., Atkins, R., & Matsuba, M. K. (2008). The association of neighborhood poverty with personality change in childhood. Journal of Personality & Social Psychology, 94, 1048–1061.

DEPENDENT VARIABLE

A dependent variable, also called an outcome variable, is the result of the action of one or more independent variables. It can also be defined as any outcome variable associated with some measure, such as a survey. Before an example is provided, the relationship between the two (in an experimental setting) might be expressed as follows:

DV = f(IV1 + IV2 + IV3 + ... + IVk),
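Read literally, the expression says the dependent variable is whatever some function f returns for the independent variables taken together. A toy sketch (the functional form and values are hypothetical, purely for illustration):

```python
# DV = f(IV1 + IV2 + ... + IVk): the dependent variable is the value of
# some function f applied to the combined independent variables.
def f(iv_total):
    return 2 * iv_total + 1    # an arbitrary, made-up functional form

ivs = [3, 1, 4]                # hypothetical levels of IV1, IV2, IV3
dv = f(sum(ivs))
print(dv)                      # prints 17
```

In a real study, of course, f is unknown; the researcher manipulates or measures the independent variables and observes how the dependent variable responds.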
these two groups. Although several limitations should be considered, results of this study indicate that practitioners can be relatively confident in using the HINT to screen infants of both origins for developmental delays. [Mayson, T. A., Backman, C. L., Harris, S., & Hayes, V. E. (2009). Motor development in Canadian infants of Asian and European ethnic origins. Journal of Early Intervention, 31(3), 199–214.]

In this study, the dependent variable is motor development as measured by the Harris Infant Neuromotor Test (HINT), and the independent variable is ethnic origin (with the two categorical levels of Asian origin and European origin). In this quasi-experimental study (since participants are preassigned), scores on the HINT are a function of ethnic origin.

In the following example, the dependent variable is a score on a survey reflecting how well survey participants believe that their students are prepared for professional work. Additional analyses looked at group differences in program length, but the outcome survey values illustrate what is meant in this context by a dependent variable.

This article presents results from a survey of faculty members from 2- and 4-year higher education programs in nine states that prepare teachers to work with preschool children. The purpose of the study was to determine how professors address content related to social-emotional development and challenging behaviors, how well prepared they believe graduates are to address these issues, and resources that might be useful to better prepare graduates to work with children with challenging behavior. Of the 225 surveys that were mailed, 70% were returned. Faculty members reported their graduates were prepared on topics such as working with families, preventive practices, and supporting social-emotional development but less prepared to work with children with challenging behaviors. Survey findings are discussed related to differences between 2- and 4-year programs and between programs with and without a special education component. Implications for personnel preparation and future research are discussed. [Hemmeter, M. L., Milagros Santos, R. M., & Ostrosky, M. M. (2008). Preparing early childhood educators to address young children's social-emotional development and challenging behavior. Journal of Early Intervention, 30(4), 321–340.]

Neil J. Salkind

See also Control Variables; Dichotomous Variable; Independent Variable; Meta-Analysis; Nuisance Variable; Random Variable; Research Hypothesis

Further Readings

Luft, H. S. (2004). Focusing on the dependent variable: Comments on "Opportunities and challenges for measuring cost, quality, and clinical effectiveness in health care," by Paul A. Fishman, Mark C. Hornbrook, Richard T. Meenan, and Michael J. Goodman. Medical Care Research and Review, 61, 144S–150S.
Sechrest, L. (1982). Program evaluation: The independent and dependent variables. Counseling Psychologist, 10, 73–74.

DESCRIPTIVE DISCRIMINANT ANALYSIS

Discriminant analysis comprises two approaches to analyzing group data: descriptive discriminant analysis (DDA) and predictive discriminant analysis (PDA). Both use continuous (or intervally scaled) data to analyze the characteristics of group membership. However, PDA uses these continuous data to predict group membership (i.e., How accurately can a classification rule classify the current sample into groups?), whereas DDA attempts to discover which continuous variables contribute to the separation of groups (i.e., Which of these variables contribute to group differences, and by how much?). In addition to the primary goal of discriminating among groups, DDA can examine the most parsimonious way to discriminate between groups, investigate the amount of variance accounted for by the discriminant variables, and evaluate the relative contribution of each discriminant (continuous) variable in classifying the groups.

For example, a psychologist may be interested in which psychological variables are most responsible for men's and women's progress in therapy.
For this purpose, the psychologist could collect data on therapeutic alliance, resistance, transference, and cognitive distortion in a group of 50 men and 50 women who report progressing well in therapy. DDA can be useful in understanding which of the four variables (therapeutic alliance, resistance, transference, and cognitive distortion) contribute to the differentiation of the two groups (men and women). For instance, men may be low on therapeutic alliance and high on resistance. On the other hand, women may be high on therapeutic alliance and low on transference. In this example, the remaining variable, cognitive distortion, may not be shown to be relevant to group differentiation at all because it does not capture much difference among the groups. In other words, cognitive distortion is unrelated to how men and women progress in therapy. This is just a brief example of the utility of DDA in differentiating among groups.

DDA is a multivariate technique with goals similar to those of multivariate analysis of variance (MANOVA) and is computationally identical to MANOVA. As such, all assumptions of MANOVA apply to the procedure of DDA. However, MANOVA can determine only whether groups are different, not how they are different. In order to determine how groups differ using MANOVA, researchers typically follow the MANOVA procedure with a series of analyses of variance (ANOVAs). This is problematic because ANOVAs are univariate tests. As such, several ANOVAs may need to be conducted, increasing the researcher's likelihood of committing a Type I error (finding a statistically significant result that is not really there). What is more, what makes multivariate statistics desirable in social science research is the inherent assumption that human behavior has multiple causes and effects that exist simultaneously. Conducting a series of univariate ANOVAs strips away the richness that multivariate analysis reveals because ANOVA analyzes data as if differences among groups occur in a vacuum, with no interaction among variables. Consider the earlier example. A series of ANOVAs would assume that as men and women progress through therapy, there is no potential shared variance between the variables therapeutic alliance, resistance, transference, and cognitive distortion. And while MANOVA does account for this shared variance, it cannot tell the researcher how or where the differences come from.

This entry first describes discriminant functions and their statistical significance. Next, it explains the assumptions that need to be met for DDA. Finally, it discusses the computation and interpretation of DDA.

Discriminant Functions

A discriminant function (also called a canonical discriminant function) is a weighted linear combination of discriminant variables, which can be written as

D = a + b1x1 + b2x2 + ... + bnxn + c,   (1)

where D is the discriminant score, a is the intercept, the bs are the discriminant coefficients, the xs are discriminant variables, and c is a constant. The discriminant coefficients are similar to beta weights in multiple regression and maximize the distance across the means of the grouping variable. The number of discriminant functions in DDA is k − 1, where k is the number of groups or categories in the grouping variable, or the number of discriminant variables, whichever is less. For example, in the example of men's and women's treatment progress, the number of discriminant functions will be one because there are two groups and four discriminant variables; that is, of min(1, 4), 1 is less than 4. In DDA, discriminant variables are optimally combined so that the first discriminant function provides the best discrimination across groups, the second function the second best, and so on until all possible dimensions are assessed. These functions are orthogonal, or independent from one another, so that there will be no shared variance among them (i.e., no overlap of contribution to the differentiation of groups). The first discriminant function will represent the most prevailing discriminating dimension, and later functions may also denote other important dimensions of discrimination.

The statistical significance of each discriminant function should be tested prior to a further evaluation of the function. Wilks's lambda is used to examine the statistical significance of functions. Wilks's lambda varies from 0 through 1, with 1 denoting groups that have the same mean
discriminant function scores and 0 denoting those that have different mean scores. In other words, the smaller the value of Wilks's lambda, the more likely it is statistically significant and the better it differentiates between the groups. Wilks's lambda is the ratio of within-group variance to the total variance on the discriminant variables and indicates the proportion of the total variance that is not accounted for by differences among groups. A small lambda indicates that the groups are well discriminated. In addition, 1 − Wilks's lambda is used as a measure of effect size to assess the practical significance of discriminant functions as well as the statistical significance.

Assumptions

DDA requires seven assumptions to be met. First, DDA requires two or more mutually exclusive groups, which are formed by the grouping variable, with each case belonging to only one group. It is best practice for groups to be truly categorical in nature. For example, sex, ethnic group, and state of residence are all categorical. Sometimes researchers force groups out of otherwise continuous data, for example, people aged 15 to 20 or incomes between $15,000 and $20,000. However, whenever possible, it is best to preserve continuous data where they exist and to use categorical data as the grouping variables in DDA. The second assumption states that there must be at least two cases for each group.

The other five assumptions are related to the discriminant variables in discriminant functions, as explained in the previous section. The third

multicollinearity assumption in multiple regression. If a discriminant variable is very highly correlated with another discriminant variable (e.g., r > .90), the variance–covariance matrix of the discriminant variables cannot be inverted; the matrix is then called ill-conditioned. Sixth, discriminant variables must follow the multivariate normal distribution, meaning that a discriminant variable should be normally distributed about fixed values of all the other discriminant variables. K. V. Mardia has provided measures of multivariate skewness and kurtosis, which can be computed to assess whether the combined distribution of the discriminant variables is multivariate normal. Also, multivariate normality can be graphically evaluated. Seventh, DDA assumes that the variance–covariance matrices of the discriminant variables are homogeneous across groups. This assumption is intended to make sure that the compared groups are from the same population. If this assumption is met, any differences in a DDA analysis can be attributed to the discriminant variables, not to the compared groups. This assumption is analogous to the homogeneity of variance assumption in ANOVA. The multivariate Box's M test can be used to determine whether the data satisfy this assumption. Box's M test examines the null hypothesis that the variance–covariance matrices are not different across the groups compared. If the test is significant (e.g., the p value is lower than .05), the null hypothesis can be rejected, indicating that the matrices are different across the groups. However, it is known that Box's M test is very sensitive to even small differences in variance–covariance matrices when the sample size is large. Also, because it is known that
assumption states that any number of discriminant DDA is robust against violation of this assumption,
variables can be included in DDA as long as the the p value typically is set at a much lower level,
number of discriminant variables is less than the such as .001. Furthermore, it is recognized that
sample size of the smallest group. However, it is DDA is robust with regard to violation of the
generally recommended that the sample size be assumption of multivariate normality.
between 10 and 20 times the number of discrimi- When data do not satisfy some of the assump-
nant variables. If the sample is too small, the reli- tions of DDA, logistic regression can be used as an
ability of a DDA will be lower than desired. On alternative. Logistic regression can answer the
the other hand, if the sample size is too large, sta- same kind of questions DDA answers. Also, it is
tistical tests will turn out significant even for small a very flexible method in that it can handle both
differences. Fourth, the discriminant variables categorical and interval variables as discriminant
should be interval, or at least ordinal. Fifth, the variables and data under analysis do not need to
discriminant variables are not completely redun- meet assumptions of multivariate normality and
dant or highly correlated with each other. This equal variance–covariance matrices. It is also
assumption is identical to the absence of perfect robust to unequal group size.
…centroids. The nature of the discrimination for each discriminant function can be examined by looking at different locations of centroids. For example, a certain group that has the highest and lowest values of centroids on a discriminant function will be best discriminated on that function.

Seong-Hyeon Kim and Alissa Sherry

See also Multivariate Analysis of Variance (MANOVA)

Further Readings

Brown, M. T., & Wicker, L. R. (2000). Discriminant analysis. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of multivariate statistics and mathematical modeling (pp. 209–235). San Diego, CA: Academic Press.
Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley.
Sherry, A. (2006). Discriminant analysis in counseling psychology research. The Counseling Psychologist, 5, 661–683.

DESCRIPTIVE STATISTICS

Descriptive statistics are commonly encountered, relatively simple, and for the most part easily understood. Most of the statistics encountered in daily life, in newspapers and magazines, in television, radio, and Internet news reports, and so forth, are descriptive in nature rather than inferential. Compared with the logic of inferential statistics, most descriptive statistics are somewhat intuitive. Typically the first five or six chapters of an introductory statistics text consist of descriptive statistics (means, medians, variances, standard deviations, correlation coefficients, etc.), followed in the later chapters by the more complex rationale and methods for statistical inference (probability theory, sampling theory, t and z tests, analysis of variance, etc.).

Descriptive statistical methods are also foundational in the sense that inferential methods are conceptually dependent on them and use them as their building blocks. One must, for example, understand the concept of variance before learning how analysis of variance or t tests are used for statistical inference. One must understand the descriptive correlation coefficient before learning how to use regression or multiple regression inferentially. Descriptive statistics are also complementary to inferential ones in analytical practice. Even when the analysis draws its main conclusions from an inferential analysis, descriptive statistics are usually presented as supporting information to give the reader an overall sense of the direction and meaning of significant results.

Although most of the descriptive building blocks of statistics are relatively simple, some descriptive methods are high level and complex. Consider multivariate descriptive methods, that is, statistical methods involving multiple dependent variables, such as factor analysis, principal components analysis, cluster analysis, canonical correlation, or discriminant analysis. Although each represents a fairly high level of quantitative sophistication, each is primarily descriptive. In the hands of a skilled analyst, each can provide invaluable information about the holistic patterns in data. For the most part, each of these high-level multivariate descriptive statistical methods can be matched to a corresponding inferential multivariate statistical method to provide both a description of the data from a sample and inferences to the population; however, only the descriptive methods are discussed here.

The topic of descriptive statistics is therefore a very broad one, ranging from the simple first concepts in statistics to the higher reaches of data structure explored through complex multivariate methods. The topic also includes graphical data presentation, exploratory data analysis (EDA) methods, effect size computations and meta-analysis methods, esoteric models in mathematical psychology that are highly useful in basic science experimental psychology areas (such as psychophysics), and high-level multivariate graphical data exploration methods.

Graphics and EDA

Graphics are among the most powerful types of descriptive statistical devices and often appear as complementary presentations even in primarily inferential data analyses. Graphics are also highly useful in the exploratory phase of research, forming an essential part of the approach known as EDA.
Figure 1 Scatterplots of Six Different Bivariate Data Configurations That All Have the Same Pearson Product-Moment Correlation Coefficient

[Figure 2, not reproduced here, plots the same yearly series twice: on the left with a restricted y-axis (264–276) over the years 2007–2009, and on the right with a full y-axis (0–300) over the years 2004–2010.]
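The visual effect of the two axis ranges used in Figure 2 can be quantified directly. In this sketch the two bar values (276 and 266) are hypothetical readings within the plotted range:

```python
def apparent_ratio(v1, v2, axis_min):
    """Ratio of the drawn bar heights when the axis starts at axis_min."""
    return (v1 - axis_min) / (v2 - axis_min)

# Hypothetical bar values within Figure 2's plotted range
full = apparent_ratio(276, 266, 0)          # axis from 0: bars look nearly equal
restricted = apparent_ratio(276, 266, 264)  # axis from 264: one bar looks 6x taller
```

The same 10-point difference thus appears either as a roughly 4% difference or as a sixfold one, depending only on where the axis begins.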
…to be a strongly negative trend is more reasonably attributed to random fluctuation.

Although the distortion just described is intentional, similar distortions are common through oversight. In fact, if one enters the numerical values from Figure 2 into a spreadsheet program and creates a bar graph, the default graph employs the restricted range shown in the left-hand figure. It requires a special effort to present the data accurately. Graphics can be highly illuminating, but the caveat is that one must use care to ensure that they are not misleading. The popularity of Huff's book indicates that he has hit a nerve in questioning the veracity of much statistical presentation.

The work of Edward Tufte is also well known in the statistical community, primarily for his compelling and impressive examples of best practices in the visual display of quantitative information. Although he is best known for his examples of good graphics, he is also adept at identifying a number of the worst practices, such as what he calls "chartjunk," the misleading use of rectangular areas in picture charts, and the often mindless use of PowerPoint in academic presentations.

EDA

Graphics have formed the basis of one of the major statistical developments of the past 50 years: EDA. John Tukey is responsible for much of this development, with his highly creative graphical methods, such as the stem-and-leaf plot and the box-and-whisker plot, as shown on the left and right, respectively, in Figure 3. The stem-and-leaf (with the 10s-digit stems on the left of the line and the units-digit leaves on the right) has the advantage of being both a table and a graph. The overall shape of the stem-and-leaf plot in Figure 3 shows the positive skew in the distribution, while the precise value of each data point is preserved by numerical entries. The box-and-whisker plots similarly show the overall shape of a distribution (the two in Figure 3 having opposite skew) while identifying the summary descriptive statistics with great precision. The box-and-whisker plot can also be effectively combined with other graphs (such as attached to the x- and the y-axes of a bivariate scatterplot) to provide a high level of convergent information. These methods and a whole host of other illuminating graphical displays (such as run charts, Pareto charts, histograms, MultiVari charts, and many varieties of scatterplots) have become the major tools of data exploration.

Tukey suggested that data be considered decomposable into rough and smooth elements (data = rough + smooth). In a bivariate relationship, for example, the regression line could be considered the smooth component, and the deviations from regression the rough
component. Obviously, a description of the smooth component is of value, but one can also learn much from a graphical presentation of the rough.

Tukey contrasted EDA with confirmatory data analysis (the testing of hypotheses) and saw each as having its place, much like descriptive and inferential statistics. He referred to EDA as a reliance on display and an attitude.

Figure 3 Stem-and-Leaf Plot (Left); Two Box-and-Whisker Plots (Right)

3 | 5789
3 | 0022234
2 | 566667899
2 | 122344
1 | 578
1 | 0234
0 | 569
0 | 334

Note: The box-and-whisker plots consist of a rectangle extending from the 1st quartile to the 3rd quartile, a line across the rectangle (at the 2nd quartile, or median), and "whisker" lines at each end marking the minimum and maximum values.

The Power of Graphicity

Although graphical presentations can easily go astray, they have much potential explanatory power and exploratory power, and some of the best of the available descriptive quantitative tools are in fact graphical in nature. Indeed, it has been persuasively argued, and some evidence has been given, that the use of graphs in publications both within psychology and across other disciplines correlates highly with the "hardness" of those scientific fields. Conversely, an inverse relation is found between hardness of subareas of psychology and the use of inferential statistics and data tables, indicating that the positive correlation of graphicity with hardness is not due to quantification and that perhaps inferential methods are often used in an attempt to deal with inadequate data.

The available selection of graphical descriptive statistical tools is obviously broad and varied. It includes simple graphical inscription devices—such things as bar graphs, line graphs, histograms, scatterplots, box-and-whisker plots, and stem-and-leaf plots, as just discussed—and also high-level ones. Over the past century a number of highly sophisticated multidimensional graphical methods have been devised. These include principal components plots, multidimensional scaling plots, cluster analysis dendrograms, Chernoff faces, Andrews plots, time series profile plots, and generalized draftsman's displays (also called multiple scatterplots), to name a few.

Hans Rosling, a physician with broad interests, has created a convincing demonstration of the immense explanatory power of so simple a graph as a scatterplot, using it to tell the story of economic prosperity and health in the development of the nations of the world over the past two centuries. His lively narration of the presentations accounts for some of their impact, but such data stories can be clearly told with, for example, a time-series scatterplot of balloons (the diameter of each representing the population size of a particular nation) floating in a bivariate space of fertility rate (x-axis) and life expectancy (y-axis). The time-series transformations of this picture play like a movie, with labels of successive years ("1962," "1963," etc.) flashing in the background.

Effect Size, Meta-Analysis, and Accumulative Data Description

Effect size statistics are essentially descriptive in nature. They have evolved in response to a logical gap in established inferential statistical methods. Many have observed that the alternative hypothesis is virtually always supported if the sample size is large enough and that many published and statistically significant results do not necessarily represent strong relationships. To correct this somewhat misleading practice, William L. Hays, in his 1963 textbook, introduced methods for calculating effect size.

In the 30 years that followed, Jacob Cohen took the lead in developing procedures for effect size estimation and power analysis. His work in turn
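A stem-and-leaf display of the kind shown in Figure 3 can be generated in a few lines. A minimal sketch for nonnegative two-digit integers (one variant of the display, with ascending stems):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines: tens-digit stem, then the unit leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return ["{} | {}".format(s, "".join(str(d) for d in stems[s]))
            for s in sorted(stems)]

lines = stem_and_leaf([3, 4, 15, 17, 22, 26, 28, 31])
```

Because every leaf digit is kept, the display is simultaneously a table of the raw values and a sideways histogram of their distribution.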
led to the development of meta-analysis as an important area of research—comparisons of the effect sizes from many studies, both to properly estimate a summary effect size value and to assess and correct bias in accumulated work. That is, even though the effect size statistic itself is descriptive, inferential data-combining methods have been developed to estimate effect sizes on a population level.

Another aspect of this development is that the recommendation that effect sizes be reported has begun to take on a kind of ethical force in contemporary psychology. In 1996, the American Psychological Association Board of Scientific Affairs appointed a task force on statistical inference. Its report recommended including effect size when reporting a p value, noting that reporting and analyzing effect size is imperative to good research.

Multivariate Statistics and Graphics

Many of the commonly used multivariate statistical methods, such as principal components analysis, some types of factor analysis, cluster analysis, multidimensional scaling, discriminant analysis, and canonical correlation, are essentially descriptive in nature. Each is conceptually complex, useful in a practical sense, and mathematically interesting. They provide clear examples of the farther reaches of sophistication within the realm of descriptive statistics.

Principal Components Analysis and Factor Analysis

Factor analysis has developed within the discipline of psychology over the past century in close concert with psychometrics and the mental testing movement, and it continues to be central to psychometric methodology. Factor analysis is in fact not one method but a family of methods (including principal components) that share a common core. The various methods range from entirely descriptive (principal components, and also factor analysis by the principal components method) to inferential (common factors method, and also maximum likelihood method). Factors are extracted by the
Figure 4 Cluster Analysis Dendrogram of the Log-Frequencies of the 100 Most Frequent Male Names in the United States in the 19th Century

[The dendrogram itself is not reproduced here. Its y-axis is labeled Height (0–4), its leaves are the 100 names, and its footer reads dist(names100), hclust (*, "complete"), indicating complete-linkage clustering of the distance matrix in R.]
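The footer of Figure 4 indicates complete-linkage agglomerative clustering (R's hclust applied to a distance matrix). The same idea can be sketched in a few lines of Python; this illustrative version merges clusters until k remain rather than drawing a dendrogram:

```python
def complete_linkage_clusters(points, k):
    """Agglomerative clustering with complete linkage: repeatedly merge
    the two clusters whose farthest members are closest, until k remain."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def linkage(c1, c2):  # complete linkage: farthest pair across clusters
        return max(dist(a, b) for a in c1 for b in c2)

    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Two obvious groups of points merge into two clusters
clusters = complete_linkage_clusters(
    [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```

Complete linkage judges cluster distance by the farthest pair of members, which tends to produce compact, similarly sized groups, the behavior visible in a dendrogram such as Figure 4.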
maximum likelihood method to account for as much variance as possible in the population correlation matrix. Principal components and the principal components method of factor analysis are most often employed, for descriptive ends, in creating multivariate graphics.
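Descriptive principal components can be computed directly from a correlation matrix. A minimal NumPy sketch, illustrative only:

```python
import numpy as np

def principal_components(X):
    """Eigendecomposition of the correlation matrix of X (columns are
    variables); returns eigenvalues and loadings, largest component first."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Two perfectly correlated variables: the first component captures all
# the variance (eigenvalue 2 of the 2 x 2 correlation matrix).
X = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
vals, vecs = principal_components(X)
```

Each eigenvalue divided by the number of variables gives the proportion of total standardized variance that the component describes, which is the purely descriptive use of the method discussed here.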
Figure 5 Semantic Space of Male Names Defined by the Vectors for the 10 Decades of the 19th Century

[Figure 5's six panels, not reproduced here, each plot log name frequency against the decades 1801–1810 through 1891–1900, with a legend of the names in that cluster. The panel titles are: Cluster 1: High Frequency, log3 Across the Century; Cluster 2: Zero to log2, Emerges in 5th to 8th Decade; Cluster 3: Zero to log2, Emerges in 2nd to 5th Decade; Cluster 4: Medium Frequency, log2 Across the Century; Cluster 5: Drop From log2 to log1 Across the Century; Cluster 6: Low Frequency, log1.3 Across the Century.]
Other Multivariate Methods

A number of other multivariate methods are also primarily descriptive in their focus and can be effectively used to create multivariate graphics, as several examples will illustrate. Cluster analysis is a method for finding natural groupings of objects within a multivariate space. It creates a graphical representation of its own, the dendrogram, but it can also be used to group points within a scatterplot. Discriminant analysis can be used graphically in essentially the same way as factor analysis and principal components, except that the factors are derived to maximally separate known groups rather than to maximize variance. Canonical correlation can be thought of as a double factor analysis in which the factors from an X set of variables are calculated to maximize their correlation with corresponding factors from a Y set of variables. As such, it can form the basis for multivariate graphical devices for comparing entire sets of variables.

Multivariate Graphics

A simple example, a 100 × 10 matrix of name frequencies, illustrates several multivariate graphs. The matrix holds the log-frequency of names for each decade (columns) for each of the top 100 male names (rows). Figure 4 is a cluster analysis dendrogram of these names, revealing six clusters. Each of the six clusters is shown as a profile plot in Figure 5, with a collage of line plots tracing the trajectory of each name within the cluster. Clearly, the cluster analysis separates the groups well. Figure 6 is a vector plot from a factor analysis, in which two factors account for 93.3% of the variance in the name frequency pattern for ten decades. The vectors for the first three or four decades of the century are essentially vertical, with the remaining decades fanning out sequentially to the right and with the final decade flat horizontally to the right. A scatterplot (not shown here) of the 100 names within this same two-factor space reveals that the names group well within the six clusters, with virtually no overlap among clusters.

Figure 6 Location of the 100 Most Frequent 19th Century Male Names Within the Semantic Space of 10 Decades

[Figure 6's vector plot is not reproduced here. Its annotations note that the decade vectors through about the 1840s are tightly clustered, while those for the last five decades are more spread out.]

Bruce L. Brown

See also Bar Chart; Box-and-Whisker Plot; Effect Size, Measures of; Exploratory Data Analysis; Exploratory Factor Analysis; Mean; Median; Meta-Analysis; Mode; Pearson Product-Moment Correlation Coefficient; Residual Plot; Scatterplot; Standard Deviation; Variance

Further Readings

Brown, B. L. (in press). Multivariate analysis for the biobehavioral and social sciences. New York: Wiley.
Cudeck, R., & MacCallum, R. C. (2007). Preface. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions. Mahwah, NJ: Lawrence Erlbaum.
Huff, D. (1954). How to lie with statistics. New York: Norton.
Kline, P. (1993). The handbook of psychological testing. London: Routledge.
Smith, L. D., Best, I. A., Stubbs, A., Archibald, A. B., & Roberson-Nay, R. (2002). Constructing knowledge:
The role of graphs and tables in hard and soft psychology. American Psychologist, 57(10), 749–761.
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
Tukey, J. W. (1980). We need both exploratory and confirmatory. American Statistician, 34(1), 23–25.

Dichotomous Variable 359

…events. The recoding of a variable with a range of values into a dichotomous variable may be done intentionally for a particular analysis, with the original values and range of the variable maintained in the data set for further analysis.

Implications for Statistical Analysis

The role of the dichotomous variable within the research design (i.e., as an independent or dependent variable), as well as the nature of the sample distribution (i.e., normally or nonlinearly distributed), influences the type of statistical analyses that should be used.

Dichotomous Variables as Independent Variables

In the prototypical experimental or quasi-experimental design, the dependent variable represents behavior that researchers measure. Depending on the research design, a variety of statistical procedures (e.g., correlation, linear regression, and analyses of variance) can explore the relationship between a particular dependent variable (e.g., school achievement) and a dichotomous variable (e.g., the participant's sex or participation in a particular enrichment program). How the dichotomous variable is accounted for (e.g., controlled for, blocked) will be dictated by the particular type of analysis implemented.

Constructed Dichotomous Variables

Dichotomous variables may be constructed on the basis of conceptual rationalizations regarding the variables themselves or on the basis of the distribution of the variables in a particular study. Values may be inherent in the variable (e.g., true/false or male/female) or may be assigned randomly by the researcher to address a range of research issues (e.g., sought treatment vs. did not seek treatment, or sought treatment between zero and two times vs. sought treatment three or more times). How a dichotomous variable is conceptualized and constructed within a research design (i.e., as an independent or a dependent variable) will affect the type of analyses appropriate to interpret the variable and describe, explain, or predict based in part on the role of the dichotomous variable. Because of the arbitrary nature of the value assigned to the dichotomous variable, it is imperative to consult the descriptive statistics of a study in order to fully interpret findings.

Mona M. Abo-Zena

See also Analysis of Variance (ANOVA); Correlation; Covariate; Logistic Regression; Multiple Regression

Further Readings

Budescu, D. V. (1985). Analysis of dichotomous variables in the presence of serial dependence. Psychological Bulletin, 73(3), 547–561.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage.
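Recoding a continuous variable into a constructed dichotomy, while keeping the original values available for further analysis, can be as simple as the following sketch (the income values and the $20,000 cutoff are hypothetical):

```python
# Hypothetical incomes; the original values stay in the data set
incomes = [12000, 15000, 18000, 21000, 35000]

# Constructed dichotomy: 1 = income of $20,000 or more, else 0
high_income = [1 if x >= 20000 else 0 for x in incomes]
```

Because the cutoff is arbitrary, reporting the descriptive statistics of the original variable alongside the dichotomy is what lets a reader interpret results built on it.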
…measures something in addition to the latent trait that is differentially related to the group variable. This is shown in Figure 3.

Figure 3 An Illustration of Differential Item Functioning

[The diagram is not reproduced here; it depicts Item i as related both to the latent trait and, differentially, to group membership.]

The remainder of this entry defines DIF and its related terminology, describes the use of DIF for polytomous outcomes, and discusses the assessment and measurement of DIF.

Definition of DIF

DIF is one way to consider the different impact an item may have on various subpopulations. One could consider DIF as the statistical manifestation of bias, but not the social aspect. An item is said to show DIF when subjects from two subpopulations have different expected scores on the same item after controlling for ability. Using item response theory (IRT) terminology, if a non-DIF item has the same item response function between groups, then subjects having the same ability would have equal probability of answering the item correctly.

Terminology

DIF analysis typically compares two groups: a focal group and a reference group. The focal group is defined as the main group of interest, whereas the reference group is a group used for comparison purposes. The statistical methodology of DIF assumes that one controls for the trait or ability levels between these two groups. Most research uses the term ability level for either ability levels or trait levels, even though in specific situations one term might be more precise. The ability level is used to match subjects from the two groups so that the effect of ability is controlled. Thus, by controlling for ability level, one may detect group differences that are not confounded by ability. This ability level is aptly referred to as the matching criterion.

The matching criterion might be one of many different indices of interest, yet typically the total test performance or some estimate of trait levels (as in the case of attitudinal measures) is used. In some instances, an external measure might be used as the matching criterion if it can be shown that the measure is appropriate to account for the ability levels of the groups of interest. In addressing the issue of using test scores as the matching criterion, the matching
Differential Item Functioning 363
[Figures 4 and 5, not reproduced here, each plot item characteristic curves (probability of a correct response against theta, from −3 to 3) for Group A and Group B. Figure 4 shows two noncrossing curves (uniform DIF); Figure 5 shows crossing curves (nonuniform DIF).]
criterion should be free of DIF items. This can be problematic in the typical case in which the items undergoing DIF analysis are the very items that form the matching criterion. In such a situation, the matching criterion should undergo a "purification" process, in which a preliminary DIF analysis is performed to rid the matching criterion of any DIF items.

The phrase uniform DIF refers to a type of DIF in which the magnitude of the group difference is the same across ability levels. Using IRT ideas, uniform DIF occurs when there is no interaction between group and item characteristic curves, as represented in Figure 4. In contrast, the phrase nonuniform DIF refers to a type of DIF in which the magnitude of the group difference is not consistent across ability levels. From an IRT perspective, nonuniform DIF would result in crossing item characteristic curves. This is illustrated in Figure 5. Nonuniform DIF can be thought of as an interaction effect between the group and the ability level.

DIF for Polytomous Outcomes

Although traditional DIF procedures involve dichotomously scored items, DIF can also be considered for polytomously scored data (e.g., Likert-type scales). Polytomously scored data have the additional consideration that subjects can respond to or be labeled with more than two categories on a given item. For dichotomous data, the consideration of DIF is simpler, as there are only two outcomes. But for polytomous outcomes, there is a possibility of inner-response DIF (IDIF). That is, there is the possibility that DIF does not exist uniformly across all response categories but may exist for certain responses within that item. Figure 6 illustrates an example in which a particular 4-point Likert-type item displays DIF on lower ordinal responses but not on higher ordinal responses. This type of DIF can be referred to as lower IDIF. This can exist, as an illustration, when the focal group tends to differentially vary in successfully scoring lower ordinal scores on an attitudinal measurement as compared to a reference group, while both groups have similar success in upper ordinal scoring categories.

Figure 7 illustrates a balanced IDIF, in which the nature of the DIF changes for both extreme ordinal responses. In this example, there is potential bias against women on the lower ordinal responses and potential bias against men on the upper responses. Other types of IDIF patterns are possible. For example, upper IDIF would indicate potential bias on the upper ordinal responses, while consistent IDIF would indicate that the DIF effect is approximately the same for all ordinal responses. Patterns in IDIF are not always present, however. In some situations, IDIF may be present only between certain ordinal responses and not others, with no discernible pattern.
Figure 6 An Illustration of Lower IDIF for Polytomous Outcomes

Figure 7 An Illustration of Balanced IDIF for Polytomous Outcomes

[Neither figure is reproduced here; each plots category response curves for a Likert-type item, with "Agree" and "Disagree" among the labeled response categories.]
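The uniform versus nonuniform distinction described above can be illustrated with a two-parameter logistic (2PL) item response function. This is only a sketch with hypothetical item parameters, not a DIF detection procedure:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(correct response) given ability
    theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-2, -1, 0, 1, 2]

# Uniform DIF: same discrimination, shifted difficulty; the reference
# group's curve stays above the focal group's at every ability level.
uniform_gap = [icc_2pl(t, 1.0, 0.0) - icc_2pl(t, 1.0, 0.5) for t in thetas]

# Nonuniform DIF: different discriminations; the curves cross, so the
# sign of the gap changes across the ability range (a group x ability
# interaction).
nonuniform_gap = [icc_2pl(t, 0.8, 0.0) - icc_2pl(t, 1.5, 0.0) for t in thetas]
```

In the uniform case the gap keeps one sign across theta; in the nonuniform case it changes sign, which corresponds to the crossing item characteristic curves of Figure 5.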
See also Item Analysis; Item Response Theory

Further Readings

Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu, HI.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Kamata, A., & Binici, S. (2003). Random-effect DIF analysis via hierarchical generalized linear models. Paper presented at the annual meeting of the Psychometric Society, Sardinia, Italy.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Williams, V. S. L. (1997). The "unbiased" anchor: Bridging the gap between DIF and item bias. Applied Measurement in Education, 10, 253–267.

DIRECTIONAL HYPOTHESIS

A directional hypothesis is a prediction made by a researcher regarding a positive or negative change, relationship, or difference between two variables of a population. This prediction is typically based on past research, accepted theory, extensive experience, or literature on the topic. Key words that distinguish a directional hypothesis are higher, lower, more, less, increase, decrease, positive, and negative. A researcher typically develops a directional hypothesis from research questions and uses statistical methods to check the validity of the hypothesis.

Examples of Directional Hypotheses

A general format of a directional hypothesis would be the following: For (Population A), (Independent Variable 1) will be higher than (Independent Variable 2) in terms of (Dependent Variable). For example, "For ninth graders in Central High School, test scores of Group 1 will be higher than test scores of Group 2 in terms of Group 1 receiving a specified treatment." The following are other examples of directional hypotheses:

• There is a positive relationship between the number of books read by children and the children's scores on a reading test.
• Teenagers who attend tutoring sessions will make higher achievement test scores than comparable teenagers who do not attend tutoring sessions.

Nondirectional and Null Hypotheses

In order to fully understand a directional hypothesis, there must also be a clear understanding of a nondirectional hypothesis and a null hypothesis.

Nondirectional Hypothesis

A nondirectional hypothesis differs from a directional hypothesis in that it predicts a change, relationship, or difference between two variables but does not specifically designate the change, relationship, or difference as being positive or negative. Another difference is the type of statistical test that is used. An example of a nondirectional hypothesis would be the following: For (Population A), there will be a difference between (Independent Variable 1) and (Independent Variable 2) in terms of (Dependent Variable 1). The following are other examples of nondirectional hypotheses:

• There is a relationship between the number of books read by children and the children's scores on a reading test.
• Teenagers who attend tutoring sessions will have achievement test scores that are significantly different from the scores of comparable teenagers who do not attend tutoring sessions.

Null Hypothesis

Statistical tests are not designed to test a directional hypothesis or a nondirectional hypothesis, but rather a null hypothesis. A null hypothesis is a prediction that there will be no change, relationship, or difference between two variables. A null hypothesis is designated by H0. An example of a null hypothesis would be the following: For (Population A), (Independent Variable 1) will not be different from (Independent Variable 2) in terms of (Dependent Variable). The following are other examples of null hypotheses:
• There is no relationship between the number of books read by children and the children's scores on a reading test.
• Teenagers who attend tutoring sessions will make achievement test scores that are equivalent to those of comparable teenagers who do not attend tutoring sessions.

Statistical Testing of Directional Hypothesis

A researcher starting with a directional hypothesis will have to develop a null hypothesis for the purpose of running statistical tests. The null hypothesis predicts that there will not be a change or relationship between variables of the two groups or populations. The null hypothesis is designated by H0, and a null hypothesis statement could be written as H0 : μ1 = μ2 (Population or Group 1 equals Population or Group 2 in terms of the dependent variable). A directional hypothesis or nondirectional hypothesis would then be considered to be an alternative hypothesis to the null hypothesis and would be designated as H1. Since the directional hypothesis predicts a direction of change or difference, it is designated as H1 : μ1 > μ2 or H1 : μ1 < μ2 (Population or Group 1 is greater than or less than Population or Group 2 in terms of the dependent variable). In the case of a nondirectional hypothesis, there would be no specified direction, and it could be designated as H1 : μ1 ≠ μ2 (Population or Group 1 does not equal Population or Group 2 in terms of the dependent variable).

When one is performing a statistical test for significance, the null hypothesis is tested to determine whether there is any significant amount of change, difference, or relationship between the two variables. Before the test is administered, the researcher chooses a significance level, known as an alpha level, designated by α. In studies of education, the alpha level is often set at .05, or α = .05. A statistical test of the appropriate variable will then produce a p value, which can be understood as the probability that a value as large as or larger than the statistical value produced by the statistical test would have been found by chance if the null hypothesis were true. The p value must be smaller than the predetermined alpha level to be considered statistically significant. If no significance is found, then the null hypothesis is accepted. If there is a significant amount of change according to the p value between two variables which cannot be explained by chance, then the null hypothesis is rejected, and the alternative hypothesis is accepted, whether it is a directional or a nondirectional hypothesis.

The type of alternative hypothesis, directional or nondirectional, makes a considerable difference in the type of significance test that is run. A nondirectional hypothesis is used when a two-tailed test of significance is run, and a directional hypothesis when a one-tailed test of significance is run. The reason for the different types of testing becomes apparent when examining a graph of a normalized curve, as shown in Figure 1.

Figure 1   A normalized curve showing the regions corresponding to H0 and H1 for two-tailed and one-tailed tests of significance (e.g., H1 : μ1 > μ2)
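As a rough numerical sketch of the one-tailed versus two-tailed logic described above (the z statistic of 1.80 and the α = .05 level are hypothetical illustrations, not values taken from this entry):

```python
# One-tailed vs. two-tailed p values for a z statistic (hypothetical numbers).
import math

def p_values(z):
    """Return (one_tailed, two_tailed) p values for an observed z statistic."""
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z), for H1: mu1 > mu2
    two_tailed = 2 * one_tailed                     # P(|Z| >= z), for H1: mu1 != mu2
    return one_tailed, two_tailed

alpha = 0.05            # significance level chosen before the test
z = 1.80                # hypothetical test statistic
p_one, p_two = p_values(z)
print(round(p_one, 4), round(p_two, 4))  # 0.0359 0.0719
print(p_one < alpha)    # one-tailed test: statistically significant (True)
print(p_two < alpha)    # two-tailed test: not significant (False)
```

Because the two-tailed p value is twice the one-tailed value for the same statistic, a result can be significant under a directional (one-tailed) hypothesis yet nonsignificant under a nondirectional (two-tailed) one.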
The nondirectional hypothesis, since it predicts that the change can be greater or lesser than the null value, requires a two-tailed test of significance. On the other hand, the directional hypothesis in Figure 1 predicts that there will be a significant change greater than the null value; therefore, the negative area of significance of the curve is not considered. A one-tailed test of significance is then used to test a directional hypothesis.

Summary Examples of Hypothesis Type

The following is a back-to-back example of the directional, nondirectional, and null hypothesis. In reading professional articles and test hypotheses, one can determine the type of hypothesis as an exercise to reinforce basic knowledge of research.

Directional Hypothesis: Women will have higher scores than men will on Hudson's self-esteem scale.

Nondirectional Hypothesis: There will be a difference by gender in Hudson's self-esteem scale scores.

Null Hypothesis: There will be no difference between men's scores and women's scores on Hudson's self-esteem scale.

Ernest W. Brewer and Stephen Stockton

See also Alternative Hypotheses; Nondirectional Hypotheses; Null Hypothesis; One-Tailed Test; p Value; Research Question; Two-Tailed Test

Further Readings

… applications (9th ed.). Upper Saddle River, NJ: Pearson Education.
Moore, D. S., & McCabe, G. P. (1993). Introduction to the practice of statistics (2nd ed.). New York: W. H. Freeman.
Patten, M. L. (1997). Understanding research methods: An overview of the essentials. Los Angeles: Pyrczak.

DISCOURSE ANALYSIS

Discourse is a broadly used and abstract term that is used to refer to a range of topics in various disciplines. For the sake of this discussion, discourse analysis is used to describe a number of approaches to analyzing written and spoken language use beyond the technical pieces of language, such as words and sentences. Therefore, discourse analysis focuses on the use of language within a social context. Embedded in the constructivism–structuralism traditions, discourse analysis's key emphasis is on the use of language in social context. Language in this case refers to either text or talk, and context refers to the social situation or forum in which the text or talk occurs. Language and context are the two essential elements that help distinguish the two major approaches employed by discourse analysts. This entry discusses the background and major approaches of discourse analysis and frameworks associated with sociopolitical discourse analysis.
and professional sciences can be found in psychology, sociology, cultural studies, and linguistics. The tradition of discourse analysis is often listed under interpretive qualitative methods and is categorized by Thomas A. Schwandt with hermeneutics and social construction under the constructivist paradigm. Jaber F. Gubrium and James A. Holstein place phenomenology in the same vein as naturalistic inquiry and ethnomethodology. The strong influence of the German and French philosophical traditions in psychology, sociology, and linguistics has made this a common method in the social and applied and professional sciences. Paradigmatically, discourse analysis assumes that there are multiple constructed realities and that the goal of researchers working within this perspective is to understand the interplay between language and social context. Discourse analysis is hermeneutic and phenomenological in nature, emphasizing the lifeworld and meaning making through the use of language. This method typically involves an analytical process of deconstructing and critiquing language use and the social context of language usage.

Two Major Approaches

Discourse analysis can be divided into two major approaches: language-in-use (or socially situated text and talk) and sociopolitical. The language-in-use approach is concerned with the micro dimensions of language, grammatical structures, and how these features interplay within a social context. Language-in-use discourse analysis focuses on the rules and conventions of talk and text within a certain context. This approach emphasizes various aspects of language within social context. Language-in-use methodologists focus on language and the interplay between language and social context. Language-in-use is often found in the disciplines of linguistics and literature studies and is rarely used in the social and human sciences.

The second major approach, sociopolitical, is the focus of the rest of this entry because it is most commonly used within the social and human sciences. This approach is concerned with how language forms and influences the social context. Sociopolitical discourse analysis focuses on the social construction of discursive practices that maintain the social context. This approach emphasizes social context as influenced by language. Sociopolitical methodologists focus on social context and the interplay between social context and language. This approach is most often found in the social and professional and applied sciences, where researchers using sociopolitical discourse analysis often employ one of two specific frameworks: Foucauldian discourse analysis and critical discourse analysis (CDA).

Sociopolitical Discourse Analysis Frameworks

Foucauldian Discourse Analysis

Michel Foucault is often identified as the key figure in moving discourse analysis beyond linguistics and into the social sciences. The works of Foucault emphasize the sociopolitical approach to discourse analysis. Foucault emphasizes the role of discourse as power, which shifted the way discourse is critically analyzed. Foucault initially identified the concept of archeology as his methodology for analyzing discourse. Archeology is the investigation of unconsciously organized artifacts of ideas. It is a challenge to the present-day conception of history, which is a history of ideas. Archeology is not interested in establishing a timeline or Hegelian principles of history as progressive. One who applies archeology is interested in discourses, not as signs of a truth, but as the discursive practices that construct objects of knowledge. Archeology identifies how discourses of knowledge objects, separated from a historical-linear progressive structure, are formed. Therefore, archeology becomes the method of investigation, contradictory to the history of ideas, used when looking at an object of knowledge; archeology locates the artifacts that are associated with the discourses that form objects of knowledge. Archeology is the how of Foucauldian discourse analysis of the formation of an object of knowledge. Archeology consists of three key elements: delimitation of authority (who gets to speak about the object of knowledge?), surface of emergence (when does discourse about an object of knowledge begin?),
and grids of specification (how the object of knowledge is described, defined, and labeled). However, Foucault's archeology then suggests a power struggle within the emergence of one or more discourses, via the identification of authorities of delimitation. Archeology's target is to deconstruct the history of ideas. The only way to fully deconstruct the history of an idea is to critique these issues of power. Hence, the creation of genealogy, which allows for this critique of power, with use of archeology, becomes the method of analysis for Foucault. Foucault had to create a concept like genealogy, since archeology's implied power dynamic and hints of a critique of power are in a form of hidden power. The term genealogy refers to the power relations rooted in the construction of a discourse. Genealogy focuses on the emergence of a discourse and identifies where power and politics surface in the discourse. Genealogy refers to the union of erudite knowledge and local memories, which allows us to establish a historical knowledge of struggles and to make use of this knowledge tactically today. Genealogy focuses on local, discontinuous, disqualified, illegitimate knowledge, opposed to the assertions of the tyranny of totalizing discourses. Genealogy becomes the way we analyze the power that exists in the subjugated discourses that we find through the use of archeology. So genealogy is the exploration of the power that develops the discourse, which constructs an object of knowledge. The three key elements of genealogy include subjugated discourses (whose voices were minimized or hidden in the formation of the object of knowledge?), local beliefs and understandings (how is the object of knowledge perceived in the social context?), and conflict and power relations (where are the discursive disruptions and the enactments of power in the discourse?). Archeology suggests that there is a type of objectivity that indicates a positivistic concept of neutrality to be maintained when analyzing data. While genealogy has suggestions of subjectivity, localisms, and critique, much like postmodernist or critical theory, archeology focuses on how discourses form an object of knowledge. Genealogy becomes focused on why certain discourses are dominant in constructing an object of knowledge. Therefore, archeology is the method of data collection, and genealogy is the critical analysis of the data. These two concepts are not fully distinguishable, and a genealogy as Foucault defines it cannot exist without the method of archeology. Foucault's work is the foundation of much of the sociopolitical discourse analysis used in contemporary social and applied and professional sciences. Many discourse studies cite Foucault as a methodological influence or use specific techniques or strategies employed by Foucault.

CDA

CDA builds on the critique of power highlighted by Foucault and takes it a step further. Teun A. van Dijk has suggested that the central focus of CDA is the role of discourse in the (re)production and challenge of dominance. CDA's emphasis on the role of discourse in dominance specifically refers to social power enacted by elites and institutions' social and political inequality through discursive forms. The production and (re)production of discursive formation of power may come in various forms of discourse and power relations, both subtle and obvious. Therefore, critical discourse analysts focus on social structures and discursive strategies that play a role in the (re)production of power. CDA's critical perspective is influenced not only by the work of Foucault but also by the philosophical traditions of critical theorists, specifically Jürgen Habermas.

Norman Fairclough has stated that discourse is shaped and constrained by social structure and culture. Therefore he proposes three central tenets of CDA: social structure (class, social status, age, ethnic identity, and gender); culture (accepted norms and behaviors of a society); and discourse (the words and language we use). Discourse shapes our role and engagement with power within a social structure. When looking at discourse, CDA emphasizes three levels of analysis: the text, the discursive practice, and the sociocultural practice. The text is a record of a communicated event that reproduces social power. Discursive practices are ways of being in the world that signify accepted social roles and identities. Finally, the sociocultural comprises the distinct context
where discourse occurs. The CDA approach attempts to link text and talk with the underlying power structures in society at a sociopolitical level through discursive practices. Text and talk are the description of communication that occurs within a social context that is loaded with power dynamics and structured rules and practices of power enactment. When text is not critically analyzed, oppressive discursive practices, such as marginalization and oppression, are taken as accepted norms. Therefore, CDA is intended to shine a light on such oppressive discursive practices. Discourse always involves power, and the role of power in a social context is connected to the past and the current context, and can be interpreted differently by different people due to various personal backgrounds, knowledge, and power positions. Therefore there is not one correct interpretation, but a range of appropriate and possible interpretations. The correct critique of power is not the vital point of CDA; rather, the process of critique and its ability to raise consciousness about power in social context is the foundation of CDA.

Bart Miles

Further Readings

Fairclough, N. (2000). Language and power (2nd ed.). New York: Longman.
Foucault, M. (1972). The archaeology of knowledge (A. M. Sheridan Smith, Trans.). London: Tavistock.
Schwandt, T. (2007). Judging interpretations. New Directions for Evaluation, 114, 11–15.
Stevenson, C. (2004). Theoretical and methodological approaches in discourse analysis. Nurse Researcher, 12(2), 17–29.
Taylor, S. (2001). Locating and conducting discourse analytic research. In M. Wetherell, S. Taylor, & S. J. Yates (Eds.), Discourse as data: A guide for analysis (pp. 5–48). London: Sage.
van Dijk, T. A. (1999). Critical discourse analysis and conversation analysis. Discourse & Society, 10(4), 459–460.

DISCRIMINANT ANALYSIS

Discriminant analysis is a multivariate statistical technique that can be used to predict group membership from a set of predictor variables. The goal of discriminant analysis is to find optimal combinations of predictor variables, called discriminant functions, to maximally separate previously defined groups and make the best possible predictions about group membership. Discriminant analysis has become a valuable tool in the social sciences, as discriminant functions provide a means to classify a case into the group that it most closely resembles and help investigators understand the nature of differences between groups. For example, a college admissions officer might be interested in predicting whether an applicant, if admitted, is more likely to succeed (graduate from the college) or fail (drop out or fail) based on a set of predictor variables such as high school grade point average, scores on the Scholastic Aptitude Test, age, and so forth. A sample of students whose college outcomes are known can be used to create a discriminant function by finding a linear combination of predictor variables that best separates Groups 1 (students who succeed) and 2 (students who fail). This discriminant function can be used to predict the college outcome of a new applicant whose actual group membership is unknown. In addition, discriminant functions can be used to study the nature of group differences by examining which predictor variables best predict group membership. For example, which variables are the most powerful predictors of group membership? Or what pattern of scores on the predictor variables best describes the differences between groups? This entry discusses the data considerations involved in discriminant analysis, the derivation and interpretation of discriminant functions, and the process of classifying a case into a group.

Data Considerations of Discriminant Analysis

First of all, the predictor variables used to create discriminant functions must be measured at the interval or ratio level of measurement. The shape of the distribution of each predictor variable should correspond to a univariate normal distribution. That is, the frequency distribution of each predictor variable should be approximately bell shaped. In addition, multivariate normality of predictor variables is assumed in testing the significance of discriminant functions and calculating probabilities of group membership. The assumption of multivariate normality is met when each
transformed to a statistic that has a chi-square distribution, its statistical significance can be tested. A significant Wilks's lambda indicates that the group means calculated from the discriminant analysis are significantly different and therefore the discriminant function works well in discriminating among groups.

Interpreting Discriminant Functions

If a discriminant function is found to be significant, one might be interested in discovering how groups are separated along the discriminant function and which predictor variables are most useful in separating groups. To visually inspect how well groups are spaced out along a discriminant function, individual discriminant scores and group centroids can be plotted along the axes formed by discriminant functions. The mean of discriminant scores within a group is known as the group centroid. If the centroids of two groups are well separated and there is no obvious overlap of the individual cases along a discriminant function, then the discriminant function separates the two groups well. If group centroids are close to each other and individual cases overlap a great deal, the discriminant function fails to provide a clear separation of the two groups. When there are only one or two significant discriminant functions, the location of group centroids and data cases can be easily plotted. However, when there are more than two discriminant functions, it will be visually difficult to locate the group centroids and data cases. Therefore, only pairwise plots of discriminant functions are used. However, the plot based on the first two discriminant functions is expected to be most informative because these two functions are the most powerful discriminators of groups.

The relative importance of the contribution of each variable to the separation of the groups can be evaluated by examining the discriminant coefficients. At this stage, however, the discriminant function can be considered an unstandardized equation in the sense that raw scores on predictor variables are used to produce discriminant scores. Although the magnitudes of unstandardized coefficients indicate the absolute contribution of a predictor variable in determining the discriminant score, this information might be misleading when one attempts to evaluate the relative importance of predictor variables. This is because when standard deviations are not the same across predictor variables, a one-unit change in the value of a variable varies from one variable to another. Therefore, standardized discriminant coefficients are needed. Standardized discriminant coefficients indicate the relative importance of measured variables in calculating discriminant scores. Standardized discriminant coefficients involve adjusting the unstandardized discriminant coefficients by the variance of the raw scores on each predictor variable. Standardized discriminant coefficients would be obtained if the original data were converted to standard form, in which each variable has a mean of zero and a standard deviation of one, and then used to optimize the discriminant coefficients. However, the standardized discriminant coefficients can be derived from unstandardized coefficients directly by the following formula:

sik = dik √( SSk / (N − g) ),

where sik and dik are the standardized and unstandardized coefficients for predictor variable k on discriminant function i, respectively; SSk is the sum of squares associated with variable k; N is the total number of cases; and g is the number of groups. The predictor variable associated with the largest standardized coefficient (in absolute value) contributes most to determining scores on the discriminant function and therefore plays the most important role in separating groups. It should be noted that when predictor variables are correlated, the associated discriminating coefficients might provide misleading results. For example, consider two correlated variables that have rather small contributions to the discriminant function. The two estimated standardized coefficients might be large but with opposite signs, so that the effect of one variable is, to some degree, canceled by the opposite effect of the other variable. However, this could be misinterpreted as both variables having relatively large contributions to the discriminant function but in different directions.
A better guide to the meaning of the discriminant function is to use structure coefficients. Structure coefficients look at the correlations between the discriminant function and each predictor variable. The variable that correlates most highly with the discriminant function shares the greatest amount of variance with the discriminant function and, therefore, explains the discriminant function more. Structure coefficients can be directly derived by calculating the correlation coefficients between each of the predictor variables and the discriminant scores. They address the question, To which of the K variables is the discriminant function most closely related? When the absolute value of the coefficient is very large (close to 1), the discriminant function is carrying nearly the same information as the predictor variable. In comparison, when the coefficient is near zero, the discriminant function and the predictor variable share little variance. The discriminant function can be named after the predictor variables that have the highest correlations.

Classifications

In discriminant analysis, discriminant functions can be used to make predictions of the group to which a case most likely belongs. Classification of an individual case involves calculation of the individual's discriminant score and comparison of it with each of the group centroids. To make predictions of group membership, the distance from the individual's discriminant scores to each of the group centroids is measured, and the centroid to which the individual's scores are closest is the group to which the individual is predicted to belong. A distance measure commonly used in discriminant analysis is Mahalanobis D², which calculates the squared distance from a specific case to each of the group centroids. D² can be considered a measure that represents the degree to which a case's profile on the predictor variables resembles the typical profile of a group. Based on this interpretation, a case should be classified into the group with the smallest D². Because D² is a statistic with a chi-square distribution of p degrees of freedom, where p is the number of predictor variables, the probabilities that a case belongs to a group can be calculated. Similar to D², these probabilities can also be used to assign group membership. That is, a case will be classified into the group to which it has the highest probability of belonging. In addition, the probabilities of group membership serve as an indicator of the discriminating power of discriminant functions. For example, discriminant functions are considered to function well when a case has a high probability of belonging to one group but low probabilities of belonging to other groups. In this way, it is clear that the case should be classified into the group of the highest probability. However, if probabilities for all groups are very close, it might be meaningless to classify the case into a specific group, given that the groups are actually not very distinct based on the discriminant functions.

When predicted group membership is compared with actual group membership in the sample from which the function was calculated, the percentage of correct predictions, often called the hit ratio, can be calculated. To evaluate the performance of classification, the hit ratio should not be compared with zero but rather with the percentage that would have been correctly classified by chance. If the groups have equal sample sizes, the expected percentage of correct predictions by chance is equal to 1/K, where K is the total number of groups. For instance, for a two-group analysis with equal sample sizes, one can expect a 50% chance of making correct predictions of group membership by pure random guesses, and therefore the expected hit ratio based on chance is .5. If the hit ratio yielded by discriminant functions is .6, the improvement is actually rather small. When groups are unequal in size, the percentage that could be correctly classified by chance can be estimated by multiplying the expected probabilities of each group by the corresponding group size, summing for all groups, and dividing the sum by the total sample size. A z test for the difference between proportions can be performed to statistically test the significance of the improvement in the classification accuracy from the discriminant analysis.

It should be noted that the hit ratio tends to overestimate the classification accuracy of discriminant functions when the same sample is used to both derive the discriminant function and test its predictive ability. To overcome this,
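The chance benchmark for unequal groups described above can be sketched as follows. The group sizes and hit ratio are hypothetical, and the z statistic here uses a simple one-sample comparison of the observed proportion against the chance proportion, which is one common way to carry out the test mentioned in this entry:

```python
# Proportional-chance criterion and a z test for classification accuracy.
import math

group_sizes = [60, 40]      # hypothetical unequal group sizes
n = sum(group_sizes)

# Multiply each group's expected probability (size/n) by its size,
# sum over groups, and divide by the total sample size.
p_chance = sum((size / n) * size for size in group_sizes) / n
print(round(p_chance, 2))   # 0.52

hit_ratio = 0.68            # hypothetical observed proportion correct

# z statistic for the observed hit ratio against the chance proportion
se = math.sqrt(p_chance * (1 - p_chance) / n)
z = (hit_ratio - p_chance) / se
print(round(z, 2))          # 3.2
```

With equal group sizes the same calculation reduces to 1/K (here, .5 for two groups), matching the equal-size case discussed above.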
DISCUSSION SECTION
which the results can be applied to the general population of interest. Internal validity refers to the degree to which conclusions drawn from a study correctly describe what actually transpired during the study. External validity refers to whether and to what extent the results of a study can be generalized to a larger population (the target population of the study from which the sample was drawn, and other populations across time and space).

Threats to validity include selection bias (which occurs in the design stage of a study), information bias (which occurs in the data collection stage of a study), and confounding bias (which occurs in the data analysis stage of a study). Selection bias occurs when, during the selection step of the study, the participants in the groups to be compared are not comparable because they differ in extraneous variables other than the independent variable under study. In this case, it would be difficult for the researcher to determine whether the discrepancy in the groups is due to the independent variable or to the other variables. Selection bias affects internal validity. Selection bias also occurs when the characteristics of subjects selected for a study are systematically different from those of the target population. This bias affects external validity. Selection bias may be reduced when group assignment is randomized (in experiments) or selection processes are controlled for (in observational studies). Information bias occurs when the estimated effect is distorted either by an error in measurement or by misclassifying the participant for the independent (exposure) and/or dependent (outcome) variables. In experiments, information bias may be reduced by improving the accuracy of measuring instruments and by training technicians. In observational studies, information bias may be reduced by pretesting questionnaires and training interviewers. Confounding bias occurs when statistical controlling techniques (stratification or mathematical modeling) are not used to adjust for the effects of confounding variables. Therefore, a distorted estimate of the exposure effect results because the exposure effect is mixed with the effects of extraneous variables. Confounding bias may be reduced by performing a "dual" analysis (with and without adjusting for extraneous variables). Although adjusting for confounders ensures unbiasedness, unnecessary adjustment for nonconfounding variables always reduces the statistical power of a study. Therefore, if both results in a dual analysis are similar, then the unadjusted result is unbiased and should be reported based on power considerations. If both results are different, then the adjusted one should be reported based on validity considerations.

Below is a checklist of the items to be included in a discussion section:

1. Overview: Provide a brief summary of the most important parts of the introduction section and then the results section.

2. Interpretation: Relate the results back to the initial study hypotheses. Do they support or fail to support the study hypotheses? It is also important to discuss how the results relate to the literature cited in the introduction. Comment on the importance and relevance of the findings and how the findings are related to the big picture.

3. Strengths and limitations: Discuss the strengths and limitations of the study.

4. Recommendations: Provide recommendations on the practical use of current study findings and suggestions for future research.

The following are some tips for researchers to follow in writing the discussion section: (a) Results do not prove hypotheses right or wrong. They support them or fail to provide support for them. (b) In the case of a correlation study, causal language should not be used to discuss the results. (c) Space is valuable in scientific journals, so being concise is imperative. Some journals ask authors to restrict discussion to four pages or less, double spaced, typed. That works out to approximately one printed page. (d) When referring to information, data generated by the researcher's own study should be distinguished from published information. Verb tense is an important tool for doing that—past tense can be used to refer to work done; present tense can be used to refer to generally accepted facts and principles.

The discussion section is important because it interprets the key results of a researcher's study in light of the research hypotheses under study and the published literature. It should provide a good indication of what the new findings from the
researcher's study are and where research should go next.

Bernard Choi and Anita Pak

See also Bias; Methods Section; Results Section; Validity of Research Conclusions

Further Readings

Branson, R. D. (2004). Anatomy of a research paper. Respiratory Care, 49, 1222–1228.
Choi, P. T. (2005). Statistics for the reader: What to ask before believing the results. Canadian Journal of Anesthesia, 52, R1–R5.
Hulley, S. B., Newman, T. B., & Cummings, S. R. (1988). The anatomy and physiology of research. In S. B. Hulley & S. R. Cummings (Eds.), Designing clinical research (pp. 1–11). Baltimore: Williams & Wilkins.

DISSERTATION

As a requirement for an advanced university degree, the dissertation is usually the last requirement a candidate fulfills for a doctorate. Probably its most salient characteristic is that it is a unique product, one that embodies in some way the creativity of the author—the result of research and of original thinking and the creation of a physical product. Depending on departmental tradition, some dissertations are expected to be solely originated by the candidate; in others, the topic (and sometimes the approach as well) is given by the major professor. But even in the latter case, the candidates are expected to add something of their own originality to the end result.

This description of some relatively common features of the dissertation requirement applies primarily to higher education in the United States. That there are common features owes much to communication among universities, no doubt through such agencies as the Council of Graduate Schools and the American Association of Universities. But the requirement's evolution at the local level has resulted in considerable variation across the differing cultures of universities and even the departments within them.

As a means of maintaining high standards, many universities administer their doctoral program through a graduate school with its own dean. Additionally, some universities designate certain faculty, those who have proven they are researchers, as graduate faculty who participate in setting advanced degree policies and serve as major professors and chairs. This dual faculty status has disappeared at most universities, however.

The dissertation process typically moves through three stages: the proposal stage; the activation stage, in which the research, thinking, or producing work is accomplished; and the final stage of presentation and approval. Though distinguishable for explanatory purposes, these stages are often blurred in practice. This is particularly evident when the area in which one intends to work is known, but not the specific aspect. For example, the proposal and activation stages often merge until the project outlines become clear.

In most cases one faculty member from the department serves as the major professor or committee chair (henceforth referred to as the chair). This is usually at the invitation of the student, although some departments assign chairs in order to equitably balance faculty load. Additional faculty are recruited by the student to serve as readers or committee members, often at the suggestion of the chair. Dissertation chairpersons and committee members are chosen for their experience in the candidate's topic of interest and/or for some special qualifications, such as experience with the research method or knowledge of statistics or experimental design.

The Proposal Stage

Depending on the department's tradition, the dissertation may or may not be a collaborative affair with the faculty. Regardless, the dissertation, beginning with the formulation of the problem in the proposal, is often a one-on-one, give-and-take relation between the candidate and the committee chair. In How to Prepare a Dissertation Proposal, David R. Krathwohl and Nick L. Smith described a dissertation proposal as a logical plan of work to learn something of real or potential significance about an area of interest. Its opening problem statement draws the reader into the plan: showing its significance, describing how it builds on previous work
(both substantively and/or methodologically), and outlining the investigation. The whole plan of action flows from the problem statement: the activities described in the design section, their sequence often illuminated graphically in the work plan (and, if one is included, by the time schedule), and their feasibility shown by the availability of resources. Krathwohl and Smith point out that a well-written proposal's enthusiasm should carry the reader along and reassure the reader with its technical and scholarly competence. A solid proposal provides the reader with such a model of the clarity of thought and writing to be expected in the final write-up that the reader feels this is an opportunity to support research that should not be missed.

While at first it may appear that this definition suggests that the proposal should be written like an advertisement, that is not what it is intended to convey. It simply recognizes the fact that if students cannot be enthusiastic about their idea, it is a lot to expect others to be. Material can be written in an interesting way and still present the idea with integrity. It doesn't have to be boring to be good.

Second, the definition points out that the proposal is an integrated chain of reasoning that makes strong logical connections between the problem statement and the coherent plan of action the student has proposed undertaking.

Third, this process means that students use this opportunity to present their ideas and proposed actions for consideration in a shared decision-making situation. With all the integrity at their command, they help their chair or doctoral committee see how they view the situation, how the idea fills a need, how it builds on what has been done before, how it will proceed, how pitfalls will be avoided, why pitfalls not avoided are not a serious threat, what the consequences are likely to be, and what significance they are likely to have.

Fourth, while the students' ideas and action plans are subject to consideration, so also is their capability to successfully carry them through.

Such a proposal definition gives the student a goal, but proposals serve many purposes besides providing an argument for conducting the study and evidence of the student's ability. Proposals also serve as a request for faculty commitment, as a contract, as an evaluative criterion, and as a partial dissertation draft.

Faculty members who assume the role of dissertation chair take on a substantial commitment of time, energy, and in some instances resources. Agreeing to be the student's chair usually involves a commitment to help where able, such as in procuring laboratories, equipment, participants, access to research sites, and funding. Thus, nearly all faculty set limits on how many doctoral candidates they will carry at any one time.

In cases in which students take on problems that are outside the interests of any departmental faculty, students may experience difficulty in finding a faculty member to work with them because of the substantial additional time commitment and the burden of gaining competence in another area. If no one accepts them, the students may choose to change topic or, in some instances, transfer to another university.

Few view the proposal as a binding contract that, if fulfilled, will automatically lead to a degree. Nevertheless, there is a sense that the proposal serves as a good faith agreement whereby if the full committee approves the proposal and the student does what is proposed with sufficient quality (whatever that standard means in the local context), then the student has fulfilled his or her part of the bargain, and the faculty members will fulfill theirs. Clearly, as an adjunct to its serving as a contract, the proposal also serves as an evaluative criterion for "fulfilling his or her part."

In those institutions in which there is a formal admission to candidacy status, the proposal tends to carry with it more of a faculty commitment: The faculty have deemed the student of sufficient merit to make the student a candidate; therefore the faculty must do what they can to help the student successfully complete the degree.

Finally, the proposal often becomes part of the dissertation itself. The format for many dissertations is typically five chapters:

1. Statement of the problem, why it is of some importance, and what one hopes to be able to show

2. A review of the past research and thinking on the problem, how it relates to what the student intends to do, and how this project builds on it and possibly goes beyond it—substantively and methodologically
3. The plan of action (what, why, when, how, where, and who)

4. What was found, the data and its processing

5. Interpretation of the data in relation to the problem proposed

Many departments require students to prepare the proposal as the first three chapters to be used in the dissertation with appropriate modification.

It is often easier to prepare a proposal when the work to be done can be preplanned. Many proposals, however, especially in the humanities and more qualitatively oriented parts of the social sciences, are for emergent studies. The focus of work emerges as the student works with a given phenomenon. Without a specific plan of work, the student describes the study's purpose, the approach, the boundaries of the persons and situations as well as rules for inclusion or exclusion, and expected findings. Since reality may be different, rules for how much deviation requires further approval are appropriate.

Practice varies, but all institutions require proposal approval by the chair, if not the whole committee. Some institutions require very formal approval, even an oral examination on the proposal; others are much looser.

The Activation Phase

This phase also varies widely in how actively the chair and committee monitor or work with the student. Particularly where faculty members are responsible for a funded project supporting a dissertation, monitoring is an expected function. But for many, just how far faculty members are expected or desired to be involved in what is supposed to be the student's own work is a fine line. Too much and it becomes the professor's study rather than the student's. Most faculties let the students set the pace and are available when called on for help.

Completion appears to be a problem in all fields; it is commonly called the all-but-dissertation (ABD) problem. Successful completion of 13 to 14 years of education and selection into a doctoral program designates these students as exceptional, so for candidates to fail the final hurdle is a waste of talent. Estimates vary and differ by subject matter, but sciences have the highest completion rates after 10 years of candidacy—70% to 80%—and English and the humanities the lowest—30%. The likelihood of ABD increases when a student leaves campus before finishing. Reasons vary, but finances are a major factor. Michael T. Nettles and Catherine M. Millett's Rate of Progress scale allows students to compare their progress with that of peers in the same field of study. The PhD Completion Project of the Council of Graduate Schools aims to find ways of decreasing ABD levels.

The Final Stage

Many dissertations have a natural stopping point: the proposed experiment concludes, one is no longer learning anything new, reasonably available sources of new data are exhausted. The study is closed with data analysis and interpretation in relation to what one proposed. But if one is building a theoretical model, developing or critiquing a point of view, describing a situation, or developing a physical product, when has one done enough? Presumably when the model is adequately described, the point of view appropriately presented, or the product works on some level. But "adequate," "appropriate," and "some level" describe judgments that must be made by one's chair and committee, and their judgment may differ from that of the student. The time to face the decision of how much is enough is as soon as the research problem is sufficiently well described that criteria that the chair and committee deem reasonable can be ascribed to it—a specified period of observations or number of persons to be queried, certain books to be digested and brought to bear, and so forth. While not a guaranteed fix because minds change as the problem emerges, the salience of closure conditions is always greater once the issue is raised.

Dissertations are expected to conform to the standard of writing and the appropriate style guide for their discipline. In the social sciences this guide is usually either American Psychological Association or Modern Language Association style.

Once the chair is satisfied (often also the committee), the final step is, in most cases, an oral examination. Such examinations vary in their
Distribution 379
distributed but are approximately normal. According to the central limit theorem, as the sample size increases, the shape of the distribution of the sample means taken will approach a normal distribution.

Discrete Distributions

The most commonly discussed discrete probability distribution is the binomial distribution. The binomial distribution is concerned with scores that are dichotomous in nature; that is, there can be only one of two possible outcomes. The Bernoulli trial (named after mathematician Jakob Bernoulli) is a good example that is often used when teaching students about a binomial distribution of scores. The most often discussed Bernoulli trial is that of flipping a coin, in which the outcome will be either heads or tails. The process allows for estimating the probability that an event will occur. Binomial distributions can also be used when one wants to determine the probability associated with correct or incorrect responses. In this case, an example might be a 10-item test that is scored dichotomously (correct/incorrect). A binomial distribution allows us to calculate the probability of scoring 5 out of 10, 6 out of 10, 7 out of 10, and so on, correct. Because the calculation of binomial probability distributions can become somewhat tedious, binomial distribution tables often accompany many statistics textbooks so that researchers can quickly access information regarding such estimates. It should be noted that binomial distributions are most often used in nonparametric procedures. Chi-square distributions are another form of a discrete distribution that is often used when one wants to report whether an expected outcome occurred due to chance alone.

Continuous Distributions

Continuous variables can be any value or interval associated with a number line. In theory, a continuous variable can assume an infinite number of possible values with no gaps among the intervals. This is sometimes referred to as a "smooth" process. To graph a continuous probability distribution, one draws a horizontal axis that represents the values associated with the continuous variable. Above the horizontal axis is drawn a curve that encompasses all values within the distribution. The area between the curve and the horizontal axis is sometimes referred to as the area under the curve. Generally speaking, distributions that are based on continuous data tend to cluster around an average score, or measure of central tendency. The measure of central tendency that is most often reported in research is known as the mean. Other measures of central tendency include the median and the mode. Bimodal distributions occur when there are two modes, or two values that occur most often in the distribution of scores. The median is often used when there are extreme values at either end of the distribution. When the mean, median, and mode are the same value, the curve tends to be bell shaped, or normal, in nature. This is one of the unique features of what is known as the normal distribution. In such cases, the curve is said to be symmetrical about the mean, which means that the shape is the same on both sides. In other words, if one drew a perpendicular line through the mean score, each side of the curve would be a perfect reflection of the other. Skewness is the term used to measure a lack of symmetry in a distribution. Skewness occurs when one tail of the distribution is longer than the other. Distributions can be positively or negatively skewed depending on which tail is longer. In addition, distributions can differ in the amount of variability. Variability explains the dispersion of scores around the mean. Distributions with considerable dispersion around the mean tend to be flat when compared to the normal curve. Distributions that are tightly dispersed around the mean tend to be peaked in nature when compared to the normal curve, with the majority of scores falling very close to the mean. In cases in which the distribution of scores appears to be flat, the curve is said to be platykurtic, and distributions that are peaked compared with the normal curve are said to be leptokurtic in nature. The flatness or peakedness of a distribution is a measure of kurtosis, which, along with variability and skewness, helps explain the shape of a distribution of scores. Each normally distributed variable will have its own measure of central tendency, variability, degree of skewness, and kurtosis. Given this fact, the shape and location of the curves will vary for many normally distributed variables. To avoid needing to have a table of areas under the curve for each
normally distributed variable, statisticians have simplified things through the use of the standard normal distribution, based on a z-score metric with a mean of zero and a standard deviation of 1. By standardizing scores, we can estimate the probability that a score will fall within a certain region under the normal curve. Parametric test statistics are typically applied to data that approximate a normal distribution, and t distributions and F distributions are often used. As with the binomial distribution and chi-square distribution, tables for the t distribution and F distribution are typically found in most introductory statistics textbooks. These distributions are used to examine the variance associated with two or more sets of sample means. Because the sampling distribution of scores may vary based on the sample size, the calculation of both the t and F distributions includes something called degrees of freedom, which is an estimate of the sample size for the groups under examination. When a distribution is said to be nonnormal, the use of nonparametric or distribution-free statistics is recommended.

Vicki Schmitt

See also Bernoulli Distribution; Central Limit Theorem; Frequency Distribution; Kurtosis; Nonparametric Statistics; Normal Distribution; Parametric Statistics

Further Readings

Bluman, A. G. (2009). Elementary statistics: A step by step approach (7th ed.). Boston: McGraw-Hill.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Wadsworth Cengage Learning.
Larose, D. T. (2010). Discovering statistics. New York: Freeman.
Salkind, N. J. (2008). Statistics for people who (think they) hate statistics (3rd ed.). Thousand Oaks, CA: Sage.

DISTURBANCE TERMS

In the field of research design, researchers often want to know whether there is a relationship between an observed variable (say, y) and another observed variable (say, x). To answer the question, researchers may construct the model in which y depends on x. Although y is not necessarily explained only by x, a discrepancy always exists between the observed value of y and the predicted value of y obtained from the model. The discrepancy is taken as a disturbance term or an error term.

Suppose that n sets of data, (x1, y1), (x2, y2), . . . , (xn, yn), are observed, where yi is a scalar and xi is a vector (say, a 1 × k vector). We assume that there is a relationship between x and y, which is represented as the model y = f(x), where f(x) is a function of x. We say that y is explained by x, or y is regressed on x. Thus y is called the dependent or explained variable, and x is a vector of the independent or explanatory variables. Suppose that a vector of unknown parameters (say β, which is a k × 1 vector) is included in f(x). Using the n sets of data, we consider estimating β in f(x). If we add a disturbance term (say u, which is also called an error term), we can express the relationship between y and x as y = f(x) + u. The disturbance term u represents the part of y that cannot be explained by x. Usually, x is assumed to be nonstochastic. Note that x is said to be nonstochastic when it takes a fixed value. Thus f(x) is deterministic, while u is stochastic. The researcher must specify f(x); typically, it is specified as the linear function f(x) = xβ.

The reasons a disturbance term u is necessary are as follows: (a) There are some unpredictable elements of randomness in human responses, (b) an effect of a large number of omitted variables is contained in u, (c) there is a measurement error in y, or (d) a functional form of f(x) is not known in general. Corresponding examples are as follows: (a) Gross domestic product data are observed as a result of human behavior, which is usually unpredictable and is thought of as a source of randomness. (b) We cannot know all the explanatory variables on which y depends. Most of the variables are omitted, and only the important variables needed for analysis are included in x. The influence of the omitted variables is thought of as a source of u. (c) Some kinds of errors are included in almost all data, either because of data collection difficulties or because the explained variable is inherently unmeasurable and a proxy variable has to be used in its stead. (d) Conventionally we
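The linear specification y = f(x) + u with f(x) = xβ can be made concrete with a small simulation. The following is a hedged sketch, not code from the entry: it fabricates data with a known β and a random disturbance u, then recovers an estimate of β by ordinary least squares, one standard way of estimating β in the linear model.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 3                        # n observations, k explanatory variables
beta = np.array([1.5, -2.0, 0.5])    # the (in practice unknown) k x 1 parameter vector
x = rng.normal(size=(n, k))          # explanatory variables, held fixed (nonstochastic)
u = rng.normal(scale=0.3, size=n)    # stochastic disturbance (error) term
y = x @ beta + u                     # y = f(x) + u with the linear f(x) = x beta

# Least squares estimate of beta; the residuals serve as estimates of the
# unobserved disturbances u.
beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
residuals = y - x @ beta_hat
```

Because u is present, beta_hat only approximates the true β; the residuals play the role of the disturbances, which can never be observed directly.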
to solve difference equations involving probabilities, and offers original and less restricted treatments of the duration of play in games of chance (i.e., the "gambler's ruin" problem). Although some of the mathematical terminology and notation is archaic, with minor adjustments and deletions The Doctrine of Chances could still be used today as a textbook in probability theory. Because the author intended to add to his meager income by the book's sale, it was written in a somewhat more accessible style than a pure mathematical monograph.

Yet from the standpoint of later developments, the most critical contribution can be found on pages 243–254 of the 3rd edition (or pages 235–243 of the 2nd edition), which are tucked between the penultimate and final problems. It is here that de Moivre presented "a method of approximating the sum of the terms of the binomial (a + b)^n expanded into a series, from whence are deduced some practical rules to estimate the degree of assent which is to be given to experiments" (put in modern mathematical notation and expressed in contemporary English orthography). Going beyond Bernoulli's work (and that of Nicholas Bernoulli, Jakob's nephew), the approximation is nothing other than the normal (or Gaussian) curve.

Although de Moivre did not think of the resulting exponential function in terms of a probability density function, as it is now conceived, he clearly viewed it as describing a symmetrical bell-shaped curve with inflection points on both sides. Furthermore, even if he did not possess the explicit concept of the standard deviation, which constitutes one of two parameters in the modern formula (the other being the mean), de Moivre did have an implicit idea of a distinct and fixed unit that meaningfully divided the curve on either side of the maximum point. By hand calculation he showed that the probabilities of outcomes coming within ±1, ±2, and ±3 of these units would be .6827, .9543, and .9987 (rounding his figures to four decimal places). The corresponding modern values for ±1, ±2, and ±3 standard deviations from the mean are .6826, .9544, and .9974. Taken together, de Moivre's understanding was sufficient to convince Karl Pearson and others to credit him with the original discovery of the normal curve.

There are two additional aspects of this work that are worth mentioning, even if not as important as the normal curve itself.

First, de Moivre established a special case of the central limit theorem that is sometimes referred to as the theorem of de Moivre–Laplace. In effect, the theorem states that as the number of independent (Bernoulli) trials increases indefinitely, the binomial distribution approaches the normal distribution. De Moivre illustrated this point by showing that a close approximation to the normal curve could be obtained simply by flipping a coin a sufficient number of times. This demonstration is basically equivalent to that of the bean machine or quincunx that Francis Galton invented to make the same point.

Second, de Moivre offered the initial components of what later became known as the Poisson approximation to the binomial distribution, although it was left to Siméon Poisson to give this derivation the treatment it deserved. Given this fragmentary achievement and others, one can only imagine what de Moivre would have achieved had he obtained a chair of mathematics at a major European university.

Aftermath

Like his predecessors Huygens, Montmort, and Jakob Bernoulli, de Moivre was primarily interested in what was once termed direct probability. That is, given a particular probability distribution, the goal was to infer the probability of a specified event. To offer a specific example, the aim was to answer questions such as, What is the probability of throwing a score of 12 given three throws of a regular six-faced die? In contrast, these early mathematicians were not yet intrigued by problems in inverse probability. In this case the goal is to infer the underlying probability distribution that would most likely produce a set of observed events. An instance would be questions like, Given that 10 coin tosses yielded 6 heads and 4 tails, what is the probability that it is still an unbiased coin? And how many coin tosses would we need before we knew that the coin was unbiased with a given degree of confidence? Inverse probability is what we now call statistical inference—the inference of population properties from small random samples taken from that population. How can we infer the population
distribution from the sample distribution? How much confidence can we place in using the sample mean as the estimate of the population mean?

This orientation toward direct rather than inverse probability makes good sense historically. As already noted, probability theory was first inspired by games of chance. And such games begin with established probability distributions. That is how each game is defined. So a coin toss should have two equally likely outcomes, a die throw six equally likely outcomes, and a single draw from a full deck of cards, 52 equally likely outcomes. The probabilities of various compound outcomes—like getting one and only one ace in three throws—can therefore be derived in a direct and methodical manner. In these derivations one certainty (the outcome probability) is derived from another certainty (the prior probability distribution) by completely certain means (the laws of probability). By comparison, because inverse probability deals with uncertainties, conjectures, and estimates, it seems far more resistant to scientific analysis. It eventually required the introduction of such concepts as confidence intervals and probability levels.

It is telling that when Jakob Bernoulli attempted to solve a problem of the latter kind, he dramatically failed. He specifically dealt with an urn model with a given number of black and white pebbles. He then asked how many draws (with replacement) a person would have to make before the relative frequencies could be stated with an a priori level of confidence. After much mathematical maneuvering—essentially constituting the first power analysis—Bernoulli came up with a ludicrous answer: 25,550 observations or tests. Not surprisingly, he just ended Ars Conjectandi right there, apparently without a general conclusion, and left the manuscript unpublished at his death. Although de Moivre made some attempt to continue from where his predecessor left off, he was hardly more successful, except for the derivation of the normal curve.

It is accordingly ironic that the normal curve eventually provided a crucial contribution to statistical inference and analysis. First, Carl Friedrich Gauss treated errors of observation as normally distributed and used the method of least squares to minimize those errors. Then Pierre-Simon Laplace, while working on the central limit theorem, discovered that the distribution of sample means tends to be described by a normal distribution, a result that is independent of the population distribution. Adolphe Quételet later showed that human individual differences in physical characteristics could be described by the same curve. The average person (l'homme moyen) was someone who resided right in the middle of the distribution. Later still Galton extended this application to individual differences in psychological attributes and defined the level of ability according to placement on this curve. In due course the concept of univariate normality was generalized to those of bivariate and multivariate normality. The normal distribution thus became the single most important probability distribution in the behavioral and social sciences—with implications that went well beyond what de Moivre had more modestly envisioned in The Doctrine of Chances.

Dean Keith Simonton

See also Game Theory; Probability, Laws of; Significance Level, Concept of; Significance Level, Interpretation and Construction

Further Readings

de Moivre, A. (1967). The doctrine of chances (3rd ed.). New York: Chelsea. (Original work published 1756)
Hald, A. (2003). A history of probability and statistics and their applications before 1750. Hoboken, NJ: Wiley.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Harvard University Press.

DOUBLE-BLIND PROCEDURE

A double-blind procedure refers to a procedure in which experimenters and participants are "blind to" (without knowledge of) crucial aspects of a study, including the hypotheses, expectations, or,
Gauss interpreted the curve as a density function most important, the assignment of participants to
that could be applied to measurement problems in experimental groups. This entry discusses the
astronomy. By assuming that errors of measure- implementation and application of double-blind
ment were normally distributed, Gauss could derive procedures, along with their historical background
and some of the common criticisms directed at them.

Experimental Control

"Double-blinding" is intimately coupled to randomization, where participants in an experimental study are allocated to groups according to a random algorithm. Participants and experimenters are then blinded to group allocation. Hence double-blinding is an additional control element in experimental studies. If only some aspect of a study is blinded, it is a single-blind study. This is the case when the measurement of an outcome parameter is done by someone who does not know which group a participant belongs to and what hypotheses and expectations are being tested. This could, in principle, also be done in nonexperimental studies if, for instance, two naturally occurring cohorts, smokers and nonsmokers, say, are tested for some objective marker, such as intelligence or plasma level of hormones. Double-blinding presupposes that participants are allocated to the experimental procedure and control procedure at random. Hence, by definition, natural groups or cohorts cannot be subject to double-blinding. Double-blind testing is a standard for all pharmaceutical substances, such as drugs, but should be implemented whenever possible in all designs. In order for a study to succeed with double-blinding, a control intervention uses a placebo that can be manufactured in a way that makes the placebo indistinguishable from the treatment.

Allocation Concealment and Blind Analysis

There are two corollaries to double-blinding: allocation concealment and blind statistical analysis. If an allocation algorithm, that is, the process of allocating participants to experimental groups, is completely random, then, by definition, the allocation of participants to groups is concealed. If someone were to allocate participants to groups in an alternating fashion, then the allocation would not be concealed. The reason is that if someone were to be unblinded, because of an adverse event, say, then whoever knew about the allocation system could trace back and forth from this participant and find out about the group allocation of the other participants.

Double-blind studies are normally also evaluated "blind." Here, the data are input by automatic means (Internet, scanning), or by assistants blind to group allocation of participants. Whatever procedures are done to prepare the database for analysis, such as transformations, imputations of missing values, and deletion of outliers, are done without knowledge of group assignment. Normally a study protocol stipulates the final statistical analysis in advance. This analysis is then run with a database that is still blinded in the sense that the groups are named "A" and "B." Only after this first and definitive analysis has been conducted and documented is the blind broken.

Good clinical trials also test whether the blinding was compromised during the trial. If, for instance, a substance or intervention has many and characteristic side effects, then patients or clinicians can often guess whether someone was allocated to treatment (often also called verum, from the Latin word for true) or placebo. To test for the integrity of the blinding procedure, either all participants or a random sample of them are asked, before the blind is broken, what group they think they had been allocated to. In a good, uncompromised blinded trial, there will be a near-random answer pattern because some patients will have improved under treatment and some under control.

Placebos

To make blinding of patients and clinicians possible, the control procedure has to be a good mock or placebo procedure (also sometimes called sham). In pharmaceutical trials this is normally done by administering the placebo in a capsule or pill of the same color but containing pharmacologically inert material, such as corn flour. If it is necessary to simulate a taste, then often other substances or coloring that are inactive or only slightly active, such as vitamin C, are added. For instance, if someone wants to create a placebo for caffeine, quinine can be used. Sometimes, if a pharmacological substance has strong side effects, an active placebo might be used. This is a substance that produces some of the side effects, but hardly any of the desired pharmacological effects, of the experimental treatment.
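The integrity check described above (asking participants which arm they believe they were in, before the blind is broken) can be sketched numerically. The snippet below is a simplified illustration, not a standard trial procedure: it compares the proportion of correct guesses against chance with a normal approximation, whereas published trials typically use dedicated blinding indices. The guess and allocation data are hypothetical.

```python
import math

def blinding_check(guesses, actual):
    """Compare participants' guessed allocations with the true ones.

    In an uncompromised two-arm trial the proportion of correct guesses
    should be near chance (0.5); a large z suggests the blind may have
    been broken. Simplified sketch using a normal approximation.
    """
    n = len(guesses)
    correct = sum(g == a for g, a in zip(guesses, actual))
    p_hat = correct / n
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    return p_hat, z

# Hypothetical data: 0 = placebo, 1 = treatment (verum)
actual  = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
guesses = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
p_hat, z = blinding_check(guesses, actual)   # 6 of 10 correct, z well below 2
```

A guess rate near 50% with a small z is the "near-random answer pattern" the entry describes; in a real trial the sample would be far larger and the analysis would account for response rates in each arm.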
and used a screen to blind the participants from seeing the actual weights. Blinded tests became standard in hypnosis and parapsychological research. Gradually medicine also came to understand the importance and power of suggestion and expectation. The next three important dates are the publication of Methodenlehre der therapeutischen Untersuchung (Clinical Research Methodology) by German pharmacologist Paul Martini in 1932, the introduction of randomization by Ronald Fisher's Design of Experiments in 1935, and the 1945 Cornell conferences on therapy, which codified the blinded clinical trial.

Caveats and Criticisms

Currently there is a strong debate over how to balance the merits of strict experimental control with other important ingredients of therapeutic procedures. The double-blind procedure has grown out of a strictly mechanistic, pharmacological model of efficacy, in which only a single specific physiological mechanism is important, such as the blocking of a target receptor, or one single psychological process that can be decoupled from contexts. Such careful focus can be achieved only in strictly experimental research with animals and partially also with humans. But as soon as we reach a higher level of complexity and come closer to real-world experiences, such blinding procedures are not necessarily useful or possible. The real-world effectiveness of a particular therapeutic intervention is likely to consist of a specific, mechanistically active ingredient that sits on top of a variety of other effects, such as strong, nonspecific effects of relief from being in a stable therapeutic relationship; hope that a competent practitioner is structuring the treatment; and reduction of anxiety through the security given by the professionalism of the context. This approach has been discussed under the catchword whole systems research, which acknowledges (a) that a system or package of care is more than just the sum of all its elements and (b) that it is unrealistic to assume that all complex systems of therapy can be disentangled into their individual elements. Both pragmatic and theoretical reasons stand against it. Pragmatically speaking, blinded clinical trials are cost intensive, and researchers will likely not be able to muster the resources to run sufficient numbers of isolated, blinded trials on all components to gain enough certainty. Theoretically, therapeutic packages come in a bundle that falls apart if one were to disentangle them into separate elements. So care has to be taken not to overgeneralize the pharmacological model to all situations.

Blinding is always a good idea, where it can be implemented, because it increases the internal validity of a study. Double-blinding is necessary if one wants to know the specific effect of a mechanistic intervention.

Harald Walach

See also Experimenter Expectancy Effect; Hawthorne Effect; Internal Validity; Placebo; Randomization Tests

Further Readings

Committee for Proprietary Medicinal Products Working Party on Efficacy of Medicinal Products. (1995). Biostatistical methodology in clinical trials in applications for marketing authorizations for medicinal products. CPMP working party on efficacy of medicinal products note for guidance III/3630/92-EN. Statistics in Medicine, 14, 1659–1682.
Crabtree, A. (1993). From Mesmer to Freud: Magnetic sleep and the roots of psychological healing. New Haven, CT: Yale University Press.
Greenberg, R. P., Bornstein, R. F., Greenberg, M. D., & Fisher, S. (1992). A meta-analysis of antidepressant outcome under "blinder" conditions. Journal of Consulting & Clinical Psychology, 60, 664–669.
Kaptchuk, T. J. (1998). Intentional ignorance: A history of blind assessment and placebo controls in medicine. Bulletin of the History of Medicine, 72, 389–433.
Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273, 408–412.
Shelley, J. H., & Baur, M. P. (1999). Paul Martini: The first clinical pharmacologist? Lancet, 353, 1870–1873.
White, K., Kando, J., Park, T., Waternaux, C., & Brown, W. A. (1992). Side effects and the 'blindability' of clinical drug trials. American Journal of Psychiatry, 149, 1730–1731.
390 Dummy Coding

group is left over, meaning this group is the reference group. However, the overall results will be the same no matter which groups we select.

Group        Satisfaction  Column 1  Column 2  Group Mean
Single            25           1         0        24.80
S                 28           1         0
S                 20           1         0
S                 26           1         0
S                 25           1         0
Married           30           0         1        30.20
M                 28           0         1
M                 32           0         1
M                 33           0         1
M                 28           0         1
Divorced          20           0         0        23.80
D                 22           0         0
D                 28           0         0
D                 25           0         0
D                 24           0         0
Grand Mean       26.27        0.33      0.33

Note there are three groups and thus 2 degrees of freedom between groups. Accordingly, there are two dummy-coded variables. If X1 denotes single, X2 denotes married, and X3 denotes divorced, then the single group is identified when X1 is 1 and X2 is 0; the married group is identified when X2 is 1 and X1 is 0; and the divorced group is identified when both X1 and X2 are 0.

If Ŷ denotes the predicted level of life satisfaction, then we get the following regression equation:

    Ŷ = a + b1(X1) + b2(X2),

where a is the intercept, and b1 and b2 are slopes or weights. The divorced group is identified when both X1 and X2 are 0, so it drops out of the regression equation, leaving the predicted value equal to the mean of the divorced group.

The group that gets all 0s is the reference group. For this example, the reference group is the divorced group. The regression coefficients present a contrast or difference between the group identified by the column and the reference group. To be specific, the first b weight corresponds to the single group, and b1 represents the difference between the means of the divorced and single groups. The second b weight represents the difference in means between the divorced and married groups.

Dummy Coding in Multiple Regression With Categorical Variables

Multiple regression is a linear transformation of the X variables such that the sum of squared deviations of the observed and predicted Y is minimized. The prediction of Y is accomplished by the following equation:

    Y′i = b0 + b1X1i + b2X2i + … + bkXki

Categorical variables with two levels may be entered directly as predictor variables in a multiple regression model. Their use in multiple regression is a straightforward extension of their use in simple linear regression. When they are entered as predictor variables, interpretation of regression weights depends on how the variable is coded. When a researcher wishes to include a categorical variable with more than two levels in a multiple regression prediction model, additional steps are needed to ensure that the results are interpretable. These steps include recoding the categorical variable into a number of separate, dichotomous variables: dummy coding.

Example Data: Faculty Salary Data

Faculty  Salary  Gender  Rank  Dept  Years  Merit
   1       Y1      0       3     1     0     1.47
   2       Y2      1       2     2     8     4.38
   3       Y3      1       3     2     9     3.65
   4       Y4      1       1     1     0     1.64
   5       Y5      1       1     3     0     2.54
   6       Y6      1       1     3     1     2.06
   7       Y7      0       3     1     4     4.76
   8       Y8      1       1     2     0     3.05
   9       Y9      0       3     3     3     2.73
  10       Y10     1       2     1     0     3.14

The simplest case of dummy coding is one in which the categorical variable has three levels and is converted to two dichotomous variables.
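The reference-group logic above can be checked numerically. The sketch below, in plain Python with no libraries assumed, builds the dummy-coded design matrix for the life-satisfaction data and solves the least-squares normal equations directly: the intercept recovers the mean of the divorced (all-zeros reference) group, and each b weight recovers a group-mean difference from it.

```python
# Life-satisfaction data from the table above; divorced is the
# all-zeros reference group (X1 = single, X2 = married).
groups = {"single": [25, 28, 20, 26, 25],
          "married": [30, 28, 32, 33, 28],
          "divorced": [20, 22, 28, 25, 24]}

X, y = [], []
for name, scores in groups.items():
    x1 = 1 if name == "single" else 0
    x2 = 1 if name == "married" else 0
    for s in scores:
        X.append([1, x1, x2])   # leading 1 carries the intercept a
        y.append(s)

def solve(A, b):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Normal equations (X'X) b = X'y for the three coefficients a, b1, b2.
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
a, b1, b2 = solve(XtX, Xty)
# a  ≈ 23.8  (mean of the divorced reference group)
# b1 ≈ 1.0   (single mean 24.80 minus 23.80)
# b2 ≈ 6.4   (married mean 30.20 minus 23.80)
```

This is only a pedagogical check; in practice one would hand the dummy-coded columns to a regression routine, which yields the same coefficients.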
Dept                      Psyc  Curri
Psychology          1      1     0
Curriculum          2      0     1
Special Education   3      0     0

A listing of the recoded data is presented below.

Faculty  Dept  Psyc  Curri  Salary
   1      1     1     0       Y1
   2      2     0     1       Y2
   3      2     0     1       Y3
   4      1     1     0       Y4
   5      3     0     0       Y5
   6      3     0     0       Y6
   7      1     1     0       Y7
   8      2     0     1       Y8
   9      3     0     0       Y9
  10      1     1     0       Y10

Combinations and Interaction of Categorical Predictor Variables

The previous examples dealt with individual categorical predictor variables with two or more levels. The following example illustrates how to create a new dummy coded variable that represents the interaction of certain variables.

Suppose we are looking at how gender, parental responsiveness, and the combination of gender and parental responsiveness influence children's social confidence. Confidence scores serve as the dependent variable, with gender and parental responsiveness scores (response) serving as the categorical independent variables. Response has three levels: high level, medium level, and low level. The analysis may be thought of as a two-factor ANOVA design, as below:

Response Scale Values

between gender and response. Gender is already dummy coded in the data file with males = 0 and females = 1. Next, we will dummy code the response variable into two dummy coded variables, one new variable for each degree of freedom among groups for the response main effect (see below). Note that the low level of response is the reference group (shaded in the table below). Therefore, we have 2 (i.e., 3 − 1) dummy-coded variables for response.

                  Dummy Coding
Original Coding   Low  Medium  High
0 (Low)            0     0      0
1 (Medium)         0     1      0
2 (High)           0     0      1

Last come the third set of dummy-coded variables, which represent the interaction source of variability. The new variables are labeled as G*Rmid (meaning gender interacting with the medium level of response), G*Rhigh (meaning gender interacting with the high level of response), and G*Rlow (meaning gender interacting with the low level of response), which is the reference group. To dummy code variables that represent interaction effects of categorical variables, we simply use the products of the dummy codes that were constructed separately for each of the variables. In this case, we simply multiply gender by dummy coded response. Note that there are as many new interaction dummy coded variables created as there are degrees of freedom for the interaction term in the ANOVA design. The newly dummy coded variables are as follows:

Gender         Response: Medium  Response: High  G*Rmid         G*Rhigh
(main effect)  (main effect)     (main effect)   (interaction)  (interaction)
     0               0                 0               0              0
     0               0                 0               0              0
     0               1                 0               0              0
     0               1                 0               0              0
     0               0                 1               0              0
     0               0                 1               0              0
     1               0                 0               0              0
     1               0                 0               0              0
     1               1                 0               1              0
     1               1                 0               1              0
     1               0                 1               0              1
     1               0                 1               0              1

As in an ANOVA design analysis, the first hypothesis of interest to be tested is the interaction effect. In a multiple regression model, the first analysis tests this effect in terms of its "unique contribution" to the explanation of confidence scores. This can be realized by entering the response dummy-coded variables as a block into the model after gender has been entered. The Gender × Response dummy-coded interaction variables are therefore entered as the last block of variables in creating the "full" regression model, that is, the model with all three effects in the equation. Part of the coefficients table is presented below.

Model                 B        t        p
1  (Constant)      18.407   29.660   .000**
   Gender           2.073    2.453   .017*
2  (Constant)      19.477   22.963   .000**
   Gender           2.045    2.427   .018*
   Response (mid)  −1.401   −1.518   .133
   Response (high) −2.270   −1.666   .100
3  (Constant)      20.229   19.938   .000**
   Gender            .598     .425   .672
   Response (mid)  −1.920   −1.449   .152
   Response (high) −5.187   −2.952   .004*
   G*Rmid           1.048     .584   .561
   G*Rhigh          6.861    2.570   .012

Notes: *p < .025. **p < .001.

The regression equation is

    Ŷ = 20.229 + .598X1 − 1.920X2 − 5.187X3 + 1.048X4 + 6.861X5,

in which X1 = gender (female), X2 = response (medium level), X3 = response (high level), X4 = Gender by Response (Female with Mid Level), and X5 = Gender by Response (Female with High Level). Now, b1 = .598 tells us girls receiving a low level of parental response have higher confidence scores than boys receiving the same level of parental response, but this difference is not significant (p = .672 > .05), and b2 = −1.920 tells us children receiving a medium level of parental response have lower confidence scores than children receiving a low level of parental response, but this difference is not significant (p = .152 > .05). However, b5 = 6.861 tells us girls receiving a high
level of parental response tend to score higher than do boys receiving this or other levels of response, and this difference is significant (p = .012 < .025). The entire model tells us the pattern of mean confidence scores across the three parental response groups for boys is sufficiently different from the pattern of mean confidence scores for girls across the three parental response groups (p < .001).

Jie Chen

See also Categorical Variable; Estimation

Further Readings

Aguinis, H. (2004). Regression analysis for categorical moderators. New York: Guilford.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks, CA: Sage.
Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
Brannick, M. T. Categorical IVs: Dummy, effect, and orthogonal coding. Retrieved September 15, 2009, from http://luna.cas.usf.edu/~mbrannic/files/regression/anova1.html
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Gupta, R. (2008). Coding categorical variables in regression models: Dummy and effect coding. Retrieved September 15, 2009, from http://www.cscu.cornell.edu/news/statnews/stnews72.pdf
Keith, T. Z. (2006). Multiple regression and beyond. Boston: Allyn & Bacon.
Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction. Eagan, MN: Thomson Learning.
Stockburger, D. W. (1997). Multiple regression with categorical variables. Retrieved September 15, 2009, from http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Warner, R. M. (2008). Applied statistics: From bivariate through multivariate techniques. Thousand Oaks, CA: Sage.

DUNCAN'S MULTIPLE RANGE TEST

Duncan's multiple range test, or Duncan's test, or Duncan's new multiple range test, provides significance levels for the difference between any pair of means, regardless of whether a significant F resulted from an initial analysis of variance. Duncan's test differs from the Newman–Keuls test (which slightly preceded it) in that it does not require an initial significant analysis of variance. It is a more powerful (in the statistical sense) alternative to almost all other post hoc methods.

When introducing the test in a 1955 article in the journal Biometrics, David B. Duncan described the procedures for identifying which pairs of means resulting from a group comparison study with more than two groups are significantly different from each other. Some sample mean values taken from the example presented by Duncan are given. Duncan worked in agronomics, so imagine that the means represent agricultural yields on some metric. The first step in the analysis is to sort the means in order from lowest to highest, as shown.

Groups   A     F     G     D     C     B     E
Means   49.6  58.1  61.0  61.5  67.6  71.2  71.3

From tables of values that Duncan developed from the t-test formula, standard critical differentials at the .05 level are identified. These are significant studentized differences, which must be met or surpassed. To maintain the nominal significance level one has chosen, these differentials get slightly higher as the two means that are compared become further apart in terms of their rank ordering. In the example shown, the means for groups A and F have an interval of 2 because they are adjacent to each other. Means A and E have an interval of 7 as there are seven means in the span between them. By multiplying the critical differentials by the standard error of the mean, one can compute the shortest significant ranges for each interval width (in the example, the possible intervals are 2, 3, 4, 5, 6, and 7). With the standard error of the mean of 3.643 (which is supplied by Duncan for this example), the shortest significant ranges are calculated.
Dunnett’s Test 395
5 3.643.743.793.833.833.833.833.833.83
10 3.153.303.373.433.463.473.473.473.47 DUNNETT’S TEST
15 3.013.163.253.313.363.383.403.423.43
20 2.953.103.183.253.303.343.363.383.40 Dunnett’s test is one of a number of a posteriori or
30 2.893.043.123.203.253.293.323.353.37 post hoc tests, run after a significant one-way anal-
60 2.832.983.083.143.203.243.283.313.33 ysis of variance (ANOVA), to determine which dif-
100 2.802.953.053.123.183.223.263.293.32 ferences are significant. The procedure was
Note: Significant Studentized Ranges at the .05 Level for introduced by Charles W. Dunnett in 1955. It
Duncan’s Multiple Range Test. differs from other post hoc tests, such as the
Newman–Keuls test, Duncan’s Multiple Range
The philosophical approach taken by Duncan is test, Scheffé’s test, or Tukey’s Honestly Significant
an unusually liberal one. It allows for multiple Difference test, in that its use is restricted to
Background

A one-way ANOVA tests the null hypothesis (H0) that all the k treatment means are equal; that is, H0: μ1 = μ2 = … = μk.

    q′ = (X̄i − X̄R) / √[2 × MSerror (1/ni + 1/nR)],

Table 1  Results of a Fictitious Experiment Comparing Three Treatments Against a Reference Condition

Group       Mean    Standard Deviation
Reference   50.00         3.55
A           61.00         4.24
B           52.00         2.45
C           45.00         3.74

Table 2  The ANOVA Table for the Results in Table 1

Source of Variance  Sum of Squares  df  Mean Square    F        p
Between groups         536.000       3    178.667    14.105  < .001
Within groups          152.000      12     12.667
Total                  688.000      15

meaning that the 95% CI for Group A is
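The ANOVA summary in Table 2 can be reproduced from the group summaries in Table 1. A minimal sketch follows; note that the per-group sample size n = 4 is an inference from the table's 12 within-group degrees of freedom (4 groups × (4 − 1)) rather than a figure stated in the entry.

```python
# Rebuild Table 2 from the means and standard deviations in Table 1,
# assuming n = 4 observations per group (inferred from df = 12).
n = 4
groups = {"Reference": (50.00, 3.55), "A": (61.00, 4.24),
          "B": (52.00, 2.45), "C": (45.00, 3.74)}

k = len(groups)
grand_mean = sum(m for m, _ in groups.values()) / k   # valid with equal n
ss_between = n * sum((m - grand_mean) ** 2 for m, _ in groups.values())
ss_within = sum((n - 1) * sd ** 2 for _, sd in groups.values())
ms_between = ss_between / (k - 1)        # 536 / 3 = 178.667
ms_within = ss_within / (k * (n - 1))    # about 152 / 12
F = ms_between / ms_within               # about 14.1, matching Table 2
```

The small discrepancy between this F and the tabled 14.105 comes only from the rounding of the standard deviations in Table 1.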
400 Ecological Validity
be taken into account when evaluating ecological validity. For example, the Grocery List Selective Reminding Test is a test that uses real-world stimuli. Unlike traditional paired associate or list-learning tests, which often use arbitrary stimuli, the Grocery List Selective Reminding Test employs a grocery list to evaluate verbal learning. Naturally occurring stimuli increase the ecological validity of neuropsychological tests.

Behavioral Response

Another important dimension of ecological validity is assuring that behavioral responses elicited are representative of the person's natural behaviors and appropriately related to the construct being measured. Increasing levels of ecological validity in a simulator assessment of driving would be represented by moving the cursor with the arrow keys, with the mouse, or with a steering wheel, respectively. The more the response approximates the criterion, the greater the ecological validity.

The two main methods of establishing ecological validity are veridicality and verisimilitude. These methods are related to, but not isomorphic with, the traditional constructs of concurrent validity/predictive validity and construct validity/face validity, respectively.

Commonly used outcome measures to which traditional neuropsychological tests are correlated in the veridicality approach are the Dysexecutive Questionnaire (DEX) and the Behavior Rating Inventory of Executive Functioning.

One limitation of the veridicality approach is that the outcome measures selected for comparison with the traditional neuropsychological test may not accurately represent the client's everyday functioning. Also, many of the traditional neuropsychological tests evaluated using the veridicality approach were developed to diagnose brain pathology, not make predictions about daily functioning.

Verisimilitude

Verisimilitude is the degree to which tasks performed during testing resemble tasks performed in daily life. With the verisimilitude approach, tests are created to simulate real-world tasks. Some limitations of the verisimilitude approach include the cost of creating new tests and the reluctance of clinicians to put these new tests into practice. Mere face validity cannot be substituted for empirical research when assessing the ecological validity of neuropsychological tests formed from this approach.

Wisconsin Card Sorting Test are capable of effectively predicting occupational status only. Newer tests are being developed to encompass verisimilitude in the study of executive functioning. These tests include the Virtual Planning Test and the Behavioral Assessment of Dysexecutive Syndrome.

Attention

Although research on the veridicality of tests of attention is limited, there is reasonable evidence that traditional tests of attention are ecologically valid. More research should be conducted to verify current results, but the ecological validity of these traditional tests is promising. Although some investigators are not satisfied with traditional tests for attention deficit/hyperactivity disorder (ADHD), researchers have found evidence of predictive validity in the Hayling test in children with ADHD. The Test of Everyday Attention (TEA), which was developed using the verisimilitude approach, is an assessment tool designed to evaluate attentional switching, selective attention, and sustained attention. Investigators have found correlations between the TEA and other standardized measures of attention, including the Stroop Color-Word Test, the Symbol Digit Modalities Test, and the Paced Auditory Serial Addition Tests.

Memory Tests

The Rivermead Behavioral Memory Test (RBMT), designed using the verisimilitude approach, is a standardized test used to assess everyday memory functioning. The memory tasks in the RBMT resemble everyday memory demands, such as remembering a name or an appointment. Significant correlations have been demonstrated between the RBMT and other traditional memory tests as well as between the RBMT and ratings of daily functioning by subjects, significant others, and clinicians. Some studies have revealed the superiority of the RBMT and the TEA in predicting everyday memory functioning or more general functioning when compared to more traditional neuropsychological tests. In addition to the RBMT, other tests that take verisimilitude into account include the 3-Objects-3-Places, the Process Dissociation Procedure, and the Memory in Reality. Research also suggests that list learning tasks have suitable ecological validity to aid the diagnosis and outcome prediction of patients with epilepsy.

Perception

Research on ecological validity of perceptual tests is limited. The Behavioral Inattention Test was developed to assist the prediction of everyday problems arising from unilateral visual neglect. Ecological validity of the Wechsler Adult Intelligence Scale–Revised has been shown for assessing visuoconstructive skills. Investigators used subtests such as Block Design, Object Assembly, and Picture Completion and found that poor performance predicts problems in daily living.

Virtual Tests

With increasing advances in cyber technology, neuropsychological assessments are turning to computers as an alternative to real-world behavioral observations. One innovative approach has been the use of virtual reality scenarios in which subjects are exposed to machines that encompass 3-D, real-world-like scenes and are asked to perform common functions in these environments, allowing naturalistic stimulus challenges while maintaining experimental control. These tests include the Virtual Reality Cognitive Performance Assessment Test, a virtual city, the Virtual Office, and a simulated street to assess memory and executive functioning. These methods suggest that virtual tests may provide a new, ecological measure for examining memory deficits in patients.

Other Applications

Academic Tests

There is a high degree of variance in our educational systems, from grading scales and curricula to expectations and teacher qualifications. Because of such high variability, colleges and graduate programs use standardized tests as part of their admissions procedure. Investigators have found that the American College Test has low predictive validity of first-year grades as well as graduation grades for students attending undergraduate programs. Also, correlations have been found between Scholastic Assessment Test Math (SATM) and Scholastic Assessment Test Verbal (SATV) test scores and
402 Ecological Validity
overall undergraduate GPA, but the SATM may underpredict women's grades.

Studies suggest that the Graduate Record Exam (GRE) is capable of at least modestly predicting first-year grades in graduate school and veterinary programs as well as graduate grade point average, faculty ratings, comprehensive examination scores, citation counts and degree attainment across departments, acceptance into PhD programs, external awards, graduation on time, and thesis publication. Although some studies suggest that the GRE is an ecologically valid tool, there is debate about how much emphasis to place on GRE scores in the postgraduate college admissions process. The Medical College Admission Test (MCAT) has been shown to be predictive of success on written tests assessing skills in clinical medicine. In addition, the MCAT was able to positively predict performance on physician certification exams.

Activities of Daily Living

To evaluate patients' ability to function independently, researchers have investigated the accuracy of neuropsychological tests in predicting patients' capacities to perform activities of daily living (ADL), such as walking, bathing, dressing, and eating. Research studies found that neuropsychological tests correlated significantly with cognitive ADL skills involving attention and executive functioning. Overall, ADL research demonstrates low to moderate levels of ecological validity. Ecological validity is improved, however, when the ADLs evaluated have stronger cognitive components. Driving is an activity of daily living that has been specifically addressed in the ecological validity literature, and numerous psychometric predictors have been identified. But none does better at prediction than the actual driving of a small-scale vehicle on a closed course. In like manner, a wheelchair obstacle course exemplifies an ecologically valid outcome measure for examining the outcome of visual scanning training in persons with right brain damage.

Vocational Rehabilitation

Referral questions posed to neuropsychologists have shifted from diagnostic issues to rehabilitative concerns. In an effort to increase the ecological validity of rehabilitation programs, the disability management movement, which uses work environments as rehabilitation sites instead of vocational rehabilitation centers, emerged.

The Behavioral Assessment of Vocational Skills is a performance-based measure able to significantly predict workplace performance. Also, studies have shown that psychosocial variables significantly predict a patient's ability to function effectively at work. When predicting employment status, the Minnesota Multiphasic Personality Inventory is one measure that has been shown to add ecological validity to neuropsychological test performance.

Employment

Prediction of a person's ability to resume employment after disease or injury has become increasingly important as potential employers turn to neuropsychologists with questions about job capabilities, skills, and performance. Ecological validity is imperative in the assessment of employability because of the severe consequences of inaccurate diagnosis. One promising area of test development is simulated vocational evaluations (SEvals), which ask participants to perform a variety of simulated vocational tasks in environments that approximate actual work settings. Research suggests that the SEval may aid evaluators in making vocational decisions. Other attempts at improving employment predictions include the Occupational Abilities and Performance Scale and two self-report questionnaires, the Work Adjustment Inventory and the Working Inventory.

Forensic Psychology

Forensic psychology encompasses a vast spectrum of legal issues, including prediction of recidivism, identification of malingering, and assessment of damages in personal injury and medico-legal cases. There is little room for error in these predictions, as much may be at stake.

Multiple measures have been studied for the prediction of violent behavior, including (a) the Psychopathy Checklist, which has shown predictive ability in rating antisocial behaviors such as criminal violence, recidivism, and response to correctional treatment; (b) the MMPI-2, which has a psychopathy scale (scale 4) and is sensitive to
antisocial behavior; (c) the California Psychological Inventory, which is a self-report questionnaire that provides an estimate of compliance with society's norms; and (d) the Mental Status Examination, in which the evaluator obtains personal history information, reactions, behaviors, and thought processes. In juveniles, the Youth Level of Service/Case Management Inventory (YLS/CMI) has provided significant information for predicting recidivism in young offenders; however, the percentage of variance predicted by the YLS/CMI was low.

Neuropsychologists are often asked to determine a client's degree of cognitive impairment after a head injury so that the estimated lifetime impact can be calculated. In this respect, clinicians must be able to make accurate predictions about the severity of the cognitive deficits caused by the injury.

Limitations and Implications for the Future

In addition to cognitive capacity, other variables that influence individuals' everyday functioning include environmental cognitive demands, compensatory strategies, and noncognitive factors. These variables hinder researchers' attempts at demonstrating ecological validity. With regard to environmental cognitive demands, for example, an individual in a more demanding environment will demonstrate more functional deficits in reality than an individual with the same cognitive capacity in a less demanding environment. To improve ecological validity, the demand characteristics of an individual's environment should be assessed. Clients' consistency in their use of compensatory strategies across situations will also affect ecological validity. Clinicians may underestimate a client's everyday functional abilities if compensatory strategies are not permitted during testing or if the client simply chooses not to use his or her typical repertoire of compensatory skills during testing. Also, noncognitive factors, including psychopathology, malingering, and premorbid functioning, impede the predictive ability of assessment instruments.

A dearth of standardized outcome measures, variable test selection, and population effects are other limitations of ecological validity research. Mixed results in the current ecological validity literature may be a result of using inappropriate outcome measures. More directed hypotheses attempting to delineate the relationship between particular cognitive constructs and more specific everyday abilities involving those constructs may increase the ecological validity of neuropsychological tests. However, there is some disagreement as to which tests appropriately measure various cognitive constructs. Presently, comparing across ecological validity research studies is challenging because of the wide variety of outcome measures, neuropsychological tests, and populations assessed.

Great strides have been made in understanding the utility of traditional tests and developing new and improved tests that increase psychologists' abilities to predict people's functioning in everyday life. As our understanding of ecological validity increases, future research should involve more encompassing models, which take variables other than test results into account. Interviews with the client's friends and family, medical and employment records, academic reports, client complaints, and direct observations of the client can be helpful to clinicians faced with ecological questions. In addition, ecological validity research should address test environments, environmental demands, compensatory strategies, noncognitive factors, test and outcome measure selection, and population effects in order to provide a foundation from which general conclusions can be drawn.

William Drew Gouvier, Alyse A. Barker, and Mandi Wilkes Musso

See also Concurrent Validity; Construct Validity; Face Validity; Predictive Validity

Further Readings

Chaytor, N., & Schmitter-Edgecombe, M. (2003). The ecological validity of neuropsychological tests: A review of the literature on everyday cognitive skills. Neuropsychology Review, 13(4), 181–197.
Farias, S. T., Harrell, E., Neumann, C., & Houtz, A. (2003). The relationship between neuropsychological performance and daily functioning in individuals with Alzheimer's disease: Ecological validity of neuropsychological tests. Archives of Clinical Neuropsychology, 18, 655–672.
Farmer, J., & Eakman, A. (1995). The relationship between neuropsychological functioning and instrumental activities of daily living following acquired brain injury. Applied Neuropsychology, 2, 107–115.
404 Effect Coding
\[
R^2_{Y.1,\ldots,J} = \frac{SS_{\text{regression}}}{SS_{\text{regression}} + SS_{\text{error}}} = \frac{SS_{\text{regression}}}{SS_{\text{total}}}. \tag{7}
\]

Significance Test

In order to assess the significance of a given R²_{Y.1,...,J}, we can compute an F ratio as

\[
F = \frac{R^2_{Y.1,\ldots,J}}{1 - R^2_{Y.1,\ldots,J}} \times \frac{N - J - 1}{J}. \tag{8}
\]

Under the usual assumptions of normality of the error and of independence of the error and the scores, this F ratio is distributed under the null hypothesis as a Fisher distribution with ν1 = J and ν2 = N − J − 1 degrees of freedom.

Analysis of Variance Framework

For an ANOVA, the goal is to compare the means of several groups and to assess whether these means are statistically different. For the sake of simplicity, we assume that each experimental group comprises the same number of observations, denoted I (i.e., we are analyzing a "balanced design"). So, if we have K experimental groups with I observations per group, we have a total of K × I = N observations, denoted Y_{i,k}. The first step is to compute the K experimental means, denoted M_{+,k}, and the grand mean, denoted M_{+,+}. The ANOVA evaluates the difference between the means by comparing the dispersion of the experimental means around the grand mean (i.e., SS_between) with the dispersion of the observations within each group (i.e., SS_within). If the dispersion of the means around the grand mean is due only to random fluctuations, then the SS_between and the SS_within should be commensurable. Specifically, the null hypothesis of no effect can be evaluated with an F ratio computed as

\[
F = \frac{SS_{\text{between}}}{SS_{\text{within}}} \times \frac{N - K}{K - 1}. \tag{11}
\]

Under the usual assumptions of normality of the error and of independence of the error and the scores, this F ratio is distributed under the null hypothesis as a Fisher distribution with ν1 = K − 1 and ν2 = N − K degrees of freedom. If we denote by R²_experimental the following ratio,

\[
R^2_{\text{experimental}} = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}}, \tag{12}
\]

we can re-express Equation 11 in order to show its similarity with Equation 8 as

\[
F = \frac{R^2_{\text{experimental}}}{1 - R^2_{\text{experimental}}} \times \frac{N - K}{K - 1}. \tag{13}
\]

Analysis of Variance With Effect Coding Multiple Linear Regression

The similarity between Equations 8 for MLR and 13 for ANOVA suggests that these two methods are related, and this is indeed the case. In fact, the computations for an ANOVA can be performed with MLR via a judicious choice of
the matrix X (the dependent variable is represented by the vector y). In all cases, the first column of X will be filled with 1s and is coding for the value of the intercept. One possible choice for X, called mean coding, is to have one additional column in which the value for the nth observation will be the mean of its group. This approach provides a correct value for the sums of squares but not for the F (which needs to be divided by K − 1). Most coding schemes will use J = K − 1 linearly independent columns (as many columns as there are degrees of freedom for the experimental sum of squares). They all give the same correct values for the sums of squares and the F test but differ for the values of the intercept and the slopes. To implement effect coding, the first step is to select a group called the contrasting group; often, this group is the last one. Then, each of the remaining J groups is contrasted with the contrasting group. This is implemented by creating a vector for which all elements of the contrasting group have the value −1, all elements of the group under consideration have the value +1, and all other elements have a value of 0.

With the effect coding scheme, the intercept is equal to the grand mean, and each slope coefficient is equal to the difference between the grand mean and the mean of the group whose elements were coded with values of +1. This difference estimates the experimental effect of this group, hence the name effect coding for this coding scheme. The mean of the contrasting group is equal to the intercept minus the sum of all the slopes.

Example

The data used to illustrate effect coding are shown in Table 1. A standard ANOVA would give the results displayed in Table 2.

Table 1  A Data Set for an ANOVA

        a1   a2   a3   a4
S1      20   21   17    8
S2      17   16   16   11
S3      17   14   15    8
M+,k    18   17   16    9    M+,+ = 15

Note: A total of N = 12 observations coming from K = 4 groups with I = 3 observations per group.

Table 2  ANOVA Table for the Data From Table 1

Source        df     SS      MS      F     Pr(F)
Experimental   3   150.00   50.00  10.00   .0044
Error          8    40.00    5.00
Total         11   190.00

In order to perform an MLR analysis, the data from Table 1 need to be "vectorized" in order to provide the following y vector:

\[
\mathbf{y} = (20,\, 17,\, 17,\, 21,\, 16,\, 14,\, 17,\, 16,\, 15,\, 8,\, 11,\, 8)^{\mathsf{T}}. \tag{14}
\]

In order to create the N = 12 by J + 1 = 3 + 1 = 4 matrix X, we have selected the fourth experimental group to be the contrasting group. The first column of X codes for the intercept and is composed only of 1s. For the other columns of X, the values for the observations of the contrasting group will all be equal to −1. The second column of X will use values of 1 for the observations of the first group, the third column of X will use values of 1 for the observations of the second group, and the fourth column of X will use values of 1 for the observations of the third group:

\[
\mathbf{X} =
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & -1 & -1 & -1 \\
1 & -1 & -1 & -1 \\
1 & -1 & -1 & -1
\end{bmatrix}. \tag{15}
\]
Effect Size, Measures of 407
example, a researcher may want to find out whether smokers have greater chances of having lung cancer compared to nonsmokers. He or she may do a meta-analysis with studies reporting how many patients, among smokers and nonsmokers, were diagnosed with lung cancer. The odds ratio is appropriate to use when the report is for a single study. One can compare study results by investigating odds ratios for all of these studies. The other commonly used effect size in meta-analysis is the correlation coefficient. It is a more direct approach to tell the association between two variables.

Cohen's d

Cohen's d is defined as the population mean difference divided by the common standard deviation. This definition is based on the t test on means and can be interpreted as the standardized difference between two means. Cohen's d assumes equal variance of the two populations. For two independent samples, it can be expressed as

\[
d = \frac{m_A - m_B}{\sigma}
\]

for a one-tailed effect size index and

\[
d = \frac{|m_A - m_B|}{\sigma}
\]

for a two-tailed effect size index. Here, m_A and m_B are two population means in their raw scales, and σ is the standard deviation of either population (both populations have equal variance). Because the population means and standard deviations are usually unknown, sample means and standard deviations are used to estimate Cohen's d. The one-tailed and two-tailed effect size indexes for a t test of means in standard units are

\[
d = \frac{\bar{x}_A - \bar{x}_B}{s} \quad \text{and} \quad d = \frac{|\bar{x}_A - \bar{x}_B|}{s},
\]

where x̄_A and x̄_B are sample means, and s is the common standard deviation of both samples.

For example, a teacher wanted to know whether the ninth-grade boys or the ninth-grade girls in her school were better at reading and writing. She randomly selected 10 boys and 10 girls from all ninth-grade students and obtained the reading and writing exam score means of all boys and girls, say, 67 and 71, respectively. She found that the standard deviations of both groups were 15. Then for a two-tailed test, the effect size is

\[
d = \frac{|\bar{x}_A - \bar{x}_B|}{s} = \frac{|67 - 71|}{15} = 0.27.
\]

Cohen used the terms small, medium, and large to represent the relative size of effect sizes. The corresponding numbers are 0.2, 0.5, and 0.8, respectively. In the example above, the effect size of 0.27 is small, which indicates that the difference between reading and writing exam scores for boys and girls is small.

Glass's g

Similarly, Glass proposed an effect size estimator using a control group's standard deviation to standardize the mean difference:

\[
g' = \frac{\bar{x}^E - \bar{x}^C}{s_C}.
\]

Here, x̄^E and x̄^C are the sample means of an experimental group and a control group, respectively, and s_C is the standard deviation of the control group. This effect size assumes multiple treatment comparisons to the control group, and that treatment standard deviations differ from each other.

Hedges's g

However, neither Cohen's d nor Glass's g takes the sample size into account, and the equal population variance assumption may not hold. Hedges proposed a modification to estimate effect size as

\[
g = \frac{\bar{x}^E - \bar{x}^C}{s}, \quad \text{where} \quad
s = \sqrt{\frac{(n_E - 1)s_E^2 + (n_C - 1)s_C^2}{n_E + n_C - 2}}.
\]

Here, n_E and n_C are the sample sizes of treatment and control, and s_E and s_C are the sample standard deviations of treatment and control. Compared to the above effect size estimators, Hedges's g uses the pooled sample standard deviation to standardize the mean difference.

However, the above estimator has a small-sample bias. An approximately unbiased estimator of effect size defined by Hedges and Olkin is

\[
g = \frac{\bar{x}^E - \bar{x}^C}{s}\left(1 - \frac{3}{4N - 9}\right), \quad \text{where} \quad
s = \sqrt{\frac{(n_E - 1)s_E^2 + (n_C - 1)s_C^2}{n_E + n_C - 2}}.
\]

Here, N is the total sample size of both groups. Especially when sample sizes in treatment and control are equal, this estimator is the unique minimum variance unbiased estimator. Many meta-analysis software packages like Metawin use
The scale is different from the effect sizes of continuous variables, such as Cohen's d and Hedges's g, so it is not appropriate to compare the size of the odds ratio with the effect sizes described above.

Pearson Correlation Coefficient (r)

The Pearson correlation coefficient (r) is also a popular effect size. It was first introduced by Karl Pearson to measure the strength of the relationship between two variables. The range of the Pearson correlation coefficient is from −1 to 1. Cohen gave general guidelines for the relative sizes of the Pearson correlation coefficient as small, r = 0.1; medium, r = 0.3; and large, r = 0.5. Many statistical packages, such as SAS and IBM® SPSS® (PASW) 18.0 (an IBM company, formerly called PASW® Statistics), and Microsoft's Excel can compute the Pearson correlation coefficient. Many meta-analysis software packages, such as Metawin and Comprehensive Meta-Analysis, allow users to use correlations as data and calculate effect size with Fisher's z transformation. The transformation formula is

\[
z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right),
\]

where r is the correlation coefficient.

The square of the correlation coefficient is also an effect size used to measure how much variance is explained by one variable versus the total variance. It is discussed in detail below.

is the total variance of the dependent variable, and σ²_RL is the variance explained by other variables. The range of the squared correlation coefficient is from 0 to 1. It can be interpreted as the proportion of variance shared by two variables. For example, an r² of 0.35 means that 35% of the total variance is shared by two variables.

Eta-Squared (η²) and Partial Eta-Squared (η²_p)

Eta-squared and partial eta-squared are effect sizes used in ANOVAs to measure degree of association in a sample. The effect can be a main effect or an interaction in an analysis of variance model. It is defined as the sum of squares of the effect versus the sum of squares of the total. Eta-squared can be interpreted as the proportion of variability caused by that effect for the dependent variable. The range of an eta-squared is from 0 to 1. Suppose there is a study of the effects of education and experience on salary, and that the eta-squared of the education effect is 0.35. This means that 35% of the variability in salary was caused by education.

Eta-squared is additive, so the eta-squared of all effects in an ANOVA model sums to 1. All effects include all main effects and interaction effects, as well as the intercept and error effect in an ANOVA table.
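Fisher's z transformation above is a one-liner in code; a quick Python sketch (mine, not the entry's):

```python
import math

def fisher_z(r):
    """Fisher's z transformation: z = (1/2) ln((1 + r) / (1 - r)), for |r| < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Cohen's "medium" correlation of r = 0.3:
print(round(fisher_z(0.3), 4))  # 0.3095

# The transformation is the inverse hyperbolic tangent,
# so math.atanh(r) returns the same value.
```

Meta-analysts average correlations on the z scale (which is approximately normal with stable variance) before transforming back, which is why packages such as those named above apply it.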
Partial eta-squared is defined as the sum of squares of the effect versus the sum of squares of the effect plus error. For the same effect in the same study, partial eta-squared is always larger than eta-squared. This is because the denominator in partial eta-squared is smaller than that in eta-squared. For the previous example, the partial eta-squared may be 0.49 or 0.65; it cannot be less than 0.35. Unlike eta-squared, the sum of all effects' partial eta-squared may not be 1, and in fact can be larger than 1.

The statistical package SPSS will compute and print out partial eta-squared as the effect size for analysis of variance.

Omega Squared (ω²)

Omega squared is an effect size used to measure the degree of association in fixed and random effects analysis of variance studies. It is the relative reduction in variance caused by an effect. Unlike eta-squared and partial eta-squared, omega squared is an estimate of the degree of association in a population, instead of in a sample.

Intraclass Correlation (ρ²_I)

Intraclass correlation is also an estimate of the degree of association in a population in random effects models, especially in psychological studies. The one-way intraclass correlation coefficient is defined as the proportion of variance of a random effect versus the variance of this effect and error variance. One estimator is

\[
\hat{\rho}^2_I = \frac{MS_{\text{effect}} - MS_{\text{error}}}{MS_{\text{effect}} + df_{\text{effect}}\, MS_{\text{error}}},
\]

where MS_effect and MS_error are the mean squares of the effect and error, that is, the mean squares of the between-group and within-group effects.

Other Effect Sizes

R² in Multiple Regression

As with the other effect sizes discussed in the regression or analysis of variance sections, R² is a statistic used to represent the portion of variance explained by explanatory variables versus the total variance. The range of R² is from 0, meaning no relation between the dependent variable and explanatory variables, to 1, meaning all variance can be explained by the explanatory variables.

ω and Cramer's V

The effect sizes ω and Cramer's V are often used for categorical data based on chi-square. Chi-square is a nonparametric statistic used to test potential differences among two or more categorical variables. Many statistical packages, such as SAS and SPSS, show this statistic in output.

The effect size ω can be calculated from chi-square and the total sample size N as

\[
\omega = \sqrt{\frac{\chi^2}{N}}.
\]

However, this effect size is used only in the circumstances of 2 × 2 contingency tables. Cohen gave general guidelines for the relative size of ω: 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes, respectively.

For a table size greater than 2 × 2, one can use Cramer's V (sometimes called Cramer's φ) as the effect size to measure the strength of association. Popular statistical software such as SAS and SPSS can compute this statistic. One can also calculate it from the chi-square statistic using the formula

\[
V = \sqrt{\frac{\chi^2}{N L}},
\]

where N is the total sample size and L equals the number of rows minus 1 or the number of columns minus 1, whichever is less. The effect size of Cramer's V can be interpreted as the average multiple correlation between rows and columns. In a 2 × 2 table, Cramer's V is equal to the correlation coefficient. Cohen's guideline for ω is also appropriate for Cramer's V in 2 × 2 contingency tables.

Reporting Effect Size

Reporting effect size in publications, along with the traditional null hypothesis test, is important. The null hypothesis test tells readers whether an effect exists, but it won't tell readers whether the results are replicable without reporting an effect size. Research organizations such as the American Psychological Association suggest reporting effect size in publications along with significance tests. A general rule for researchers is that they should at least report descriptive statistics such as mean and standard deviation. Thus, effect size can be calculated and used for meta-analysis to compare with other studies.
Endogenous Variables 411
See also Analysis of Variance (ANOVA); Chi-Square Test; Correlation; Hypothesis; Meta-Analysis

Further Readings

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Iversen, G. R., & Norpoth, H. (1976). Analysis of variance (Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-001). Beverly Hills, CA: Sage.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A., & Song, F. (2000). Methods for meta-analysis in medical research (2nd ed.). London: Wiley.
Volker, M. A. (2006). Reporting effect size estimates in school psychology research. Psychology in the Schools, 43(6), 653–672.

The Problem of Endogeneity

One of the most commonly used statistical models is ordinary least squares regression (OLS). A variety of assumptions must hold for OLS to be the best unbiased estimator, including the independence of errors. In regression models, problems with endogeneity may arise when an independent variable is correlated with the error term of an endogenous variable. When observational data are used, as is the case with many studies in the social sciences, problems with endogeneity are more prevalent. In cases where randomized, controlled experiments are possible, such problems are often avoided.

Several sources influence problems with endogeneity: when the true value or score of a variable is not actually observed (measurement error), when a variable that affects the dependent variable is not included in the regression, and when recursivity exists between the dependent and independent variables (i.e., there is a feedback loop between the dependent and independent variables). Each of these sources may occur alone or in conjunction with other sources.

The solution to problems with endogeneity is often to use instrumental variables. Instrumental variables methods include two-stage least squares, limited information maximum likelihood, and jackknife instrumental variable estimators. Advantages of instrumental variables estimation include the transparency of procedures and the ability to test the appropriateness of instruments and the degree of endogeneity. Instrumental variables are beneficial only when they are strongly correlated with the endogenous variable and when they are exogenous to the model.

Endogenous Variables in Structural Equation Modeling

In structural equation modeling, including path analysis, factor analysis, and structural regression models, endogenous variables are said to be "downstream" of either exogenous variables or other endogenous variables. Thus, endogenous variables can be both cause and effect variables. Consider the simple path model in Figure 1. Variable A is exogenous; it does not have any variables causally prior to it in the model. B is endogenous; it is affected by the exogenous variable A while affecting C. C is also an endogenous variable, directly affected by B and indirectly affected by A.

Figure 1  Variables B and C Are Endogenous (path diagram: A → B → C)

As error associated with the measurement of endogenous variables can bias standardized direct effects on endogenous variables, structural equation modeling uses multiple measures of latent constructs in order to address measurement error.

Kristin Floress

See also Exogenous Variables; Latent Variable; Least Squares, Methods of; Regression to the Mean; Structural Equation Modeling

Further Readings

Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the explanatory variable is weak. Journal of the American Statistical Association, 90, 443–450.
Kennedy, P. (2008). A guide to econometrics. Malden, MA: Blackwell.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.

ERROR

Error resides on the statistical side of the fault line separating the deductive tools of mathematics from the inductive tools of statistics. On the mathematics side of the chasm lies perfect information, and on the statistics side exists estimation in the face of uncertainty. For the purposes of estimation, error describes the unknown, provides a basis for comparison, and serves as a hypothesized placeholder enabling estimation. This entry discusses the role of error from a modeling perspective and in the context of regression, ordinary least squares estimation, systematic error, random error, error distributions, experimentation, measurement error, rounding error, sampling error, and nonsampling error.

Modeling

For practical purposes, the universe is stochastic. For example, any "true" model involving gravity would require, at least, a parameter for every particle in the universe. One application of statistics is to quantify uncertainty. Stochastic or probabilistic models approximate relationships within some locality that contains uncertainty. That is, by holding some variables constant and constraining others, a model can express the major relationships of interest within that locality and amid an acceptable amount of uncertainty. For example, a model describing the orbit of a comet around the sun might contain parameters corresponding to the large bodies in the solar system and account for all remaining gravitational pulls with an error term.
Model equations employ error terms to represent uncertainty or the negligible contributions. Error terms are often additive or multiplicative placeholders, and models can have multiple error terms.

Additive: E = MC² + ε, where ε is an error term perfecting the equation
Multiplicative: y = α + x^β ε
Other: y = e^{β(x + ε_ME)} + ε, where ε_ME is measurement error corresponding to x, and ε is an additive error term.

The traditional modeling problem is to solve a set of inconsistent equations—characterized by the presence of more equations than unknowns. Early researchers cut their teeth on estimating physical relationships in astronomy and geodesy—the study of the size and shape of the earth—expressed by a set of k inconsistent linear equations of the following form:

remaining steps in building least squares regression, which combines two concepts involving errors: Estimate the coefficients by minimizing Σᵢ εᵢ², and assume that ε_ij ~ IID N(0, σ²). This progression of statistical innovations has culminated in a family of regression techniques incorporating a variety of estimators and error assumptions.

Statistical errors are the placeholders representing that which remains unquantified or inconsistent in a hypothesized relationship. In assuming that these inconsistencies behave reasonably, researchers are able to find reasonable solutions. Ordinary least squares (OLS) estimates are derived from fitting one equation to explain a set of inconsistent equations. There are basically six assumptions implicit in OLS estimation, all of which regard errors as follows:

1. Misspecification error is negligible—the functional form is reasonable and no significant xs are absent from the model.

1. Transform the response or regressors so that the distribution of errors is approximately normal.

The usual test statistic for this hypothesis is

\[
F = \frac{MS_{\text{treatment}}}{MS_{\text{error}}},
\]

which is the ratio of two estimators of σ²_ε. The numerator is

\[
MS_{\text{treatment}} = \frac{\sum_i n_i(\bar{y}_{i.} - \bar{y}_{..})^2}{p - 1},
\]

which is unbiased only if H0 is true. The denominator is

\[
MS_{\text{error}} = \frac{\sum_{ij} (y_{ij} - \bar{y}_{i.})^2}{N - p},
\]

which is unbiased regardless of whether H0 is true. George W. Snedecor recognized the value of this ratio and named it the F statistic in honor of Ronald A. Fisher, who was chiefly responsible for its derivation.

Difference is relative. The F test illustrates how error serves as a basis for comparison. If the treatment means vary relatively more than the observations within each treatment, that is, if MS_treatment is large relative to MS_error, then the statistician should infer that H0 is false. That is, if the discernible differences between the treatment means are unusually large relative to the unknown σ²_ε, then the differences are more likely to be genuine. ANOVA is an analysis of means based on analyzing variances of errors.

Measurement Error

Measurement error is the difference between the "true" value and the measured value. This is sometimes called observational error. For many models, one implied assumption is that the inputs and the outputs are measured accurately enough for the application. This is often false, especially with continuous variables, which can only be as accurate as the measurement and data storage devices allow.

The best solution for both objectives is to reduce the measurement error. This can be accomplished in three ways:

1. Improve the measurement device, possibly through calibration.
2. Improve the precision of the data storage device.
3. Replace x with a more accurate measure of the same characteristic, x_M.

The next most promising solution is to estimate the measurement error and use it to "adjust" the parameter estimates and the confidence intervals. There are three approaches:

1. Collect repeated measures of x on the same observations, thereby estimating the variance of the measurement error, σ²_ME, and using it to adjust the regression coefficients and the confidence intervals.
2. Calibrate x against a more accurate measure, x_M, which is unavailable for the broader application, thereby estimating the variance of the measurement error, σ²_ME.
3. Build a measurement error model based on a validation data set containing y and x alongside the more accurate and broadly unavailable x_M. As long as the validation data set is representative of the target population, the relationships can be extrapolated.

For the prediction problem, there is a third solu-
In practice, measurement error is often unstable tion for avoiding bias in the predictions, yet it does
and difficult to estimate, requiring multiple mea- not repair the biased coefficients or the ample con-
surements or independent knowledge. fidence intervals. The solution is to ensure that the
There are two negative consequences due to measurement error present when the model was
measurement error in the regressor, x. First, if the built is consistent as the model is applied.
measurement error variance is large relative to the
variability in x, then the coefficients will be biased.
Rounding Error
In a simple regression model for example, mea-
surement error in x will cause β ^0 to converge to Rounding error is often voluntary measurement
a slightly larger value than β0 and β ^ 1 to be ‘‘atten- error. The person or system causing the rounding
uated’’ that is, the measurement error shrinks β ^1 is now a second stage in the measurement device.
so that it will underestimate β1 . Second, if the Occasionally, data storage devices lack the same
measurement error in x is large relative to the vari- precision as the measurement device, and this cre-
ability of y, then this will increase the widths of ates rounding error. More commonly, people or
confidence intervals. Both of these problems inter- software collecting the information fail to retain
fere with the two primary objectives of modeling: the full precision of the data. After the data are
coefficient estimation and prediction. collected, it is common to find unanticipated
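Measurement error of this kind attenuates regression slopes, as described under Measurement Error above. The effect can be illustrated with a brief simulation (a sketch, not part of the entry; the true slope of 2, the error variances, and the sample size are arbitrary choices, and the OLS slope is computed directly from its textbook formula):

```python
import random

def ols_slope(x, y):
    """OLS slope: sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

random.seed(1)
n = 5000
true_x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + random.gauss(0, 0.5) for a in true_x]    # true slope beta1 = 2

slope_clean = ols_slope(true_x, y)                      # approximately 2
noisy_x = [a + random.gauss(0, 1) for a in true_x]      # add measurement error to x
slope_noisy = ols_slope(noisy_x, y)                     # attenuated toward 0
```

Because the measurement error variance here equals the variance of x, the estimated slope is pulled roughly halfway toward zero, in line with the attenuation factor var(x)/(var(x) + var(ME)).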
After the data are collected, it is common to find unanticipated applications—the serendipity of statistics—wanting more precision.

Large rounding error, ε_R, can add unwelcome complexity to the problem. Suppose that x2 is measured with rounding error, ε_R; then a model involving x2 might look like this:

y = β0 + β1x1 + β2(x2 + ε_R) + ε.

Sampling and Nonsampling Error

The purpose of sampling is to estimate characteristics (mean, variance, etc.) of a population based upon a randomly selected representative subset. The difference between a sample’s estimate and the population’s value is due to two sources of error: sampling error and nonsampling error. Even with perfect execution, there is a limitation on the ability of the partial information contained in the sample to fully estimate population characteristics. This part of the estimation difference is due to sampling error—the minimum discrepancy due to observing a sample instead of the whole population. Nonsampling error explains all remaining sources of error, including nonresponse, selection bias, measurement error (inaccurate response), and so on, that are related to execution.

Sampling error is reduced by improving the sample design or increasing the sample size. Nonsampling error is decreased through better execution.

Randy J. Bartlett

See also Error Rates; Margin of Error; Missing Data, Imputation of; Models; ‘‘Probable Error of a Mean, The’’; Random Error; Residual Plot; Residuals; Root Mean Square Error; Sampling Error; Standard Deviation; Standard Error of Estimate; Standard Error of Measurement; Standard Error of the Mean; Sums of Squares; Systematic Error; Type I Error; Type II Error; Type III Error; Variability, Measure of; Variance; White Noise

Further Readings

Harrell, F. E., Jr. (2001). Regression modeling strategies, with applications to linear models, logistic regression, and survival analysis. New York: Springer.

Heyde, C. C., & Seneta, E. (2001). Statisticians of the centuries. New York: Springer-Verlag.

Kutner, M., Nachtsheim, C., Neter, J., & Li, W. (2005). Applied linear statistical models (5th ed.). Boston: McGraw-Hill Irwin.

Milliken, G., & Johnson, D. (1984). Analysis of messy data: Vol. 1. Designed experiments. Belmont, CA: Wadsworth.

Snedecor, G. (1956). Statistical methods applied to experiments in agriculture and biology (5th ed.). Ames: Iowa State College Press.

Stigler, S. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Belknap Press of Harvard University Press.

Weisberg, S. (1985). Applied linear regression (2nd ed.). New York: John Wiley.

ERROR RATES

In research, error rate takes on different meanings in different contexts, including measurement and inferential statistical analysis. When measuring research participants’ performance using a task with multiple trials, error rate is the proportion of responses that are incorrect. In this manner, error rate can serve as an important dependent variable. In inferential statistics, errors have to do with the probability of making a false inference about the population based on the sample data. Therefore, estimating and managing error rates are crucial to effective quantitative research.

This entry mainly discusses issues involving error rates in measurement. Error rates in statistical analysis are mentioned only briefly because they are covered in more detail under other entries.

Error Rates in Measurement

In a task with objectively correct responses (e.g., a memory task involving recalling whether a stimulus had been presented previously), a participant’s response can be one of three possibilities: no response, a correct response, or an incorrect response (error). Instances of errors across a series of trials are aggregated to yield error rate, ideally in proportional terms. Specifically, the number of errors divided by the number of trials in which one has an opportunity to make a correct response yields the error rate.
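As a concrete illustration of the computation just described (a sketch; the trial outcomes are invented, and both candidate denominators are shown):

```python
# One participant's outcomes across 10 trials; None marks a nonresponse.
responses = ["correct", "error", "correct", None, "error",
             "correct", "correct", None, "correct", "error"]

n_errors = sum(r == "error" for r in responses)

# The choice of denominator depends on the goals of the study:
rate_per_trial = n_errors / len(responses)          # all trials, nonresponses included
answered = [r for r in responses if r is not None]
rate_per_response = n_errors / len(answered)        # responses only
```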
Depending on the goals of the study, researchers may wish to use for the denominator the total number of responses or the total number of trials (including nonresponses, if they are considered relevant). The resulting error rate can then be used to test hypotheses about knowledge or cognitive processes associated with the construct represented by the targets of response.

[Figure. A noise distribution and a signal + noise distribution; the distance between them, divided by the standard deviation, yields d′, and the probability of a hit is marked.]

As the criterion becomes more liberal, both the hit rate and the false alarm rate increase. Bias is sometimes expressed as β, which is defined as the likelihood ratio of the signal distribution to the noise distribution at the criterion (i.e., the ratio of the height of the signal curve to the height of the noise curve at the value of the threshold) and is equal to e^(d′·C). The value would be greater than 1 when the perceiver is conservative and less than 1 when liberal. Sensitivity and bias can be estimated using a normal distribution function from hit rate and false alarm rate; because the two rates are independent of each other, it is necessary to obtain both of them from data.

Process Dissociation Procedure

Process Dissociation Procedure (PDP) is a method that uses error rates to estimate the separate contributions of controlled (intentional) and automatic (unintentional) processes in responses. In tasks involving cognitive processes, participants will consciously (intentionally) strive to make correct responses. But at the same time, there may also be influences of automatic processes that are beyond conscious awareness or control. Using PDP, the researcher can estimate the independent influences of controlled and automatic processes from error rates.

The influences of controlled and automatic processes may work hand in hand or in opposite directions. For example, in a typical Stroop color naming task, participants are presented with color words (e.g., ‘‘red’’) and instructed to name the colors of the words’ lettering, which are either consistent (e.g., red) or inconsistent (e.g., green) with the words. In certain trials, the response elicited by the automatic process is the correct response as defined in the task. For example, when the stimulus is the word ‘‘red’’ in red lettering, the automatic process (to read the word) will elicit the response ‘‘red,’’ which is the same as the one dictated by the controlled process (to name the color of the lettering). Such trials are called congruent trials, because controlled and automatic processes elicit the same response. In other trials, the response elicited by the automatic process is not the response required in the task. In our example, when the word ‘‘green’’ is presented in red lettering, the automatic process will elicit the response ‘‘green.’’ In this case, the directions of the influences that controlled and automatic processes have on the response are opposite to each other; these trials are called incongruent trials. The goal of PDP is to estimate the probabilities that controlled and automatic processes affect responses.

Other Issues With Error Rates in Measurement

Speed-Accuracy Trade-Off

In tasks measuring facility of judgments (e.g., categorizing words or images), either error rate or response latency can be used as the basis of analysis. If the researcher wants to use error rates, it is desirable to have time pressure in the task in order to increase error rates, so that larger variability in error rate can be obtained. Without time pressure, in many tasks, participants will make mostly accurate responses, and it will be hard to discern meaningful variability in response facility.

Problems Involving High Error Rates

When something other than error rate is measured (e.g., response latency), error rate may be high for some or all participants. As a consequence, there may be too few valid responses to use in analysis. To address this issue, if only a few participants have error rates higher than a set criterion (ideally discerned by a discontinuity in the frequency distribution of error rates), the researcher may remove his or her data. However, if it is a more prevalent trend, the task may be too difficult (in which case, the task may have to be made easier) or inappropriate (in which case, the researcher should think of a better way to measure the construct).

Error Rates as a Source of Error Variance

In measures with no objectively correct or incorrect responses (e.g., Likert-scale ratings of attitudes or opinions), errors can be thought of as the magnitude of inaccuracy of the measurement. For example, when the wording of questionnaire items or the scale of a rating trial is confusing, or when some of the participants have response sets (e.g., a tendency to arbitrarily favor a particular response option), the responses may not accurately reflect what is meant to be measured.
If error variance caused by peculiarities of the measurement or of some participants is considerable, the reliability of the measurement is dubious. Therefore, the researcher should strive to minimize these kinds of errors, and check the response patterns within and across participants to see if there are any nontrivial, systematic trends not intended.

Errors in Statistical Inference

In statistical inference, the concept of error rates is used in null hypothesis significance testing (NHST) to make judgments of how probable a result is in a given population. Proper NHST is designed to minimize the rates of two types of errors: Type II and, particularly, Type I errors.

A Type I error (false positive) occurs when a rejected null hypothesis is correct (i.e., an effect is inferred when, in fact, there is none). The probability of a Type I error is represented by the p value, which is assessed relative to an a priori criterion, α. The conventional criterion is α = .05; that is, when the probability of a Type I error (p) is less than .05, the result is considered ‘‘statistically significant.’’ Recently, there has been a growing tendency to report the exact value of p rather than merely stating whether it is less than α. Furthermore, researchers are increasingly reporting effect size estimates and confidence intervals so that there is less reliance on a somewhat arbitrary, dichotomous decision based on the .05 criterion.

A Type II error (false negative) occurs when a retained null hypothesis is incorrect (i.e., no effect is inferred when, in fact, there is one). An attempt to decrease the Type II error rate (by being more liberal in saying there is an effect) also increases the Type I error rate, so one has to compromise.

Further Readings

Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization. New York: Oxford University Press.

Sanders, A. F. (1998). Elements of human performance: Reaction processes and attention in human skill. Mahwah, NJ: Lawrence Erlbaum.

Wickens, T. D. (2001). Elementary signal detection theory. New York: Oxford University Press.

ESTIMATION

Estimation is the process of providing a numerical value for an unknown quantity based on information collected from a sample. If a single value is calculated for the unknown quantity, the process is called point estimation. If an interval is calculated that is likely, in some sense, to contain the quantity, then the procedure is called interval estimation, and the interval is referred to as a confidence interval. Estimation is thus the statistical term for an everyday activity: making an educated guess about a quantity that is unknown based on known information. The unknown quantities, which are called parameters, may be familiar population quantities such as the population mean μ, population variance σ², and population proportion π. For instance, a researcher may be interested in the proportion of voters favoring a political party. That proportion is the unknown parameter, and its estimation may be based on a small random sample of individuals. In other situations, the parameters are part of more elaborate statistical models, such as the regression coefficients β0, β1, …, βp in a linear regression model

y = β0 + β1x1 + · · · + βpxp + ε.
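The distinction between point and interval estimation can be sketched as follows (the data are invented, and a normal-approximation interval is used for brevity where a t interval would be more precise for so small a sample):

```python
import math
from statistics import NormalDist, mean, stdev

sample = [12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 10.9]   # hypothetical measurements

point_estimate = mean(sample)        # point estimation: a single number for mu

z = NormalDist().inv_cdf(0.975)      # 95% coverage
half_width = z * stdev(sample) / math.sqrt(len(sample))
interval = (point_estimate - half_width, point_estimate + half_width)
```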
Here the parameter θ is taken to be a scalar, but the results below extend to the case that θ = (θ1, θ2, …, θk) with k > 1.

To estimate θ or, more generally, a real-valued function of θ, τ(θ), one calculates a corresponding function of the observations, a statistic, δ = δ(X1, X2, …, Xn). An estimator is any statistic δ defined over the sample space. Of course, it is hoped that δ will tend to be close, in some sense, to the unknown τ(θ), but such a requirement is not part of the formal definition of an estimator. The value δ(x1, x2, …, xn) taken on by δ in a particular case is the estimate of τ(θ), which will be our educated guess for the unknown value. In practice, the compact notation δ̂ is often used for both estimator and estimate.

The theory of point estimation can be divided into two parts. The first part is concerned with methods for finding estimators, and the second part is concerned with evaluating these estimators. Often, the methods of evaluating estimators will suggest new estimators. In many cases, there will be an obvious choice for an estimator of a particular parameter. For example, the sample mean is a natural candidate for estimating the population mean; the median is sometimes proposed as an alternative. In more complicated settings, however, a more systematic way of finding estimators is needed.

Methods of Finding Estimators

The formulation of the estimation problem in a concrete situation requires specification of the probability model, P, that generates the data. The model P is assumed to be known up to an unknown parameter θ, and P = Pθ is written to express this dependence. The observations x = (x1, x2, …, xn) are postulated to be the values taken on by the random observable X = (X1, X2, …, Xn) with distribution Pθ. Frequently, it will be reasonable to assume that each of the Xi s has the same distribution, and that the variables X1, X2, …, Xn are independent. This situation is called the independent, identically distributed (i.i.d.) case in the literature and allows for a considerable simplification in our model.

There are several general-purpose techniques for deriving estimators, including methods based on moments, least squares, maximum likelihood, and Bayesian approaches. The method of moments is based on matching population and sample moments, and solving for the unknown parameters. Least-squares estimators are obtained, particularly in regression analysis, by minimizing a (possibly weighted) difference between the observed response and the value predicted by the model.

The method of maximum likelihood is the most popular technique for deriving estimators. Considered for fixed x = (x1, x2, …, xn) as a function of θ, the joint probability density (or probability) pθ(x) = pθ(x1, …, xn) is called the likelihood of θ, and the value θ̂ = θ̂(X) of θ that maximizes pθ(x) constitutes the maximum likelihood estimator (MLE) of θ. The MLE of a function τ(θ) is defined to be τ(θ̂).

In Bayesian analysis, a distribution π(θ), called a prior distribution, is introduced for the parameter θ, which is now considered a random quantity. The prior is a subjective distribution, based on the experimenter’s belief about θ, prior to seeing the data. The joint probability density (or probability function) of X now represents the conditional distribution of X given θ, and is written p(x | θ). The conditional distribution of θ given the data x is called the posterior distribution of θ, and by Bayes’s theorem, it is given by

π(θ | x) = π(θ)p(x | θ)/m(x),  (1)

where m(x) is the marginal distribution of X, that is, m(x) = ∫ π(θ)p(x | θ) dθ. The posterior distribution, which combines prior information and information in the data, is now used to make statements about θ. For instance, the mean or median of the posterior distribution can be used as a point estimate of θ. The resulting estimators are called Bayes estimators.

Example

Suppose X1, X2, …, Xn are i.i.d. Bernoulli random variables, which take the value 1 with probability θ and 0 with probability 1 − θ. A Bernoulli process results, for example, from conducting a survey to estimate the unemployment rate, θ. In this context, the value 1 denotes that the responder was unemployed. The first moment (mean) of the distribution is θ, and the likelihood function is given by

pθ(x1, …, xn) = θ^y (1 − θ)^(n−y),  0 ≤ θ ≤ 1,

where y = Σ xi.
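The likelihood just defined can also be maximized numerically; this sketch checks a simple grid maximizer against the closed-form estimate y/n (the counts are invented):

```python
def bernoulli_likelihood(theta, y, n):
    """p_theta(x) = theta**y * (1 - theta)**(n - y)."""
    return theta ** y * (1 - theta) ** (n - y)

y, n = 7, 10                                  # 7 successes in 10 trials (invented)

# Maximize over a fine grid of theta values in [0, 1]:
grid = [i / 1000 for i in range(1001)]
theta_mle = max(grid, key=lambda t: bernoulli_likelihood(t, y, n))
# The grid maximizer lands on the frequency-based estimate y/n = 0.7.
```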
The method of moments and maximum-likelihood estimates of θ are both θ̂ = y/n, that is, the intuitive frequency-based estimate for the probability of success given y successes in n trials. For a Bayesian analysis, if the prior distribution for the parameter θ is a Beta distribution, Beta(α, β),

π(θ) ∝ θ^(α−1) (1 − θ)^(β−1),  α, β > 0,

the posterior distribution for θ, from (1), is

π(θ | x) ∝ θ^(α+y−1) (1 − θ)^(n+β−y−1).

The posterior distribution is also a Beta distribution, θ | x ~ Beta(α + y, n + β − y), and a Bayes estimate, based on, for example, the posterior mean, is θ̂ = (α + y)/(α + β + n).

Methods of Evaluating Estimators

For any given unknown parameter, there are, in general, many possible estimators, and methods to distinguish between good and poor estimators are needed. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory. The error in using the observable θ̂ = θ̂(X) to estimate the unknown θ is ε̂ = θ̂ − θ. This error forms the basis for assessing the performance of an estimator. A commonly used finite-sample measure of performance is the mean squared error (MSE). The MSE of an estimator θ̂ of a parameter θ is the function of θ defined by E[θ̂(X) − θ]², where E(·) denotes the expected value of the expression in brackets. The advantage of the MSE is that it can be decomposed into a systematic error represented by the square of the bias, B(θ̂, θ) = E[θ̂(X)] − θ, and the intrinsic variability represented by the variance, V(θ̂, θ) = var θ̂(X). Thus,

E[θ̂ − θ]² = B²(θ̂, θ) + V(θ̂, θ).

An estimator whose bias B(θ̂, θ) = 0 is called unbiased and satisfies E[θ̂(X)] = θ for all θ, so that, on average, it will estimate the right value. For unbiased estimators, the MSE reduces to the variance of θ̂.

The property of unbiasedness is an attractive one, and much research has been devoted to the study of unbiased estimators. For a large class of problems, it turns out that among all unbiased estimators, there exists one that uniformly minimizes the variance for all values of the unknown parameter, and which is therefore uniformly minimum variance unbiased (UMVU). Furthermore, one can specify a lower bound on the variance of any unbiased estimator of θ, which can sometimes be attained. The result is the following version of the information inequality:

var θ̂(X) ≥ 1/I(θ),  (2)

where

I(θ) = E{[∂/∂θ log pθ(X)]²}  (3)

is the information (or Fisher information) that X contains about θ. The bound can be used to obtain the (absolute) efficiency of an unbiased estimator θ̂ of θ. This is defined as

e(θ̂) = (1/I(θ)) / V(θ̂, θ).

By Equation 2, the efficiency is bounded above by unity; when e(θ̂) = 1 for all θ, θ̂ is said to be efficient. Thus, an efficient estimator, if it exists, is the UMVU, but the UMVU is not necessarily efficient. In practice, there is no universal method for deriving UMVU estimators, but there are, instead, a variety of techniques that can sometimes be applied.

Interestingly, unbiasedness is not essential, and a restriction to the class of unbiased estimators may rule out some very good estimators, including maximum likelihood. It is sometimes the case, for example, that a trade-off occurs between variance and bias in such a way that a small increase in bias can be traded for a larger decrease in variance, resulting in an improvement in MSE. In addition, finding a best unbiased estimator is not straightforward. For instance, UMVU estimators, or even any unbiased estimator, may not exist for a given τ(θ); or the bound in Equation 2 may not be attainable, and one then has to decide if one’s candidate for best unbiased estimator is, in fact, optimal.
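The decomposition MSE = B² + V, and the bias-variance trade-off just described, can be checked by simulation (a sketch; the true θ, the shrinkage estimator (y + 1)/(n + 2), and the simulation sizes are arbitrary choices, not taken from the entry):

```python
import random

random.seed(7)
theta, n, reps = 0.6, 20, 20000

# Two estimators of a Bernoulli parameter: the unbiased estimate y/n,
# and a slightly biased shrinkage estimate (y + 1)/(n + 2).
est_unbiased, est_shrunk = [], []
for _ in range(reps):
    y = sum(random.random() < theta for _ in range(n))
    est_unbiased.append(y / n)
    est_shrunk.append((y + 1) / (n + 2))

def mse(estimates, truth):
    m = sum(estimates) / len(estimates)
    bias_sq = (m - truth) ** 2                               # B^2
    var = sum((e - m) ** 2 for e in estimates) / len(estimates)  # V
    return bias_sq + var                                     # MSE = B^2 + V

mse_unbiased = mse(est_unbiased, theta)   # about theta*(1 - theta)/n = 0.012
mse_shrunk = mse(est_shrunk, theta)       # smaller: bias traded for variance
```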
Therefore, there is scope to consider other criteria also, and possibilities include equivariance, minimaxity, and robustness.

In many cases in practice, estimation is performed using a set of independent, identically distributed observations. In such cases, it is of interest to determine the behavior of a given estimator as the number of observations increases to infinity (i.e., asymptotically). The advantage of asymptotic evaluations is that calculations simplify and it is also more clear how to measure estimator performance. Asymptotic properties concern a sequence of estimators indexed by n, θ̂n, obtained by performing the same estimation procedure for each sample size. For example, X̄1 = X1, X̄2 = (X1 + X2)/2, X̄3 = (X1 + X2 + X3)/3, and so forth. A sequence of estimators θ̂n is said to be asymptotically optimal for θ if it exhibits the following characteristics:

of τ(θ). There are other asymptotically optimal estimators, such as Bayes estimators. The method of moments estimator is not, in general, asymptotically optimal but has the virtue of being quite simple to use.

In most practical situations, it is possible to consider the use of several different estimators for the unknown parameters. It is generally good advice to use various alternative estimation methods in such situations, these methods hopefully resulting in similar parameter estimates. If a single estimate is needed, it is best to rely on a method that possesses good statistical properties, such as maximum likelihood.

Panagiotis Besbeas

See also Accuracy in Parameter Estimation; Confidence Intervals; Inference: Deductive and Inductive; Least Squares, Methods of; Root Mean Square Error; Unbiased Estimator
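Continuing the Bernoulli example from this entry, the estimators surveyed above can be compared on the same data (a sketch; the counts and the prior parameters are invented):

```python
y, n = 4, 10            # observed successes and trials
alpha, beta = 2, 2      # Beta prior parameters

mom_and_mle = y / n                           # method of moments = MLE = y/n
bayes = (alpha + y) / (alpha + beta + n)      # posterior mean of Beta(alpha + y, n + beta - y)

# With more data, the prior's influence fades and the estimates converge:
y_big, n_big = 400, 1000
bayes_big = (alpha + y_big) / (alpha + beta + n_big)
```

The Bayes estimate shrinks the frequency-based estimate toward the prior mean; as n grows, the shrinkage becomes negligible.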
Researchers need to be aware of eta-squared’s limitations, which include an overestimation of population effects and its sensitivity to design features that influence its relevance and interpretability. Nonetheless, many social scientists advocate for the reporting of the eta-squared statistic, in addition to reporting statistical significance.

This entry focuses on defining, calculating, and interpreting eta-squared values, and will discuss the advantages and disadvantages of its use. The entry concludes with a discussion of the literature regarding the inclusion of eta-squared values as a measure of effect size in the reporting of statistical results.

Defining Eta-Squared

Eta-squared (η²) is a common measure of effect size used in t tests as well as univariate and multivariate analysis of variance (ANOVA and MANOVA, respectively). An eta-squared value reflects the strength or magnitude related to a main or interaction effect. Eta-squared quantifies the percentage of variance in the dependent variable (Y) that is explained by one or more independent variables (X). This effect tells the researcher what percentage of the variability in participants’ individual differences on the dependent variable can be explained by the group or cell membership of the participants. This statistic is analogous to r-squared values in bivariate correlation (r²) and regression analysis (R²). Eta-squared is considered an additive measure of the unique variation in a dependent variable, such that nonerror variation is not accounted for by other factors in the analysis.

Interpreting the Size of Effects

The value of η² is interpretable only if the F ratio for a particular effect is statistically significant. Without a significant F ratio, the eta-squared value is essentially zero and the effect does not account for any significant proportion of the total variance. Furthermore, some researchers have suggested cutoff values for interpreting eta-squared values in terms of the magnitude of the association between the independent and dependent measures. Generally, assuming a moderate sample size, eta-squared values of .09, .14, and .22 or greater could be described in the behavioral sciences as small, medium, and large. This index of the strength of association between variables has been referred to as practical significance. Determination of the size of effect based on an eta-squared value is largely a function of the variables under investigation. In behavioral science, large effects may be a relative term.

Partial eta-squared (η²p), a second estimate of effect size, is the ratio of variance due to an effect to the sum of the error variance and the effect variance. In a one-way ANOVA design that has just one factor, the eta-squared and partial eta-squared values are the same. Typically, partial eta-squared values are greater than eta-squared estimates, and this difference becomes more pronounced with the addition of independent factors to the design. Some critics have argued that researchers incorrectly use these statistics interchangeably. Generally, η² is preferred to η²p for ease of interpretation.

Calculating Eta-Squared

Statistical software programs, such as IBM SPSS (PASW) 18.0 (an IBM company, formerly called PASW Statistics) and SAS, provide only the partial eta-squared values in the output, and not the eta-squared values. However, these programs provide the necessary values for the calculation of the eta-squared statistic. Using information provided in the ANOVA summary table in the output, eta-squared can be calculated as follows:

η² = SSeffect / SStotal.

When using between-subjects and within-subjects designs, the total sum of squares (SStotal) in the ratio represents the total variance. Likewise, the sum of squares of the effect (which can be for a main effect or an interaction effect) represents the variance attributable to the effect. Eta-squared is the decimal value of the ratio and is interpreted as a percentage. For example, if SStotal = 800 and SSA = 160 (the sum of squares of the main effect of A), the ratio would be .20. Therefore, the interpretation of the eta-squared value would be that the main effect of A explains 20% of the total variance of the dependent variable. Likewise, if SStotal = 800 and SSA×B = 40 (the sum of squares of the interaction between A and B), the ratio would be .05.
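The worked example above can be reproduced in a few lines (a sketch using the SS values from the text):

```python
def eta_squared(ss_effect, ss_total):
    """eta^2 = SS_effect / SS_total, the proportion of total variance explained."""
    return ss_effect / ss_total

ss_total = 800
eta_sq_A = eta_squared(160, ss_total)      # main effect of A: .20, i.e., 20%
eta_sq_AxB = eta_squared(40, ss_total)     # A x B interaction: .05, i.e., 5%
```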
424 Eta-Squared
explanation would be that the interaction between A and B accounts for 5% of the total variance of the dependent variable.

Mixed Factorial Designs

When using a mixed-design ANOVA, or a design that combines both between- and within-subject effects (e.g., pre- and posttest designs), researchers have differing opinions regarding whether the denominator should be the SS_total when calculating the eta-squared statistic. An alternative option is to use the between-subjects variance (SS_between subjects) and within-subjects variance (SS_within subjects), separately, as the denominator to assess the strength of the between-subjects and within-subjects effects, respectively. Accordingly, when considering such effects separately, eta-squared values are calculated using the following formulas:

η² = SS_A / SS_within subjects,

η² = SS_B / SS_between subjects, and

η² = SS_A×B / SS_within subjects.

When using SS_between subjects and SS_within subjects as separate denominators, calculated percentages are generally larger than when using SS_total as the denominator in the ratio. Regardless of the approach used to calculate eta-squared, it is important to clearly interpret the eta-squared statistics for statistically significant between-subjects and within-subjects effects, respectively.

Strengths and Weaknesses

Descriptive Measure of Association

Eta-squared is a descriptive measure of the strength of association between independent and dependent variables in the sample. A benefit of the eta-squared statistic is that it permits researchers to descriptively understand how the variables in their sample are behaving. Specifically, the eta-squared statistic describes the amount of variation in the dependent variable that is shared with the grouping variable for a particular sample. Thus, because eta-squared is sample-specific, one disadvantage of eta-squared is that it may overestimate the strength of the effect in the population, especially when the sample size is small. To overcome this upwardly biased estimation, researchers often calculate an omega-squared (ω²) statistic, which produces a more conservative estimate. Omega-squared is an estimate of the dependent variable population variability accounted for by the independent variable.

Design Considerations

In addition to the issue of positive bias in population effects, research design considerations may also pose a challenge to the use of the eta-squared statistic. In particular, studies that employ a multifactor completely randomized design should employ alternative statistics, such as partial eta- and omega-squared. In multifactor designs, partial eta-squared may be a preferable statistic when researchers are interested in comparing the strength of association between an independent and a dependent variable that excludes variance from other factors, or when researchers want to compare the strength of association between the same independent and dependent measures across studies with distinct factorial designs. The strength of effects also can be influenced by the levels chosen for independent variables. For example, if researchers are interested in describing individual differences among participants but include only extreme groups, the strength of association is likely to be positively biased. Conversely, using a clinical research trial as an example, failure to include an untreated control group in the design might underestimate the eta-squared value. Finally, attention to distinctions between random and fixed effects, and the recognition of nested factors in multifactor ANOVA designs, is critical to the accurate use, interpretation, and reporting of statistics that measure the strength of association between independent and dependent variables.
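The measures discussed above are simple ratios of ANOVA sums of squares, so they can be sketched in a few lines of code. The following is an illustration only, not code from this entry; the sums of squares, degrees of freedom, and resulting values are hypothetical numbers chosen to show the arithmetic:

```python
# Effect-size measures computed from ANOVA sums of squares.
# All numbers in the demo below are hypothetical.

def eta_squared(ss_effect, ss_denominator):
    """Eta-squared: SS for an effect over a chosen denominator
    (SS_total, or SS_between/SS_within subjects in a mixed design)."""
    return ss_effect / ss_denominator

def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared: excludes variance attributable to other factors."""
    return ss_effect / (ss_effect + ss_error)

def omega_squared(ss_effect, df_effect, ss_total, ms_error):
    """Omega-squared: a more conservative population estimate."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Hypothetical one-way ANOVA: SS_effect = 10, SS_error = 90,
# SS_total = 100, df_effect = 2, df_error = 27.
ms_error = 90 / 27
print(eta_squared(10, 100))                           # 0.1
print(partial_eta_squared(10, 90))                    # 0.1 (equal here: only one factor)
print(round(omega_squared(10, 2, 100, ms_error), 4))  # 0.0323
```

With a single factor, partial eta-squared coincides with eta-squared; omega-squared comes out smaller, consistent with its role as the more conservative population estimate described above.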
Reporting Effect Size and Statistical Significance

Social science research has been dominated by a reliance on significance testing, which is not particularly robust to small (N < 50) or large (N > 400) sample sizes. More recently, some journal publishers have adopted policies that require the reporting of effect sizes in addition to reporting statistical significance (p values). In 2001, the American Psychological Association strongly encouraged researchers to include an index of effect size or strength of association between variables when reporting study results. Social scientists who advocate for the reporting of effect sizes argue that these statistics facilitate the evaluation of how a study's results fit into existing literature, in terms of how similar or dissimilar results are across related studies and whether certain design features or variables contribute to similarities or differences in effects. Effect size comparisons using eta-squared cannot be made across studies that differ in the populations they sampled (e.g., college students vs. elderly individuals) or in terms of controlling relevant characteristics of the experimental setting (e.g., time of day, temperature). Despite the encouragement to include strength of effects and significance testing, progress has been slow, largely because effect size computations, until recently, were not readily available in statistical software packages.

Kristen Fay and Michelle J. Boyd

See also Analysis of Variance (ANOVA); Effect Size, Measures of; Omega Squared; Partial Eta-Squared; R2; Single-Subject Design; Within-Subjects Design

Further Readings

Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107–112.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 363–368.
Maxwell, S. E., & Delaney, H. D. (2000). Designing experiments and analyzing data. Mahwah, NJ: Lawrence Erlbaum.
Meyers, L. G., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. London: Sage.
Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241–286.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434–447.
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64, 916–924.
Thompson, B. (2002). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80(1), 64–71.

ETHICS IN THE RESEARCH PROCESS

In the human sciences, ethical concerns are felt at the level of the practicing scientist and are the focus of scholarly attention in the field of research ethics. Most of the ethical issues have to do with the scientist's obligations and the limits on permissible scientific activity. Perspectives on these issues are informed by ideas drawn from a variety of intellectual traditions, including philosophical, legal, and religious. Political views and cultural values also influence the interpretation of researcher conduct. Ethical questions about scientific activity were once considered external to the research endeavor, but today, it is taken for granted that researchers will reflect on the decisions that they make when designing a study and the ethical ramifications that their work might have. Scientists are also expected to engage in dialogue on topics that range from the controversial, such as the choice to study intelligence or conduct HIV trials, to the procedural, such as whether research volunteers are entitled to payment for their services.

Key Themes in Research Ethics

Nearly any decision that scientists make can have ethical implications, but the questions most often addressed under the broad heading of Research Ethics can be grouped as follows: (a) guidelines and oversight, (b) autonomy and informed
consent, (c) standards and relativism, (d) conflicts of interest, and (e) the art of ethical judgment. There is no exhaustive list of ethical problems, because what constitutes an ethical problem for researchers is determined by a number of factors, including current fashions in research (not always in the human sciences) and such things as the prevailing political climate. Hence, a decision that a researcher makes might be regarded as controversial for several reasons, including a general sense that the action is out of step with the greater good. It is also common for decisions to be controversial because they are deemed to be contrary to the values that a particular scientific association promotes.

Legislation often piggybacks on such sentiments, with a close connection between views on what is ethical and what should (or should not) be enforced by law. In many countries, governmental panels weigh in on issues in research ethics. There, too, however, the categories of inquiry are fluid, with the panelists drawing on social, economic, and other considerations. The amorphous nature of the ethical deliberation that researchers might be party to thus results from there being so few absolutes in science or ethics. Most of the questions that researchers confront about the design of a study or proper conduct can readily be set against reasonable counter-questions. This does not rule out ethical distinctions, however. Just as the evaluation of scientific findings requires a combination of interpretive finesse and seasoned reflection, moral judgment requires the ability to critically evaluate supporting arguments.

Guidelines and Oversight

Current codes of ethics have their origin in the aftermath of World War II, when interest in formal guidelines and oversight bodies first arose. News of the atrocities committed by Nazi researchers highlighted the fact that, with no consistent standards for scientific conduct, judgments about methods or approach were left to each researcher's discretion. The unscrupulous scientist was free to conduct a study simply to see what might happen, for example, or to conscript research "volunteers." The authors of the Nuremberg Code and, later, the Helsinki Declaration changed this by constructing a zone of protection out of several long-standing principles in law and medicine. In particular, from medical practice came injunctions like the doctor's "Do no harm." And from jurisprudence came the notion that nonconsensual touching can amount to assault, and that those so treated might have valid claims for restitution.

A common theme in these early codes was that research prerogatives must be secondary to the dignity and overall welfare of the humans under study. In the late 1970s, the Belmont Report expanded on this with its recommendation that a system of institutional review boards (IRBs) should ensure compliance with standards. The IRBs were also to be a liaison between researchers and anyone recruited by them. In addition, Belmont popularized a framework that researchers and scholars could use when discussing ethical issues. At first, the framework comprised moral principles like beneficence, justice, nonmaleficence, and respect for autonomy. In recent years, commentators have supplemented it with ideas from care ethics, casuistry, political theory, and other schools of thought.

Now commonplace in the lives of scientists, codes of ethics were once dismissed as mere attempts to "legislate morality." Skeptics also warned that scholarly debate about ethics would have little to offer researchers. In retrospect, it is plain that any lines between law and morality were blurred long before there was serious interest in codes of ethics. Not only that, rules and abstractions pertaining to ethics have always had to meet basic tests of relevance and practicality. In this process, scientists are not left out of the deliberations; they play an active role in helping to scrutinize the codes. Today, those codes are malleable artifacts, registers of current opinions about ethical values in science. They are also widely available online, usually with accompanying discussions of related ethical issues.

Autonomy and Informed Consent

In human research, one group, made up of scientists, if not entire disciplines, singles out another group for study. Because this selection is almost never random, those in the latter group will usually know much less about the research and why they were selected. This can create a significant imbalance of power that can place the subjects
(also called recruits, patients, participants, or informants) in a subordinate position. The prospect that this will also leave the subjects vulnerable raises a number of ethical and legal issues to which researchers must respond when designing their studies.

Informed consent rules are the most common response to this problem of vulnerability. The specific details vary, but these rules usually call on researchers to provide a clear account of what might be in store for anyone who might serve as a subject. In most cases, the rules have a strong legalistic strain, in that the subjects contract to serve without pressure from the researchers, and with the option of changing their minds later. Such stipulations have been central to ethics codes from the outset, and in the review of research protocols.

Still, as informed consent rules are applied, their ethical significance rests with how researchers choose to define this operative concept. There are questions about how much or how little someone is consenting to when agreeing to participate, for example. There are questions about whether consent also gives the researcher license to disseminate study results in a particular manner. As might be expected, there are also differences of opinion on what it means to "inform" the subjects, and whether someone who is deliberately mis- or uninformed is thereby denied autonomy. What motivates these disagreements are the legitimate moral concerns about the degree of protection that is needed in some research, and the practicality of consent guidelines.

Whereas some scholars would have researchers err on the side of caution with very strict consent guidelines, others maintain that consent requirements can make certain types of research impossible or much less naturalistic. They argue that conditions worthy of study can be fleeting or sensitive enough that researchers have little time to negotiate the subjects' participation. Observational research into crowd behavior or the elite in government would be examples of studies that might be compromised if researchers were required to distribute informed consent paperwork first. In clinical trials, scientists might wonder how they should understand consent when their patients may be unable to truly understand the research.

There is much interest in developing research designs that can help the subjects feel more like collaborators and less like "guinea pigs," even without the ritualistic process of seeking consent. But commentators tend to agree that if consent rules are softened, there should be a presumption of unusually important benefits and minimal risks. As that demand is hard to meet, the trend is toward involving only bona fide volunteers. This preference for rather restrictive guidelines reflects apprehension about the legal consequences of involving people in research against their will, as well as a desire to avoid situations where a lack of consent might be a prelude to serious harm.

Standards and Relativism

One of the oldest philosophical questions asks whether there is one moral truth or many truths. This question is of particular concern for researchers when there is what seems to be a "gray area" in ethical guidelines, that is, when a code of ethics does not appear to offer clear, explicit recommendations. In those settings, researchers ordinarily must decide which course represents the best compromise between their objectives and the interests of their subjects. Some commentators see the problem of ethical relativism in dilemmas like these. And although not everyone accepts the label of "relativism," with some preferring to speak of pluralism or even "flexibility" instead, most agree that the underlying problem bears on decisions about the design and conduct of research.

Research has never been governed by one overarching ethical standard that is complete in itself. Ethical standards have usually accommodated differences in methodological orientation, perceived levels of risk, and other criteria. Unfortunately, such accommodation has never been straightforward, and it is even less so now that disciplinary boundaries are shifting and traditional methods are being applied in novel ways. Psychologists now delve into what used to be considered medical research, historians work alongside public health specialists, investigative journalism bears similarities to undercover fieldwork, and some clinical studies rely on the analysis of patient narratives. Because of this, to recommend that anthropologists merely refer to the code of ethics for their discipline is to offer advice of very limited value.
and others are treated on an ongoing basis as part of the research process itself.

It is helpful to remember that in the days of the "Gentleman Scientist," there was little need to worry about conflicts of interest. Research was ethical if it conformed to a tacitly understood model of the humanistic intellectual. A similar model of virtue is still needed, yet today, researchers face conflicts of interest not covered by any code, written or otherwise, and they engage in science for any number of reasons. There are also good reasons to think that conflicts of interest can serve as test cases for the values that researchers are expected to support. Under that interpretation, an important benefit of the researcher's having to grapple with conflicts of interest would be the continual reexamination of such things as the role of business or government in science, or the value of knowledge for its own sake, aside from its practical applications.

The Art of Ethical Judgment

Often, the researcher's most immediate ethical concern is the need to comply with institutional standards. This compliance is usually obtained by submitting a protocol in accordance with the various guidelines applicable to the type of research involved. This process can lend an administrative, official stamp to ethical assessment, but it can also obscure the wide range of values that are in play. In particular, critics charge that researchers too often feel pressured to look upon this assessment as something that can be reduced to a ledger of anticipated risks and benefits.

Critics also object that the review process rarely includes a provision for checking whether the risk-benefit forecast proves accurate. Even if researchers did express interest in such verification, few mechanisms would enable it. The use of animals in research, as in a toxicity study, is said to illustrate some of these problems. Although risks and benefits are clearly at issue, training programs usually provide little advice on how researchers are to compare the risks to the animals against the benefits that patients are thought to gain from it.

Concerns like these are of first importance, as scientists are taught early in their careers that protocols must be presented with assurances that a study is safe and in the public's interest. It is also very common to criticize a clinical trial by alleging that the patients might not be fully apprised of the risks. Researchers who would design studies of genetic manipulation are asked to consider the damage that might result. And a prominent objection against including too many identifying details in a published ethnography is that the researcher will be unable to control the harm that this could cause. Where there are doubts about this language and the assessment that would declare one study ethical and another forbidden, there are questions about how well subjects are being protected or how research is being designed.

Most scholars would grant that researchers must do more than offer a cost-benefit analysis for their protocols. But this concession can still leave researchers without a clear sense of what that something else should be. For instance, textbooks on research design commonly recommend that researchers be honest and forthcoming with their subjects, aside from how this might affect the breakdown of anticipated risks. In practice, however, researchers ordinarily do not deal with only one or two moral principles. And even if this were not so, it can be unreasonable to ask that researchers give an account of how the risks from a loss of privacy in their subject population can be weighed against the benefits that a marginalized population might gain from a particular study.

In short, what is needed is an ability to identify variables that are either ignored or overemphasized in current assessment strategies. It is natural to turn to researchers to help refine that search. This is not asking that researchers develop moral wisdom. Researchers are, rather, the most qualified to devise improved methods of ethical assessment. They are also positioned best to bring any current deficiencies in those methods to the attention of fellow scholars. Needless to say, researchers have very practical reasons to improve their ability to justify their work: Society is unlikely to stop asking for an accounting of benefits and risks. And the emphasis on outcomes, whether risks, benefits, or some other set of parameters, is consistent with the priority usually given to empiricism in science. In other words, where there is even the possibility that researchers are unable to adequately gauge the effects of their work, there will be an impression that scientists are accepting significant shortcomings in the way that a study is deemed
a success or failure. That perception scientists cannot afford, so it will not do to fall back on the position that ambiguity about ethical values is simply the price that researchers must pay. More sensible is to enlist researchers in the search for ways to understand how the design of a study affects humans, animals, and the environment.

Further Readings

Plomer, A. (2005). The law and ethics of medical research: International bioethics and human rights. London: Routledge.
van den Hoonaard, W. C. (Ed.). (2002). Walking the tightrope: Ethical issues for qualitative researchers. Toronto, Ontario, Canada: University of Toronto Press.

ETHNOGRAPHY
spectrum of contemporary theoretical frameworks affords a broad range of perspectives such as semiotics, poststructuralism, deconstructionist hermeneutics, postmodern, and feminist. Likewise, a range of genres of contemporary ethnographies has evolved in contrast to traditional ethnographies: autoethnography, critical ethnography, ethnodrama, ethnopoetics, and ethnofiction. Just as ethnography has evolved, the question of whether ethnography is doable across disciplines has evolved into how ethnographic methods might enhance understanding of the discipline-specific research problem being examined.

Fieldwork

An expectation of ethnography is that the ethnographer goes into the field to collect his or her own data rather than rely on data collected by others. To conduct ethnography is to do fieldwork. Throughout the evolution of ethnography, fieldwork persists as the sine qua non. Fieldwork provides the ethnographer with a firsthand cultural/social experience that cannot be gained otherwise. Cultural/social immersion is irreplaceable for providing a way of seeing. In the repetitive act of immersing and removing oneself from a setting, the ethnographer can move between making up-close observations and then taking a distant contemplative perspective in a deliberate effort to understand the culture or social setting intellectually. Fieldwork provides a mechanism for learning the meanings that members are using to organize their behavior and interpret their experience.

Across discipline approaches, three fieldwork methods define and distinguish ethnographic research. Participant observation, representative interviewing, and archival strategies, or what Harry Wolcott calls experiencing, enquiring, and examining, are hallmark ethnographic methods. Therein, ethnographic research is renowned for the triangulation of methods, for engaging multiple ways of knowing. Ethnography is not a reductionist method, focused on reducing data into a few significant findings. Rather, ethnography employs multiple methods to flesh out the complexities of a setting. The specific methods used are determined by the research questions and the setting being explored. In this vein, Charles Frake advanced that the ethnographer seeks to find not only answers to the questions he or she brings into the field, but also questions to explain what is being observed.

Participant Observation

Participant observation is the bedrock of doing and writing ethnography. It exists on a continuum from solely observing to fully participating. The ethnographer's point of entry, and how he or she moves along the continuum, is determined by the problem being explored, the situation, and his or her personality and research style. The ethnographer becomes "self-as-instrument," conscious of how his or her level of participant observation to collect varying levels of data affects objectivity. Data are collected in fieldnotes written in the moment as well as in reflection later. Therein, participant observation can be viewed as a lens, a way of seeing. Observations are made and theoretically intellectualized at the limitation of not making other observations or intellectualizing observations with another theoretical perspective. The strength of extensive participant observation is that everyone, members and ethnographer, likely assumes natural behaviors over prolonged fieldwork. Repeated observations with varying levels of participation provide the means for how "ethnography makes the exotic familiar and the familiar exotic."

Interviewing

A key assumption of traditional ethnographic research (i.e., anthropological, sociological) is that the cultural or social setting represents the sample. The members of a particular setting are sampled as part of the setting. Members' narratives are informally and formally elicited in relation to participant observation. Insightful fieldwork depends on both thoughtful, in-the-moment conversation and structured interviewing with predetermined questions or probes, because interviews can flesh out socially acquired messages. Yet interviewing is contingent on the cultural and social ethos, so it is not a given of fieldwork. Hence, a critical skill the ethnographer must learn is to discern when interviewing adds depth and understanding to participant observation and when interviewing interrupts focused fieldwork.
EVIDENCE-BASED DECISION MAKING
always work with a certain level of uncertainty. The need for making and providing proper interpretation of the data findings requires that error-prone humans acknowledge how a decision is made.

Much of the knowledge gained in what is known as the evidence-based movement comes from those in the medical field. Statisticians are an integral part of this paradigm because they aid in producing the evidence through various and appropriate methodologies, but they have yet to define and use terms encompassing the evidence-based mantra. Research methodology continues to advance, and this advancement contributes to a wider base of information. Because of this, the need continues to develop better approaches to the evaluation and utilization of research information. However, such advancements will continue to require that a decision be rendered.

In the clinical sciences, evidence-based decision making is defined as a type of informal decision-making process that combines a clinician's professional expertise with the patient's concerns and evidence gathered from scientific literature to arrive at a diagnosis and treatment recommendation. Milos Jenicek further clarified that evidence-based decision making is the systematic application of the best available evidence to the evaluation of options and to decision making in clinical, management, and policy settings.

Because there is no mutually agreed-upon definition of evidence-based decision making among statisticians, a novel definition is offered here. In statistical research, evidence-based decision making is defined as using the findings from the statistical measures employed and correctly interpreting the results, thereby reaching a rational conclusion. The evidence that the researcher compiles is viewed scientifically through the use of a defined methodology that values systematic as well as replicable methods for production.

An evidence-based decision-making process provides a more rational, credible basis for the decisions a researcher and/or clinician makes. In the clinical sciences, the value of an evidence-based decision is that it makes patient care more efficient by valuing the role the patient plays in the decision-making process. In statistical research, the value of an evidence-based approach is that it ensures that the methodology used, as well as the logic by which the researcher arrives at conclusions, is sound.

This entry explores the history of the evidence-based movement and the role of decision analysis in evidence-based decision making. In addition, algorithms and decision trees, and their differences, are examined.

History and Explanation of the Evidence-Based Movement

Evidence-based decision making stems from the evidence-based medicine movement that began in Canada in the late 1980s. David Sackett defined the paradigm of evidence-based medicine/practice as the conscientious, explicit, and judicious use of current best evidence about the care of individual patients. The evidence-based movement has grown rapidly. In 1992, there was one publication on evidence-based practices; by 1998, there were in excess of 1,000. The evidence-based movement continues to enjoy rapid growth in all areas of health care and is seeing headway made in education.

From Sackett's paradigm definition comes a model that encompasses three core pillars, all of which are equally weighted. These three areas are practitioner experience and expertise, evidence from quantitative research, and individual (patient) preferences.

The first pillar in an evidence-based model is the practitioner's individual expertise in his or her respective field. To make such a model work, the individual practitioner has to take into consideration biases, past experience, and training. Typically, the practitioner, be it a field practitioner or a doctoral-level statistician, has undergone some form of mentoring in his or her graduate years. Such a mentorship encourages what is known as the apprentice model, which, in and of itself, is authoritarian and can be argued to be completely subjective. The evidence-based decision-making model attempts to move the researcher to use the latest research findings on statistical methodology instead of relying on the more subjective authoritarian (mentor–student) model.
The second pillar in an evidence-based model for the researcher is the use of the latest research findings that are applicable to the person's field of study. It relies mainly on systematic reviews and meta-analysis studies, followed by randomized controlled trials. These types of studies have the highest value in the hierarchy of evidence. Prior to the evidence-based movement in the field of medicine, expert opinion and case studies, coupled with practitioner experience and inspired by their field-based mentor, formed much of the practice of clinical medicine.

The third pillar of an evidence-based model is patient preferences. In the clinical sciences, the need for an active patient as opposed to a passive patient has become paramount. Including the patient in the decision making about his or her own care instills in the patient a more active role. Such an active role by the patient is seen to strengthen the doctor–patient encounter. For the statistician and researcher, it would appear that such a model would not affect their efforts. However, appreciation of this pillar in the scientific research enterprise can be seen in human subject protection.

From this base paradigm arose the term evidence-based decision making. Previously, researchers made decisions based on personal observation, intuition, and authority, as well as belief and tradition. Although the researcher examined the evidence that was produced from the statistical formulas used, he or she still relied on personal observation, intuition, authority, belief, and tradition. Interpretation of statistical methods is only as good as the person making the interpretation of the findings.

Decision Analysis

Decision analysis is the discipline for addressing important decisions in a formal manner. It is composed of the philosophy, theory, methodology, and professional practice needed to meet this end. John Last suggested that decision analysis is derived from game theory, which tends to identify all available choices and the potential outcomes of each, a method that allows one to make such determinations. Jenicek suggested that decision analysis is not a direction-giving method but rather a direction-finding method. Direction-giving methods will be described later in the form of the decision tree and the algorithm.

Jenicek suggested that decision analysis has seven distinct stages. The first stage of decision analysis requires one to adequately define the problem. The second stage in the decision analysis process is to provide an answer to the question, "What is the question to be answered by decision analysis?" In this stage, true positive, true negative, false negative, and false positive results, as well as other things, need to be taken into consideration. The third stage in the process is the structuring of the problem over time and space. This stage encompasses several key aspects: the researcher must recognize the starting decision point, make an overview of possible decision options and their outcomes, and establish a temporo-spatial sequence. Deletion of unrealistic, impossible, or irrelevant options is also performed at this stage. The fourth stage in the process involves giving dimension to all the relevant components of the problem. This is accomplished by obtaining available data to figure out probabilities. Obtaining the best and most objective data for each relevant outcome is also performed here. The fifth stage is the analysis of the problem. The researcher will need to choose the best way through the available decision paths and evaluate the sensitivity of the preferred decision. This stage is marked by the all-important question, "What would happen if conditions of the decision were to change?" The final two stages are to solve the problem and to act according to the result of the analysis. In evidence-based decision making, the stages that involve the use of the evidence are what highlight this entire process. The decision in evidence-based decision making is hampered if the statistical data are flawed. For the statistician, research methodology using this approach will need to take into consideration efficacy (can it work?), effectiveness (does it
The novice researcher and/or statistician may work?), and efficiency (what does it cost in terms
not always know how to interpret results of a sta- of time and/or money for what it gives?).
tistical test. Moreover, statistical analysis can In evidence-based decision-making practice,
become more complicated because the inexperi- much criticism has been leveled at what may
enced researcher does not know which test is more appear to some as a reliance on statistical mea-
suitable for a given situation. Decision analysis is sures. It should be duly noted that those who
436 Evidence-Based Decision Making
follow the evidence-based paradigm realize that health, disease evolution, and policy manage-
evidence does not make the decision. However, ment. The analysis, which involves a decision to
those in the evidence-based movement acknowl- be made, leads to the best option. Choices and/
edge that valid and reliable evidence is needed to or options that are available at each stage in the
make a good decision. thinking process have been likened to branches
Jenicek has determined that decision analysis on a tree—a decision tree. The best option could
has its own inherent advantages and disadvan- be the most beneficial, most efficacious, and/or
tages. Advantages to decision analysis are that it is most cost-effective choice among the multiple
much less costly than the search for the best deci- choices to be made. The graphical representation
sion through experimental research. Such experi- gives the person who will make the decision
mental research is often sophisticated in design a method by which to find the best solution
and complex in execution and analysis. Another among multiple options. Such multiple options
advantage is that decision analysis can be easily can include choices, actions, and possible out-
translated into clinical decisions and public health comes, and their corresponding values.
policies. There is also an advantage in the educa- A decision tree is a classifier in the form of a tree
tional realm. Decision analysis is an important tool structure where a node is encountered. A decision
that allows students to better structure their think- tree can have either a leaf node or a decision node.
ing and to navigate the maze of the decision-mak- A leaf node is a point that indicates the value of
ing process. A disadvantage to decision analysis is a target attribute or class of examples. A decision
that it can be less valuable if the data and informa- node is a point that specifies some test to be car-
tion are of poor quality. ried out on a single attribute or value. From this,
a branch of the tree and/or subtree can represent
a possible outcome of a test or scenario. For exam-
Algorithm
ple, a decision tree can be used to classify a sce-
John Last defined an algorithm as any systematic nario by starting at the root of the tree and
process that consists of an ordered sequence of moving through it. The movement is temporarily
steps with each step depending on the outcome of halted when a leaf node, which provides a possible
the previous one. It is a term that is commonly outcome or classification of the instance, is
used to describe a structured process. It is a graphi- encountered.
cal representation commonly seen as a flow chart. Decision trees have several advantages. First
An algorithm can be described as a specific set of of all, decision trees are simple to understand
instructions for carrying out a procedure or solving and interpret. After a brief explanation, most
a problem. It usually requires that a particular pro- people are able to understand the model. Second,
cedure terminate at some point when questions are decision trees have a value attached to them even
readily answered in the affirmative or negative. An if very little hard data support them. Jenicek sug-
algorithm is, by its nature, a set of rules for solving gested that important insights can be generated
a problem in a finite number of steps. Other names based on experts describing a situation along
used to describe an algorithm have been method, with its alternative, probabilities, and cost, as
procedure, and/or technique. In decision analysis, well as the experts’ preference for a suitable out-
algorithms have also been defined as decision anal- come. Third, a decision tree can easily replicate
ysis algorithms. They have been argued to be best a result with simple math. The final advantage to
suited for clinical practice guidelines and for teach- a decision tree is that it can easily incorporate
ing. One of the criticisms of an algorithm is that it with other decision techniques. Overall, decision
can restrict critical thought. trees represent rules and provide a classification
as well as prediction. More important, the deci-
sion tree as a decision-making entity allows the
Decision Tree
researcher the ability to explain and argue why
A decision tree is a type of decision analysis. the reason for a decision is crucial. It should be
Jenicek defined a decision tree as a graphical rep- noted that not everything that has branches can
resentation of various options in such things as be considered a decision tree.
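The root-to-leaf traversal just described can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the entry; the attribute name and threshold are hypothetical.

```python
# A minimal sketch of the traversal described above. A decision node tests a
# single attribute; a leaf node holds the classification. The attribute
# ("fever") and its threshold are hypothetical examples.
tree = {
    "test": ("fever", 38.0),      # decision node: (attribute, threshold)
    "yes": {"leaf": "treat"},     # branch followed when the test passes
    "no": {"leaf": "observe"},    # branch followed otherwise
}

def classify(node, case):
    """Start at the root and move through the tree; halt at a leaf node."""
    while "leaf" not in node:
        attribute, threshold = node["test"]
        node = node["yes"] if case[attribute] > threshold else node["no"]
    return node["leaf"]

print(classify(tree, {"fever": 39.2}))  # -> treat
print(classify(tree, {"fever": 36.8}))  # -> observe
```

Deeper trees are built the same way: any branch may hold another decision node instead of a leaf, and the loop simply keeps moving until a leaf is encountered.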
Exclusion Criteria
hand, cotinine in blood may measure exposure to secondhand smoking, thus excluding subjects who should not be excluded; therefore, a combination of self-reported smoking and cotinine in blood may increase the sensitivity, specificity, validity, and reliability of such measurement, but it will be more costly and time consuming.

A definition of exclusion criteria that requires several measurements may be just as good as one using fewer measurements. Good validity and reliability of exclusion criteria will help minimize random error, selection bias, and confounding, thus improving the likelihood of finding an association, if there is one, between the exposures or interventions and the outcomes; it will also decrease the required sample size and allow representativeness of the sample population. Using standardized exclusion criteria is necessary to accomplish consistency, replicability, and comparability of findings across similar studies on a research topic. Standardized disease-scoring definitions are available for mental and general diseases (Diagnostic and Statistical Manual of Mental Disorders and International Classification of Diseases, respectively). Study results on a given research topic should carefully compare the exclusion criteria to analyze consistency of findings and applicability to sample and target populations. Exclusion criteria must be as parsimonious in number as possible; each additional exclusion criterion may decrease sample size and result in selection bias, thus affecting the internal validity of a study and the external validity (generalizability) of results, in addition to increasing the cost, time, and complexity of recruiting study participants. Exclusion criteria must be selected carefully based upon a review of the literature on the research topic, in-depth knowledge of the theoretical framework, and their feasibility and logistic applicability.

Research proposals submitted for institutional review board (IRB) approval should clearly describe exclusion criteria to potential study participants, as well as consequences, at the time of obtaining informed consent. Often, research protocol amendments that change the exclusion criteria will result in two different sample populations that may require separate data analyses with a justification for drawing composite inferences. Exceptions to exclusion criteria need to be approved by the IRB of the research institution; changes to exclusion criteria after IRB approval require new approval of any amendments.

In epidemiologic and clinical research, assessing an exposure or intervention under strict study conditions is called efficacy, whereas doing so in real-world settings is called effectiveness. Concerns have been raised about the ability to generalize the results from randomized clinical trials to a broader population, because participants are often not representative of those seen in clinical practice. Each additional exclusion criterion implies a different sample population and approaches the assessment of efficacy, rather than effectiveness, of the exposure or intervention under study, thus influencing the utility and applicability of study findings. For example, studies of treatment for alcohol abuse have shown that applying stringent exclusion criteria used in research settings to the population at large results in a disproportionate exclusion of African Americans, subjects with low socioeconomic status, and subjects with multiple substance abuse and psychiatric problems. Therefore, the use of more permissive exclusion criteria has been recommended for research studies on this topic so that results are applicable to broader, real-life populations. The selection and application of these exclusion criteria will also have important consequences on the assurance of ethical principles, because excluding subjects based on race, gender, socioeconomic status, age, or clinical characteristics may imply an uneven distribution of benefits and harms, disregard for the autonomy of subjects, and lack of respect. Researchers must strike a balance between stringent and more permissive exclusion criteria. On one hand, stringent exclusion criteria may reduce the generalizability of sample study findings to the target population, as well as hinder recruitment and sampling of study subjects. On the other hand, they will allow rigorous study conditions that will increase the homogeneity of the sample population, thus minimizing confounding and increasing the likelihood of finding a true association between exposure/intervention and outcomes. Confounding may result from the effects of concomitant medical conditions, use of medications other than the one under study, surgical or rehabilitation interventions, or changes in the severity of
disease in the intervention group or occurrence of disease in the nonintervention group.

Changes in the baseline characteristics of study subjects that will likely affect the outcomes of the study may also be stated as exclusion criteria. For example, women who need to undergo a specimen collection procedure involving repeated vaginal exams may be excluded if they get pregnant during the course of the study. In clinical trials, exclusion criteria identify subjects with an unacceptable risk of taking a given therapy or even a placebo (for example, subjects allergic to the placebo substance). Also, exclusion criteria will serve as the basis for contraindications to receive treatment (subjects with comorbidity or allergic reactions, pregnant women, children, etc.). Unnecessary exclusion criteria will result in withholding treatment from patients who may likely benefit from a given therapy and preclude the translation of research results into practice. Unexpected reasons for subjects' withdrawal or attrition after inception of the study are not exclusion criteria. An additional objective of exclusion criteria in clinical trials is enhancing the differences in effect between a drug and a placebo; to this end, subjects with short duration of the disease episode, those with mild severity of illness, and those who have a positive response to a placebo may be excluded from the study.

Eduardo Velasco

See also Bias; Confounding; Inclusion Criteria; Reliability; Sampling; Selection; Validity of Measurement; Validity of Research Conclusions

Further Readings

Gordis, L. (2008). Epidemiology (4th ed.). Philadelphia: W. B. Saunders.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., & Newman, T. B. (Eds.). (2007). Designing clinical research: An epidemiologic approach (3rd ed.). Philadelphia: Lippincott Williams & Wilkins.
LoBiondo-Wood, G., & Haber, J. (2006). Nursing research: Methods and critical appraisal for evidence-based practice (6th ed.). St. Louis, MO: Mosby.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Szklo, M., & Nieto, F. J. (2007). Epidemiology: Beyond the basics. Boston: Jones & Bartlett.

EXOGENOUS VARIABLES

Exogenous originated from the Greek words exo (meaning "outside") and gen (meaning "born"), and describes something generated from outside a system. It is the opposite of endogenous, which describes something generated from within the system. Exogenous variables, therefore, are variables that are not caused by any other variables in a model of interest; in other words, their value is not determined within the system being studied.

The concept of exogeneity is used in many fields, such as biology (an exogenous factor is a factor derived or developed from outside the body); geography (an exogenous process takes place outside the surface of the earth, such as weathering, erosion, and sedimentation); and economics (an exogenous change is a change coming from outside the economic model, such as changes in customers' tastes or income for a supply-and-demand model). Exogeneity has both statistical and causal interpretations in the social sciences. The following discussion focuses on the causal interpretation of exogeneity.

Exogenous Variables in a System

Although exogenous variables are not caused by any other variables in a model of interest, they may cause the change of other variables in the model. In the specification of a model, exogenous variables are usually labeled with Xs and endogenous variables are usually labeled with Ys. Exogenous variables are the "input" of the model, predetermined or "given" to the model. They are also called predictors or independent variables. The following is an example from educational research. Family income is an exogenous variable to the causal system consisting of preschool attendance and student performance in elementary school. Because family income is determined by neither a student's preschool attendance nor elementary school performance, family income is an exogenous variable to the system being studied. On the other hand, students' family income may determine both preschool attendance and
See also Cause and Effect; Endogenous Variables; Path Analysis

Further Readings

Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge, UK: Cambridge University Press.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York: Houghton Mifflin.

EXPECTED VALUE

The expected value is the mean of all values of a random variable weighted by the probability of the occurrence of the values. The expected value (or expectation, or mean) of random variable (RV) X is denoted as E[X] (or sometimes μ). For a discrete RV X with probability mass function (pmf) p(x) defined on sample space Ω,

E[X] = Σ_{x ∈ Ω} x · p(x),

and this expected value exists if the sum is absolutely convergent, that is, if Σ_{x ∈ Ω} |x| · p(x) is finite.

As a simple example, suppose X can take on two values, 0 and 1, which occur with probabilities .4 and .6. Then

E[X] = 0 × p(x = 0) + 1 × p(x = 1) = 0 × 0.4 + 1 × 0.6 = 0.6.

If X instead has a probability density function (pdf) f(x) of a certain probability distribution, the expected value of X can be formulated as

E[X] = ∫_{x ∈ Ω} x · f(x) dx.

If G(X) is a function of RV X, its expected value, E[G(X)], is a weighted average of the possible values of G(X) and is defined as

E[G(X)] = ∫_{x ∈ Ω} G(x) · f(x) dx   (continuous case)
E[G(X)] = Σ_{x ∈ Ω} G(x) · p(x)   (discrete case).

This expected value exists if the corresponding sum or integral of |G(x)| is absolutely convergent. The expectation is also a linear operator:

E[Σ_{i=1}^{n} c_i X_i + c_0] = Σ_{i=1}^{n} c_i E[X_i] + c_0 = Σ_{i=1}^{n} c_i μ_i + c_0.   (3)
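The discrete definitions above are straightforward to check numerically. The following sketch (not from the entry) computes E[X] for the two-point pmf in the example, and E[G(X)] for an arbitrary function G:

```python
# Discrete expected value: E[G(X)] = sum of G(x) * p(x) over the support.
pmf = {0: 0.4, 1: 0.6}  # the two-point example from the text

def expectation(pmf, g=lambda x: x):
    return sum(g(x) * p for x, p in pmf.items())

print(expectation(pmf))                       # E[X] = 0*0.4 + 1*0.6 = 0.6
# Linearity (equation 3 with n = 1): E[c1*X + c0] = c1*E[X] + c0
print(expectation(pmf, lambda x: 3 * x + 1))  # 3*0.6 + 1 = 2.8
```

The same helper covers any G(X) with finite support, since the discrete case reduces to a weighted sum over the pmf.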
Interpretation

From a statistical point of view, the following terms are important: arithmetic mean (or simply mean), central tendency, and location statistic. The arithmetic mean of X is the summation of the set of observations (sample) of X = {x1, x2, . . . , xN} divided by the sample size N:

X̄ = (1/N) Σ_{i=1}^{N} x_i.

X̄ is called the arithmetic mean, sample mean, or average when used to estimate the location of a sample. When it is used to estimate the location of an underlying distribution, X̄ is called the population mean (average), or the expectation (expected value), which can be denoted as E[X] or μ. This is consistent with the original definition because the probability of each value's occurrence is equal to 1/N. One could construct a different estimate of the mean, for example, if some values were expected to occur more frequently than others. Because a researcher rarely has such information, the simple mean is most commonly used.

Moment

The moment (a characteristic of a distribution) of X about the real number c is defined as

E[(X − c)^n], ∀ c ∈ R and integer n ≥ 1.

When c is the mean of X, these moments are called central moments. E[X] is called the first moment (n = 1) of X about c = 0, which is commonly called the mean of X. The second moment about the mean of X is called the variance of X. Theoretically, the entire distribution of X can be described if all moments of X are known, by using the moment-generating functions, although only the first five moments are generally necessary to specify a distribution completely. The third moment is termed skewness, and the fourth is called kurtosis.

Joint Expected Value

If Xi, Xj are continuous random variables with the joint pdf f(xi, xj), the expected value of G(Xi, Xj) is

E[G(Xi, Xj)] = ∫_{xi ∈ Ωi} ∫_{xj ∈ Ωj} G(xi, xj) · f(xi, xj) dxi dxj, ∀ i, j ∈ N and i ≠ j.

If Xi, Xj are discrete random variables with the joint pmf p(xi, xj), the expected value of G(Xi, Xj) is

E[G(Xi, Xj)] = Σ_{xi ∈ Ωi} Σ_{xj ∈ Ωj} G(xi, xj) · p(xi, xj), ∀ i, j ∈ N and i ≠ j.

Conditional Expected Value

Given that Xi, Xj are continuous random variables with the joint pdf f(xi, xj) and f_{Xj}(xj) > 0, the conditional pdf of Xi given Xj = xj is

f_{Xi|Xj}(xi|xj) = f(xi, xj) / f_{Xj}(xj), ∀ xi,

and the corresponding conditional expected value is

E[Xi | Xj = xj] = ∫_{xi ∈ Ωi} xi · f_{Xi|Xj}(xi|xj) dxi.

Given that Xi, Xj are discrete random variables with the joint pmf p(xi, xj) and p_{Xj}(xj) > 0, the conditional pmf of Xi given Xj = xj is

p_{Xi|Xj}(xi|xj) = p(xi, xj) / p_{Xj}(xj), ∀ xi,

and the corresponding conditional expected value is

E[Xi | Xj = xj] = Σ_{xi ∈ Ωi} xi · p_{Xi|Xj}(xi|xj).

Furthermore, the expectation of the conditional expectation of Xi given Xj is simply the expectation of Xi:

E[E[Xi | Xj]] = E[Xi].
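For the discrete case, the conditional expectation and the iterated-expectation identity can be verified directly from a joint pmf. The numbers below are hypothetical, chosen only to illustrate the definitions:

```python
# Conditional expectation from a joint pmf (hypothetical numbers), plus a
# numerical check of the identity E[E[Xi|Xj]] = E[Xi].
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}  # p(xi, xj)

def p_xj(xj):
    """Marginal pmf of Xj: sum the joint pmf over xi."""
    return sum(p for (xi, xj_), p in joint.items() if xj_ == xj)

def e_xi_given(xj):
    """E[Xi | Xj = xj] = sum of xi * p(xi | xj)."""
    return sum(xi * p for (xi, xj_), p in joint.items() if xj_ == xj) / p_xj(xj)

e_xi = sum(xi * p for (xi, xj), p in joint.items())         # E[Xi] = 0.6
iterated = sum(p_xj(xj) * e_xi_given(xj) for xj in (0, 1))  # also 0.6
print(e_xi, iterated)
```

Weighting each conditional mean E[Xi | Xj = xj] by the marginal probability p(xj) recovers the unconditional mean, which is exactly what the identity asserts.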
expected value of Xi is less than or equal to that of Xj:

If Xi ≤ Xj, then E[Xi] ≤ E[Xj].   (15)

The absolute value of the expectation of a random variable X is less than or equal to the expected value of its absolute value:

|E[X]| ≤ E[|X|].

For observations {(x_i, y_i)}, the least squares (LSE) estimates of the simple linear regression coefficients are

b̂1 = [ (1/N) Σ_{i=1}^{N} y_i x_i − (1/N) Σ_{i=1}^{N} y_i · (1/N) Σ_{i=1}^{N} x_i ] / [ (1/N) Σ_{i=1}^{N} x_i² − ( (1/N) Σ_{i=1}^{N} x_i )² ]

and

b̂0 = (1/N) Σ_{i=1}^{N} y_i − b̂1 · (1/N) Σ_{i=1}^{N} x_i.

If the random process is stationary (i.e., constant mean and time shift-invariant covariance) and ergodic (i.e., the time average (1/N) Σ_{i=1}^{N} x_i converges in the mean-square sense to the ensemble average E[X]), the LSE estimator will be asymptotically equal to the MMSE estimator.

MMSE estimation is not the only possibility for the expected value. Even in simple data sets, other statistics, such as the median or mode, can be used to estimate the expected value. They have properties, however, that make them suboptimal and inefficient under many situations. Other estimation approaches include the mean of a Bayesian posterior distribution, the maximum likelihood estimator, or the biased estimator in ridge regression. In modern statistics, these alternative estimation methods are being increasingly used. For a normally distributed random variable, all will typically yield the same value except for computational variation in computer routines, usually very small in magnitude. For non-normal distributions, various other considerations come into play, such as the type of distribution encountered, the amount of data available, and the existence of the various moments.

Expectations and Variances for Well-Known Distributions

The following table lists distribution characteristics, including the expected value and variance of X, for well-known discrete and continuous distributions.

Victor L. Willson and Jiun-Yu Wu

See also Chi-Square Test; Sampling Distributions; Sampling Error

Further Readings

Glass, G. (1964). [Review of the book Expected values of discrete random variables and elementary statistics, by A. L. Edwards]. Educational and Psychological Measurement, 24, 969–971.

EXPERIENCE SAMPLING METHOD

The experience sampling method (ESM) is a strategy for gathering information from individuals about their experience of daily life as it occurs. The method can be used to gather both qualitative and quantitative data, with questions for participants that are tailored to the purpose of the research. It is a phenomenological approach, meaning that the individual's own thoughts, perceptions of events, and allocation of attention are the primary objects of study. In the prototypical application, participants in an ESM study are asked to carry with them for 1 week a signaling device such as an alarm wristwatch or palmtop computer and a recording device such as a booklet of questionnaires. Participants are then signaled randomly 5 to 10 times daily, and at each signal, they complete a questionnaire. Items elicit information regarding the participants' location at the moment of the signal, as well as their activities, thoughts, social context, mood, cognitive efficiency, and motivation. Researchers have used ESM to study the effects of television viewing on mood and motivation, the dynamics of family relations, the development of adolescents, the experience of engaged enjoyment (or flow), and many mental and physical health issues. Other terms for ESM include time sampling, ambulatory assessment, and ecological momentary assessment; these terms may or may not signify the addition of other types of measures, such as physiological markers, to the protocol.
participant is signaled at a random moment during each segment. Other possibilities are to signal the participant at the same times every day (interval-contingent sampling) or to ask the participant to respond after every occurrence of a particular event of interest (event-contingent sampling). The number of times per day and the number of days that participants are signaled are parameters that can be tailored based on the research purpose and practical matters.

Increasingly, researchers are using palmtop computers as both the signaling device and the recording device. The advantages here are the direct electronic entry of the data, the ability to time-stamp each response, and the ease of programming a signaling schedule. Disadvantages include the difficulty in obtaining open-ended responses and the high cost of the devices. When a wristwatch or pager is used as the signaling device and a pen with a booklet of blank questionnaires serves as the recording device, participants can be asked open-ended questions such as "What are you doing?" rather than be forced to choose among a list of activity categories. This method is less costly but does require more coding and data entry labor. Technology appears to be advancing to the point where an inexpensive electronic device will emerge that will allow the entry of open-ended responses with ease, perhaps like text-messaging on a mobile phone.

Analysis of ESM Data

Data resulting from an ESM study are complex, including many repeated responses to each question. Responses from single items are also often combined to form multi-item scales to measure constructs such as mood or intrinsic motivation. Descriptive information, such as means and frequencies, can be computed at the response level, meaning that each response is treated as one case in the data. However, it is also useful to aggregate the data by computing means within each person and percentages of responses falling in categories of interest (e.g., when with friends). Often, z-scored variables standardized to each person's own mean and standard deviation are computed to get a sense of how individuals' experiences in one context differ from their average levels of experiential quality. To avoid the problem of nonindependence in the data, person-level variables are preferred when using inferential statistical techniques such as analysis of variance or multiple regression. More complex procedures, such as hierarchical linear modeling, multilevel modeling, or mixed-effects random regression analysis, allow the researcher to consider response-level and person-level effects simultaneously.

Studies Involving ESM

Mihaly Csikszentmihalyi was a pioneer of the method when he used pagers in the 1970s to study a state of optimal experience he called flow. Csikszentmihalyi and his students found that when people experienced a high level of both challenges and skills simultaneously, they also frequently had high levels of enjoyment, concentration, engagement, and intrinsic motivation. To study adolescents' family relationships, Reed Larson and Maryse Richards signaled adolescents and their parents simultaneously. The title of their book, Divergent Realities, telegraphs one of their primary conclusions. Several researchers have used ESM to study patients with mental illness, with many finding that symptoms worsened when people were alone with nothing to do. Two paradoxes exposed by ESM research are that people tend to retrospectively view their work as more negative and TV-watching as more positive experiences than what they actually report when signaled in the moment while doing these activities.

Joel M. Hektner

See also Ecological Validity; Hierarchical Linear Modeling; Levels of Measurement; Multilevel Modeling; Standardized Score; z Score

Further Readings

Bolger, N., Davis, A., & Rafaeli, E. (2003). Diary methods: Capturing life as it is lived. Annual Review of Psychology, 54, 579–616.
Hektner, J. M., Schmidt, J. A., & Csikszentmihalyi, M. (2007). Experience sampling method: Measuring the quality of everyday life. Thousand Oaks, CA: Sage.
Reis, H. T., & Gable, S. L. (2000). Event-sampling and other methods for studying everyday experience. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 190–222). New York: Cambridge University Press.
Shiffman, S. (2000). Real-time self-report of momentary states in the natural environment: Computerized ecological momentary assessment. In A. A. Stone, J. S. Turkkan, C. A. Bachrach, J. B. Jobe, H. S. Kurtzman, & V. S. Cain (Eds.), The science of self-report: Implications for research and practice (pp. 276–293). Mahwah, NJ: Lawrence Erlbaum.
Walls, T. A., & Schafer, J. L. (Eds.). (2006). Models for intensive longitudinal data. New York: Oxford University Press.

Experimental Design

age excludes itself from being an explanation of the data.

There are numerous extraneous variables, any one of which may potentially be an explanation of the data. Ambiguity of this sort is minimized with appropriate control procedures, an example of which is random assignment of subjects to the two conditions. The assumption is that, in the long run, the effects of unsuspected confounding variables may be balanced between the two conditions.
Genres of Experimental Designs for Data Analysis Purposes
one may ascertain trends in the data when a factor has three or more levels (see Row b). Specifically, a minimum of three levels is required for ascertaining a linear trend, and a minimum of four levels for a quadratic trend.

Two-Factor Designs

Suppose that Factors A (e.g., room color) and B (e.g., room size) are used together in an experiment. Factor A has m levels; its two levels are a1 and a2 when m = 2. If Factor B has n levels (and if n = 2), the two levels of B are b1 and b2. The experiment has a factorial design when every level of A is combined with every level of B to define a test condition or treatment combination. The size of the factorial design is m by n; it has m-by-n treatment combinations. This notation may be generalized to reflect a factorial design of any size. Specifically, the number of integers in the name of the design indicates the number of independent variables, whereas the identities of the integers stand for the respective number of levels. For example, the name of a three-factor design is m by n by p; the first independent variable has m levels, the second has n levels, and the third has p levels (see Row d of Table 2).

The lone statistical question of a one-factor, two-level design (see Row a of Table 2) is asked separately for Factors A and B in the case of the two-factor design (see [a] and [b] in Row c of Table 2). Either of them is a main effect (see [a] and [b] in Row c), so as to distinguish it from a simple effect (see Row c). This distinction may be illustrated with Table 3.

Main Effect

Assume an equal number of subjects in all treatment combinations. The means of a1 and a2 are 4.5 and 2.5, respectively (see the "Mean of ai" column in either panel of Table 3). The main effect of A is 2 (i.e., 4.5 − 2.5). In the same vein, the means of b1 and b2 are 4 and 3, respectively (see the "Mean of bj" row in either panel of Table 3). The
main effect of B is 1. That is, the two levels of B (or A) are averaged when the main effect of A (or B) is being considered.

Simple Effect

Given that there are two levels of A (or B), it is possible to ask whether or not the two levels of B (or A) differ at either level of A (or B). Hence, there are the entries d3 and d4 in the "Simple effect of B at ai" column, and the entries d1 and d2 in the "Simple effect of A at bj" row, in either panel of Table 3. Those entries are the four simple effects of the 2-by-2 factorial experiment. They may be summarized as follows:

d1 = Simple effect of A at b1 is (ab11 − ab21) = (5 − 3) = 2;
d2 = Simple effect of A at b2 is (ab12 − ab22) = (4 − 2) = 2;
d3 = Simple effect of B at a1 is (ab12 − ab11) = (4 − 5) = −1;
d4 = Simple effect of B at a2 is (ab22 − ab21) = (2 − 3) = −1.

AB Interaction

In view of the fact that there are two simple effects of A (or B), it is important to know whether or not they differ. Consequently, the effects noted above give rise to the following questions:

[Q1] (DofD)12: Is d1 − d2 = 0?

[Q2] (DofD)34: Is d3 − d4 = 0?
450 Experimental Design
Given that d1 d2 ¼ 0, one is informed that them are assigned randomly to each of the six
the effect of Variable A is independent of that of treatment combinations of a 2-by-3 factorial
Variable B. By the same token, that d3 d4 ¼ experiment. It is called the completely randomized
0 means that the effect of Variable B is indepen- design, but more commonly known as an unre-
dent of that of Variable A. That is to say, when lated sample (or an independent sample) design
the answers to both [Q1] and [Q2] are ‘‘Yes,’’ when there are only two levels to a lone indepen-
the joint effects of Variables A and B on the dent variable.
dependent variable are the sum of the individual
effects of Variables A and B. Variables A and B
Repeated Measures Design
are said to be additive in such an event.
Panel (b) of Table 3 illustrates a different sce- All subjects are tested in all treatment combina-
nario. The answers to both [Q1] and [Q2] are tions in a repeated measures design. It is known by
‘‘No.’’ It informs one that the effects of Variable the more familiar name related samples or depen-
A (or B) on the dependent variable differ at dif- dent samples design when there are only two levels
ferent levels of Variable B (or A). In short, it is to a lone independent variable. The related sam-
learned from a ‘‘No’’ answer to either [Q1] or ples case may be used to illustrate one complica-
[Q2] (or both) that the joint effects of Variables tion, namely, the potential artifact of the order of
A and B on the dependent variables are nonaddi- testing effect.
tive in the sense that their joint effects are not Suppose that all subjects are tested at Level I (or
the simple sum of the two separate effects. Vari- II) before being tested at Level II (or I). Whatever
ables A and B are said to interact (or there is the outcome might be, it is not clear whether the
a two-way AB interaction) in such an event. result is due to an inherent difference between
Levels I and II or to the proactive effects of the level
Multifactor Designs used first on the performance at the subsequent
level of the independent variable. For this reason,
What has been said about two-factor designs a procedure is used to balance the order of testing.
also applies to designs with three or more indepen- Specifically, subjects are randomly assigned to
dent variables (i.e., multifactor designs). For exam- two subgroups. Group 1 is tested with one order
ple, in the case of a three-factor design, it is (e.g., Level I before Level II), whereas Group 2 is
possible to ask questions about three main effects tested with the other order (Level II before Level I).
(A, B, and C); three 2-way interaction effects (AB, The more sophisticated Latin square arrangement is
AC, and BC interactions); a set of simple effects used to balance the order of test when there are
(e.g., the effect of Variable C at different treatment three or more levels to the independent variable.
combinations of AB, etc.); and a three-way inter-
action (viz., ABC interaction).
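The main-effect, simple-effect, and difference-of-differences arithmetic illustrated with Table 3 above can be sketched in a few lines of Python. The cell means (5, 4, 3, 2) are the hypothetical values of panel (a) of Table 3; everything else is plain arithmetic, not any particular statistical package.

```python
# Main effects, simple effects, and the difference-of-differences
# (interaction) check for a 2-by-2 factorial design, using the
# hypothetical cell means of panel (a) of Table 3.
means = {
    ("a1", "b1"): 5, ("a1", "b2"): 4,   # ab11, ab12
    ("a2", "b1"): 3, ("a2", "b2"): 2,   # ab21, ab22
}

# Main effects: average over the levels of the other factor,
# then take the difference between the two marginal means.
mean_a1 = (means[("a1", "b1")] + means[("a1", "b2")]) / 2   # 4.5
mean_a2 = (means[("a2", "b1")] + means[("a2", "b2")]) / 2   # 2.5
main_effect_A = mean_a1 - mean_a2

mean_b1 = (means[("a1", "b1")] + means[("a2", "b1")]) / 2   # 4.0
mean_b2 = (means[("a1", "b2")] + means[("a2", "b2")]) / 2   # 3.0
main_effect_B = mean_b1 - mean_b2

# The four simple effects d1-d4, as defined in the entry.
d1 = means[("a1", "b1")] - means[("a2", "b1")]   # A at b1: 5 - 3
d2 = means[("a1", "b2")] - means[("a2", "b2")]   # A at b2: 4 - 2
d3 = means[("a1", "b2")] - means[("a1", "b1")]   # B at a1: 4 - 5
d4 = means[("a2", "b2")] - means[("a2", "b1")]   # B at a2: 2 - 3

# [Q1] and [Q2]: additivity holds when both differences of
# differences are zero, i.e., no AB interaction.
additive = (d1 - d2 == 0) and (d3 - d4 == 0)
print(main_effect_A, main_effect_B, d1, d2, d3, d4, additive)
```

With these panel (a) values, both differences of differences are zero, so A and B are additive; panel (b) of Table 3 would yield unequal simple effects and hence an AB interaction.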
Randomized Block Design
Table 4. It is the underlying inductive rule when a qualitative independent variable is used (e.g., room color).

In short, an experimental design is a stipulation of the formal arrangement of the independent, control, and dependent variables, as well as the control procedure, of an experiment. Underlying every experimental design is an inductive rule that reduces ambiguity by rendering it possible to exclude alternative interpretations of the result. Each control variable or control procedure excludes one alternative explanation of the data.

Siu L. Chow

See also Replication; Research Hypothesis; Rosenthal Effect

Further Readings

Boring, E. G. (1954). The nature and history of experimental control. American Journal of Psychology, 67, 573–589.
Chow, S. L. (1992). Research methods in psychology: A primer. Calgary, Alberta, Canada: Detselig.
Mill, J. S. (1973). A system of logic: Ratiocinative and inductive. Toronto, Ontario, Canada: University of Toronto Press.

EXPERIMENTER EXPECTANCY EFFECT

The experimenter's expectancy effect is an important component of the social psychology of the psychological experiment (SPOPE), whose thesis is that conducting or participating in research is a social activity that might be affected subtly by three social or interpersonal factors, namely, demand characteristics, subject effects, and the experimenter's expectancy effects. These artifacts call into question the credibility, generality, and objectivity, respectively, of research data. However, these artifacts may be better known as the social psychology of nonexperimental research (SPONE) because they apply only to nonexperimental research.

The SPOPE Argument

Willing to participate and being impressed by the aura of scientific investigation, research participants may do whatever is required of them. This demand characteristics artifact creates credibility issues in the research data. The subject effect artifact questions the generalizability of research data. This issue arises because participants in the majority of psychological research are volunteering tertiary-level students who may differ from the population at large.

As an individual, a researcher has profound effects on the data. Any personal characteristics of the researcher may affect research participants (e.g., ethnicity, appearance, demeanor). Having vested interests in certain outcomes, researchers approach their work from particular theoretical perspectives. These biases determine in some subtle and insidious ways how researchers might behave in the course of conducting research. This is the experimenter expectancy effect artifact.

At the same time, the demand characteristics artifact predisposes research participants to pick up cues about the researcher's expectations. Being obligingly ingratiatory, research participants "cooperate" with the researcher to obtain the desired results. The experimenter expectancy effect artifact detracts research conclusions from their objectivity.

SPOPE Revisited—SPONE

Limits of Goodwill

Although research participants bear goodwill toward researchers, they may not (and often cannot) fake responses to please the researcher as implied in the SPOPE thesis.

To begin with, research participants might give untruthful responses only when illegitimate features in the research procedure render it necessary and possible. Second, it is not easy to fake responses without being detected by the researcher, especially when performance is measured with a well-defined task (e.g., the attention span task). Third, it is not possible to fake performance that exceeds the participants' capability.

Nonexperiment Versus Experiment

Faking on the part of research participants is not an issue when experimental conclusions are based on subjects' differential performance on the attention span task in two or more conditions with proper controls. Suppose that a properly selected
Table 1

                                        Control Variable
Test Condition  Independent Variable    IQ      Sex   Age     Control Procedure    Dependent Variable
Experimental    Drug                    Normal  M     12–15   Random assignment    Longer attention span
Control         Placebo                 Normal  M     12–15   Repeated measures    Shorter attention span
Table 2  A Schematic Representation of the Design of Goldstein, Rosnow, Goodstadt, and Suls's (1972) Study of Verbal Conditioning

                 Experimental Condition    Control Condition
                 (Knowledgeable of         (Not knowledgeable of    Mean of      Difference Between
Subject Group    verbal conditioning)      verbal conditioning)     Two Means    Two Means
Volunteers       X̄1 (6)                    X̄2 (3)                   X̄ (4.5)      d1 = X̄1 − X̄2 = (6 − 3) = 3
Nonvolunteers    Ȳ1 (4)                    Ȳ2 (1)                   Ȳ (2.5)      d2 = Ȳ1 − Ȳ2 = (4 − 1) = 3

Source: Goldstein, J. J., Rosnow, R. L., Goodstadt, B., & Suls, J. M. (1972). The "good subject" in verbal operant conditioning research. Journal of Experimental Research in Personality, 6, 29–33.
Notes: Hypothetical mean increase in the number of first-person pronouns used. The numbers in parentheses are added for illustrative purposes only. They were not Goldstein et al.'s (1972) data.
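The arithmetic behind Table 2, the marginal "Mean of Two Means" column and the difference-of-differences contrast that a genuine subject effect would require, can be verified with a short sketch. The numbers are the hypothetical illustrative values from the table, not Goldstein et al.'s data.

```python
# Hypothetical cell means from Table 2 (mean increase in the number
# of first-person pronouns): rows are volunteering status, columns
# are knowledgeability of the verbal-conditioning principle.
cells = {
    ("volunteer", "knows"): 6, ("volunteer", "naive"): 3,
    ("nonvolunteer", "knows"): 4, ("nonvolunteer", "naive"): 1,
}

# Row means ("Mean of Two Means"): the main effect of volunteering
# status compares these two marginal means.
mean_vol = (cells[("volunteer", "knows")] + cells[("volunteer", "naive")]) / 2
mean_nonvol = (cells[("nonvolunteer", "knows")] + cells[("nonvolunteer", "naive")]) / 2

# Row differences d1 and d2 ("Difference Between Two Means").
d1 = cells[("volunteer", "knows")] - cells[("volunteer", "naive")]
d2 = cells[("nonvolunteer", "knows")] - cells[("nonvolunteer", "naive")]

# A zero difference of differences means no two-way interaction:
# the data show only a pre-existing group difference.
interaction_contrast = d1 - d2
print(mean_vol, mean_nonvol, d1, d2, interaction_contrast)
```

Because d1 equals d2 here, the contrast is zero: the significant main effect of volunteering merely restates a pre-existing individual difference, which is exactly the point the entry makes about Table 2.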
sample of boys is assigned randomly to the two conditions in Table 1. Further suppose that one group fakes to do well, and the other fakes to do poorly. Nonetheless, it is unlikely that the said unprincipled behavior would produce the difference between the two conditions desired by the experimenter.

Difference Is Not Absolute

Data obtained from college or university students do not necessarily lack generality. For example, students also have two eyes, two ears, one mouth, and four limbs like typical humans have. That is, it is not meaningful to say simply that A differs from B. It is necessary to make explicit (a) the dimension on which A and B differ, and (b) the relevancy of the said difference to the research in question.

It is also incorrect to say that researchers employ tertiary students as research participants simply because it is convenient to do so. On the contrary, researchers select participants from special populations in a theoretically guided way when required. For example, they select boys with normal IQ within a certain age range when they study hyperactivity. More important, experimenters assign subjects to test conditions in a theoretically guided way (e.g., completely at random).

Individual Differences Versus Their Effects on Data

Data shown in Table 2 have been used to support the subject effect artifact. The experiment was carried out to test the effect of volunteering on how fast one could learn. Subjects were verbally reinforced for uttering first-person pronouns. Two subject variables were used (volunteering status and knowledgeability of the conditioning principle).

Of interest is the statistically significant main effect of volunteering status. Those who volunteered were conditioned faster than those who did not. Note that the two levels of any subject variable are, by definition, different. Hence, the significant main effect of volunteering status is not surprising (see the "Mean of Two Means" column in Table 2). It merely confirms a pre-existing individual difference, but not the required effect of individual differences on experimental data. The data do not support the subject effect artifact because the required two-way interaction between volunteering status and knowledgeability of conditioning is not significant.

Experiment Versus Meta-Experiment

R. Rosenthal and K. L. Fode were the investigators in Table 3 who instructed A, B, C, and D
Table 3 The Design of Rosenthal and Fode’s (1963) Experimental Study of Expectancy
[…] experimenter expectancy effect artifact (i.e., 4.05 versus 0.95).

Although Rosenthal and Fode are experimenters, A, B, C, and D are not. All of them collected absolute measurement data in one condition only, not collecting experimental data of differential performance. To test the experimenter expectancy effect artifact in a valid manner, the investigators in Table 3 must give each of A, B, C, and D an experiment to conduct. That is, the experimenter expectancy effect artifact must be tested with a meta-experiment (i.e., an experiment about experiments), an example of which is shown in panel (a) of Table 4.

Regardless of the expectancy manipulation (positive [+5], neutral [0], or negative [−5]), Chow gave each of A, B, F, G, P, and Q an experiment to conduct. That is, every one of them obtained from his or her own group of subjects the differential performance on the photo-rating task between two conditions (Happy Face vs. Sad Face).

It is said in the experimenter expectancy effect argument that subjects behave in the way the experimenter expects. As such, that statement is too vague to be testable. Suppose that a sad face was presented. Would both the experimenter (e.g., A or Q in Table 4) and the subjects (individuals tested by A or Q) ignore that it was a sad (or happy) face and identify it as "successful" (or "unsuccessful") under the "+5" (or "−5") condition? Much depends on the consistency between A's or Q's expectation and the nature of the stimulus (e.g., happy or sad faces), as both A (or Q) and his or her subjects might moderate or exaggerate their responses.

Final Thoughts

SPOPE is so called because the distinction between experimental and nonexperimental empirical research has not been made, as a result of not appreciating the role of control in empirical research. Empirical research is an experiment only when three control features are properly instituted (a valid comparison baseline, constancy of conditions, and procedures for eliminating artifacts). Demand characteristics, the participant effect, and the expectancy effect may be true of nonexperimental research in which no control is used, in which case it is more appropriate to characterize the SPOPE phenomenon as SPONE. Those putative artifacts are not applicable to experimental studies.

Siu L. Chow

See also Experimental Design

Further Readings

Chow, S. L. (1992). Research methods in psychology: A primer. Calgary, Alberta, Canada: Detselig.
Chow, S. L. (1994). The experimenter's expectancy effect: A meta experiment. Zeitschrift für Pädagogische Psychologie (German Journal of Educational Psychology), 8, 89–97.
Goldstein, J. J., Rosnow, R. L., Goodstadt, B., & Suls, J. M. (1972). The "good subject" in verbal operant conditioning research. Journal of Experimental Research in Personality, 6, 29–33.
Orne, M. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17, 776–783.
Rosenthal, R., & Fode, K. L. (1963). Three experiments in experimenter bias. Psychological Reports, 12, 491–511.

EXPLORATORY DATA ANALYSIS

Exploratory data analysis (EDA) is a data-driven conceptual framework for analysis that is based primarily on the philosophical and methodological work of John Tukey and colleagues, which dates back to the early 1960s. Tukey developed EDA in response to psychology's overemphasis on hypothetico-deductive approaches to gaining insight into phenomena, whereby researchers focused almost exclusively on the hypothesis-driven techniques of confirmatory data analysis (CDA). EDA was not developed as a substitute for CDA; rather, its application is intended to satisfy a different stage of the research process. EDA is a bottom-up approach that focuses on the initial exploration of data; a broad range of methods are used to develop a deeper understanding of the data, generate new hypotheses, and identify patterns in the data. In contrast, CDA techniques are of greater value at a later stage when the emphasis is on
testing previously generated hypotheses and confirming predicted patterns. Thus, EDA offers a different approach to analysis that can generate valuable information and provide ideas for further investigation.

Ethos

A core goal of EDA is to develop a detailed understanding of the data and to consider the processes that might produce such data. Tukey used the analogy of EDA as detective work because the process involves the examination of facts (data) for clues, the identification of patterns, the generation of hypotheses, and the assessment of how well tentative theories and hypotheses fit the data.

EDA is characterized by flexibility, skepticism, and openness. Flexibility is encouraged as it is seldom clear which methods will best achieve the goals of the analyst. EDA encourages the use of statistical and graphical techniques to understand data, and researchers should remain open to unanticipated patterns. However, as summary measures can conceal or misrepresent patterns in data, EDA is also characterized by skepticism. Analysts must be aware that different methods emphasize some aspects of the data at the expense of others; thus, the analyst must also remain open to alternative models of relationships.

If an unexpected data pattern is uncovered, the analyst can suggest plausible explanations that are further investigated using confirmatory techniques. EDA and CDA can supplement each other: Where the abductive approach of EDA is flexible and open, allowing the data to drive subsequent hypotheses, the more ambitious and focused approach of CDA is hypothesis-driven and facilitates probabilistic assessments of predicted patterns. Thus, a balance is required between an exploratory and a confirmatory lens being applied to data; EDA comes first, and ideally, any given study should combine both.

Methods

EDA techniques are often classified in terms of the four Rs: revelation, residuals, reexpression, and resistance. However, it is not the use of a technique per se that determines whether it is EDA, but the purpose for which it is used—namely, to assist the development of rich mental models of the data.

Revelation

EDA encourages the examination of different ways of describing the data to understand inherent patterns and to avoid being fooled by unwarranted assumptions.

Data Description

The use of summary descriptive statistics offers a concise representation of data. EDA relies on resistant statistics, which are less affected by deviant cases. However, such statistics involve a trade-off between being concise versus precise; therefore, an analyst should never rely exclusively on statistical summaries. EDA encourages analysts to examine data for skewness, outliers, gaps, and multiple peaks, as these can present problems for numerical measures of spread and location. Visual representations of data are required to identify such instances to inform subsequent analyses. For example, based on their relationship to the rest of the data, outliers may be omitted or may become the focus of the analysis, a distribution with multiple peaks may be split into different distributions, and skewed data may be reexpressed. Inadequate exploration of the data distribution through visual representations can result in the use of descriptive statistics that are not characteristic of the entire set of values.

Data Visualization

Visual representations are encouraged because graphs provide parsimonious representations of data that facilitate the development of suitable mental models. Graphs display information in a way that makes it easier to detect unexpected patterns. EDA emphasizes the importance of using numerous graphical methods to see what each reveals about the data structure.

Tukey developed a number of EDA graphical tools, including the box-and-whisker plot, otherwise known as the box plot. Box plots are useful for examining data and identifying potential outliers; however, like all data summarization methods, they focus on particular aspects of the data. Therefore, other graphical methods should also be used.
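The quantities a box plot displays, the quartiles and the 1.5 × IQR fences that flag potential outliers, can be computed directly. This is a minimal sketch using only the Python standard library and a made-up sample; note that `statistics.quantiles` implements one of several common quartile conventions.

```python
import statistics

# A small made-up sample with one suspiciously large value.
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 24]

# Quartiles: with n=4, statistics.quantiles returns Q1, median, Q3
# (the default "exclusive" method).
q1, med, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Tukey's rule: the whiskers stop at the fences, and points beyond
# 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, med, q3, outliers)
```

For this sample the fences flag the value 24, which is exactly the kind of case the entry suggests either omitting or making the focus of the analysis.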
Stem-and-leaf displays provide valuable additional information because all data are retained in a frequency table, providing a sense of the distribution shape. In addition, dot plots highlight gaps or dense parts of a distribution and can identify outliers.

Tukey's emphasis on graphical data analysis has influenced statistical software programs, which now include a vast array of graphical techniques. These techniques can highlight individual values and their relative position to each other, check data distributions, and examine relationships between variables and relationships between regression lines and actual data. In addition, interactive graphics, such as linked plots, allow the researcher to select a specific case or cases in one graphical display (e.g., a scatterplot) and see the same case(s) in another display (e.g., a histogram). Such an approach could identify cases that are bivariate outliers but not outliers on either of the two variables being correlated.

Residuals

According to Tukey, the idea of data analysis is explained using the following formula: DATA = SMOOTH + ROUGH, or, more formally, DATA = FIT + RESIDUALS, based on the idea that the way in which we describe or model data is never completely accurate because there is always some discrepancy between the model and the actual data. The smooth is the underlying, simplified pattern in the data; for example, a straight line representing the relationship between two variables. However, as data never conform perfectly to the smooth, deviations from the smooth (the model) are termed the rough (the residuals).

Routine examination of residuals is one of the most influential legacies of EDA. Different models produce different patterns of residuals; consequently, examining residuals facilitates judgment of a model's adequacy and provides the means to develop better models. From an EDA perspective, the rough is just as important as the smooth and should never be ignored.

Although residual sums of squares are widely used as a measure of the discrepancy between the model and the data, relying exclusively on this measure of model fit is dangerous, as important patterns in the residuals may be overlooked. Detailed examination of the residuals reveals valuable information about a model's misspecifications. EDA thus emphasizes careful examination of residual plots for any additional patterns, such as curves or multiple modes, as these suggest that the selected model has failed to describe an important aspect of the data. In such instances, further smoothing is required to get at the underlying pattern.

EDA focuses on building models and generating hypotheses in an iterative process of model specification. The analyst must be open to alternative models, and thus the residuals of different models are examined to see if there is a better fit to the data. Thus, models are generated, tested, modified, and retested in a cyclical process that should lead the researcher, by successive approximation, toward a good description of the data. Model building and testing require heeding data at all stages of research, especially the early stages of analysis. After understanding the structure of each variable separately, pairs of variables are examined in terms of their patterns, and finally, multivariate models of data can be built iteratively. This iterative process is integral to EDA's ethos of using the data to develop and refine models.

Suitable models that describe the data adequately can then be compared to models specified by theory. Alternatively, EDA can be conducted on one subset of data to generate models, and then confirmatory techniques can be applied subsequently to test these models in another subset. Such cross-validation means that when patterns are discovered, they are considered provisional until their presence is confirmed in a different data set.

Reexpression

Real-life data are often messy, and EDA recognizes the importance of scaling data in an appropriate way so that the phenomena are represented in a meaningful manner. Such data transformation is referred to as data reexpression and can reveal additional patterns in the data. Reexpression can affect the actual numbers, the relative distances between the values, and the rank ordering of the numbers. Thus, EDA treats measurement scales as arbitrary, advocating a flexible approach to examination of data patterns.
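Reexpression of the kind described here can be sketched with a log transformation. The sample below is made up, and the simple moment-based skewness coefficient is one of several common definitions; the point is only that logging compresses a long right tail while preserving rank order.

```python
import math

# A right-skewed, made-up sample (e.g., reaction times in ms).
data = [110, 120, 130, 150, 170, 200, 260, 340, 520, 900]

def skewness(xs):
    """Moment-based skewness: mean of cubed standardized scores."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# Reexpress by taking logs: the transformation is monotonic, so the
# rank order of the values is unchanged, but the right tail shrinks.
logged = [math.log(x) for x in data]

print(skewness(data), skewness(logged))
```

Running this, the raw values are strongly right-skewed while the skewness of the logged values is noticeably reduced, which is what makes the reexpressed data closer to the symmetric, single-peaked shape that simplifies modeling.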
Reexpression may make data suitable for parametric analysis. For example, nonlinear transformations (e.g., the log transformation) can make data follow a normal distribution or can stabilize the variances. Reexpression can result in linear relationships between variables that previously had a nonlinear relationship. Making the distribution symmetrical about a single peak makes modeling of the data pattern easier.

Resistance

Resistance involves the use of methods that minimize the influence of extreme or unusual data. Different procedures may be used to increase resistance. For example, absolute numbers or rank-based summary statistics can be used to summarize information about the shape, location, and spread of a distribution instead of measures based on sums. A common example of this is the use of the median instead of the mean; however, other resistant central tendency measures can also be used, such as the trimean (the mean of the 25th percentile, the 75th percentile, and the median counted twice). Resistant measures of spread include the interquartile range and the median absolute deviation.

Resistance is also increased by giving greater weight to values that are closer to the center of the distribution. For example, a trimmed mean may be used, whereby data points beyond a specified value are excluded from the estimation of the mean. Alternatively, a Winsorized mean may be used, where the tail values of a distribution are pulled in to match those of a specified extreme score.

Outliers

Outliers present the researcher with a choice between including these extreme scores (which may result in a poor model of all the data) and excluding them (which may result in a good model that applies only to a specific subset of the original data). EDA considers why such extreme values arise. If there is evidence that outliers were produced by a different process from the one underlying the other data points, it is reasonable to exclude the outliers, as they do not reflect the phenomena under investigation. In such instances, the researcher may need to develop different models to account for the outliers. For example, outliers may reflect different subpopulations within the data set.

However, often there is no clear reason for the presence of outliers. In such instances, the impact of outliers is examined by comparing the residuals from a model based on the entire data set with one that excludes outliers. If results are consistent across the two models, then either course of action may be followed. Conversely, if substantial differences exist, then both models should be reported and the impact of the outliers needs to be considered. From an EDA perspective, the question is not how to deal with outliers but what can be learned from them. Outliers can draw attention to important aspects of the data that were not originally considered, such as unanticipated psychological processes, and provide feedback regarding model misspecification. This data-driven approach to improving models is inherent in the iterative EDA process.

Conclusion

EDA is a data-driven approach to gaining familiarity with data. It is a distinct way of thinking about data analysis, characterized by an attitude of flexibility, openness, skepticism, and creativity toward discovering patterns, avoiding errors, and developing useful models that are closely aligned with the data. Combining insights from EDA with the powerful analytical tools of CDA provides a robust approach to data analysis.

Maria M. Pertl and David Hevey

See also Box-and-Whisker Plot; Histogram; Outlier; Residual Plot; Residuals; Scatterplot

Further Readings

Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1983). Understanding robust and exploratory data analysis. New York: John Wiley.
Keel, T. G., Jarvis, J. P., & Muirhead, Y. E. (2009). An exploratory analysis of factors affecting homicide investigations: Examining the dynamics of murder clearance rates. Homicide Studies, 13, 50–68.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
EXPLORATORY FACTOR ANALYSIS

Exploratory factor analysis (EFA) is a multivariate statistical technique to model the covariance structure of the observed variables by three sets of parameters: (a) factor loadings associated with latent (i.e., unobserved) variables called factors, (b) residual variances called unique variances, and (c) factor correlations. EFA aims at explaining the relationship of many observed variables by a relatively small number of factors. Thus, EFA is considered one of the data reduction techniques. Historically, EFA dates back to Charles Spearman's work in 1904, and the theory behind EFA has been developed along with the psychological theories of intelligence, such as L. L. Thurstone's multiple factor model. Today, EFA is among the most frequently used statistical techniques by researchers in the social sciences and education.

It is well-known that EFA often gives a solution similar to principal component analysis (PCA). However, there is a fundamental difference between EFA and PCA in that factors are predictors in EFA, whereas in PCA, principal components are outcome variables created as a linear combination of observed variables. Here, an important note is that PCA is a different method from principal factor analysis (also called the principal axis method). Statistical software such as IBM® SPSS® (PASW) 18.0 (an IBM company, formerly named PASW® Statistics) supports both PCA and principal factor analysis. Another similarity exists between EFA and confirmatory factor analysis (CFA). In fact, CFA was developed as a variant of EFA. The major difference between EFA and CFA is that EFA is typically employed without prior hypotheses regarding the covariance structure, whereas CFA is employed to test prior hypotheses on the covariance structure. Often, researchers do EFA and then do CFA using a different sample. Note that CFA is a submodel of structural equation models. It is known that two-parameter item response theory (IRT) is mathematically equivalent to one-factor EFA with ordered categorical variables. EFA with binary and ordered categorical variables can also be treated as a generalized latent variable model (i.e., a generalized linear model with latent predictors).

Mathematically, EFA expresses each observed variable (xi) as a linear combination of factors (f1, f2, ..., fm) plus an error term, that is, xi = μi + λi1 f1 + λi2 f2 + ... + λim fm + ei, where m is the number of factors, μi is the population mean of xi, the λij are called the factor loadings or factor patterns, and ei contains measurement error and uniqueness. It is almost like a multiple regression model; however, the major difference from multiple regression is that in EFA, the factors are latent variables and not observed. The model for EFA is often given in matrix form:

x = μ + Λf + e,  (1)

where x, μ, and e are p-dimensional vectors, f is an m-dimensional vector of factors, and Λ is a p × m matrix of factor loadings. It is usually assumed that factors (f) and errors (e) are uncorrelated, and different error terms (ei and ej for i ≠ j) are uncorrelated. From the matrix form of the model in Equation 1, we can express the population variance-covariance matrix (covariance structure) Σ as

Σ = ΛΦΛ′ + Ψ  (2)

if factors are correlated, where Φ is an m × m correlation matrix among factors (the factor correlation matrix), Λ′ is the transpose of matrix Λ in which rows and columns of Λ are interchanged (so that Λ′ is an m × p matrix), and Ψ is a p × p diagonal matrix (all off-diagonal elements are zero due to the uncorrelated ei) of error or unique variances. When the factors (f) are not correlated, the factor correlation matrix is equal to the identity matrix (i.e., Φ = Im) and the covariance structure is reduced to

Σ = ΛΛ′ + Ψ.  (3)

For each observed variable, when factors are not correlated, we can compute the sum of squared factor loadings

hi = λi1² + λi2² + ... + λim² = Σ (j = 1 to m) λij²,  (4)

which is called the communality of the (ith) variable. When factors are correlated, the communality is calculated as

hi = Σ (j = 1 to m) λij² + Σ (j ≠ k) λij λik φjk.  (5)

When the observed variables are standardized, the ith communality gives the proportion of variability of the ith variable explained by the m factors. It is well-known that the squared multiple correlation of the ith variable on the remaining p − 1 variables gives a lower bound for the ith communality.

Estimation (Extraction)

There are three major estimation methods routinely used in EFA. Each estimation method tries to minimize a distance between the sample covariance matrix S and the model-based covariance matrix (the estimate of Σ based on the EFA model, Σ = ΛΛ′ + Ψ, because for ease of estimation, the factors are initially assumed to be uncorrelated). The first method tries to minimize the trace (i.e., the sum of diagonal elements) of (1/2)(S − Σ)² and is called either the least-squares (LS) or unweighted least-squares (ULS) method. Although LS is frequently used in multiple regression, it is not so common as an estimation method for EFA because it is not scale invariant. That is, the solution is different if we use the sample correlation […]

[…] solution converges. It obtains factor loading estimates using the eigenvalues and eigenvectors of the matrix R − Ψ, where R is the sample correlation matrix.

When the factors are uncorrelated, with an m × m orthogonal matrix T (i.e., TT′ = Im), the variance-covariance matrix Σ of the observed variables x under EFA, given as Σ = ΛΛ′ + Ψ, can be rewritten as

Σ = ΛTT′Λ′ + Ψ = (ΛT)(ΛT)′ + Ψ = Λ*Λ*′ + Ψ,  (6)

where Λ* = ΛT. This indicates that the EFA model has an identification problem called indeterminacy. That means that we need to impose at least m(m − 1)/2 constraints on the factor loading matrix in order to estimate the parameters λij uniquely. For example, in ML estimation, commonly used constraints are to let Λ′Ψ⁻¹Λ be a diagonal matrix. Rotations (to be discussed) are other ways to impose constraints on the factor loading matrix. One can also fix m(m − 1)/2 loadings in the upper
tion matrix or the sample variance-covariance triangle of Λ at zero for identification.
matrix. Consequently, the following two meth- In estimation, we sometimes encounter a prob-
ods (both of which are scale invariant) are fre- lem called the improper solution. The most fre-
quently used for parameter estimation in EFA. quently encountered improper solution associated
One of them tries to minimize the trace of with EFA is that certain estimates of unique var-
ð1=2ÞfðS ΣÞS1 g2 and is called the generalized iances in Ψ are negative. Such a phenomenon is
least-squares (GLS) method. Note that S1 is the called the Heywood case. If the improper solution
inverse (i.e., matrix version of reciprocal) of S occurs as a result of sampling fluctuations, it is not
and serves as a weight matrix here. Another scale- of much concern. However, it may be a manifesta-
invariant estimation method tries to minimize tion of model misspecification.
trace (SΣ1 Þ logðdetðSΣ1 ÞÞ p; where det is When data are not normally distributed or con-
the determinant operator and log is the natural tain outliers, better parameter estimates can be
logarithm. This method is called the maximum- obtained when the sample covariance matrix S in
likelihood (ML) method. It is known that when any of the above estimation methods is replaced
the model holds, GLS and ML give asymptotically by a robust covariance matrix. When a sample
(i.e., when sample size is very large) equivalent contains missing values, the S should be replaced
solutions. In fact, the criterion for ML can be by the maximum-likelihood estimate of the popu-
approximated by the trace of (1/2)[(S Σ)Σ1 2 , lation covariance matrix.
with almost the same function to be minimized as
GLS, the only difference being the weight matrix
Number of Factors
S1 replaced by Σ1. When the sample is normally
distributed, the ML estimates are asymptotically We need to determine the number of factors m
most efficient (i.e., when the sample size is large, such that the variance-covariance matrix of
the ML procedure leads to estimates with the observed variables is well approximated by the
smallest variances). Note that the principal factor factor model, and also m should be as small as
method frequently employed as an estimation possible. Several methods are commonly employed
method for EFA is equivalent to ULS when the to determine the number of factors.
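The principal factor idea described above can be made concrete with a small numerical sketch. The following is an illustrative (not production-grade) implementation: start the communalities at the squared multiple correlations (the lower bound noted above), eigen-decompose the reduced correlation matrix R − Ψ̂, keep the m leading factors, recompute communalities as row sums of squared loadings, and repeat. The function name and iteration count are choices of this sketch, not part of the entry.

```python
import numpy as np

def principal_factor(R, m, n_iter=100):
    """Iterated principal factor extraction (illustrative sketch).

    Places current communality estimates on the diagonal of R to form
    the reduced matrix R - Psi_hat, takes loadings from its m leading
    eigenpairs, and updates communalities as row sums of squared
    loadings, repeating until the values settle.
    """
    # initial communalities: squared multiple correlations
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)          # R - Psi_hat
        vals, vecs = np.linalg.eigh(reduced)
        top = np.argsort(vals)[::-1][:m]       # m largest eigenvalues
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
        h2 = (loadings ** 2).sum(axis=1)       # communalities
    return loadings, h2
```

For an exact one-factor correlation matrix built from loadings (0.8, 0.7, 0.6), the recovered communalities converge to (0.64, 0.49, 0.36), the squared loadings.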
Table 1  Sample Correlation Matrix for the Nine Psychological Tests (n = 145)

                 x1     x2     x3     x4     x5     x6     x7     x8     x9
x1 Visual     1.000
x2 Cubes      0.318  1.000
x3 Flags      0.468  0.230  1.000
x4 Paragraph  0.335  0.234  0.327  1.000
x5 Sentence   0.304  0.157  0.335  0.722  1.000
x6 Word       0.326  0.195  0.325  0.714  0.685  1.000
x7 Addition   0.116  0.057  0.099  0.203  0.246  0.170  1.000
x8 Counting   0.314  0.145  0.160  0.095  0.181  0.113  0.585  1.000
x9 Straight   0.489  0.239  0.327  0.309  0.345  0.280  0.408  0.512  1.000
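Before extraction, a correlation matrix like Table 1 can be entered from its lower triangle and sanity-checked. A small sketch (the checks are generic data-entry safeguards, not part of the original analysis):

```python
import numpy as np

# Lower triangle of Table 1, to be completed by symmetry
lower = [
    [1.000],
    [0.318, 1.000],
    [0.468, 0.230, 1.000],
    [0.335, 0.234, 0.327, 1.000],
    [0.304, 0.157, 0.335, 0.722, 1.000],
    [0.326, 0.195, 0.325, 0.714, 0.685, 1.000],
    [0.116, 0.057, 0.099, 0.203, 0.246, 0.170, 1.000],
    [0.314, 0.145, 0.160, 0.095, 0.181, 0.113, 0.585, 1.000],
    [0.489, 0.239, 0.327, 0.309, 0.345, 0.280, 0.408, 0.512, 1.000],
]
R = np.zeros((9, 9))
for i, row in enumerate(lower):
    R[i, :len(row)] = row
R = R + R.T - np.diag(np.diag(R))      # symmetrize

# sanity checks: unit diagonal and positive definiteness
assert np.allclose(np.diag(R), 1.0)
assert np.linalg.eigvalsh(R).min() > 0
```

A failed positive-definiteness check usually signals a transcription error, which is worth catching before any extraction method is applied to R.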
Table 5 Promax Rotation: Factor Pattern Matrix (and standard error in parentheses)
Factor 1 Factor 2 Factor 3 Communality
x1 0.027 (0.041) 0.875 (0.102) 0.022 (0.064) 0.732
x2 0.068 (0.094) 0.359 (0.109) 0.011 (0.088) 0.151
x3 0.169 (0.093) 0.494 (0.107) 0.044 (0.078) 0.325
x4 0.852 (0.044) 0.061 (0.054) 0.032 (0.044) 0.759
x5 0.809 (0.046) 0.007 (0.056) 0.086 (0.050) 0.702
x6 0.794 (0.047) 0.074 (0.060) 0.032 (0.049) 0.672
x7 0.121 (0.054) 0.193 (0.060) 0.785 (0.083) 0.580
x8 0.131 (0.043) 0.126 (0.086) 0.803 (0.087) 0.688
x9 0.065 (0.070) 0.388 (0.107) 0.440 (0.092) 0.517
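When factors are correlated, as in the promax solution of Table 5, the communality combines squared pattern loadings with cross-products weighted by the factor correlations φ_jk; equivalently, it is the ith diagonal element of ΛΦΛ′. A minimal sketch (the pattern and factor correlation values below are made up for illustration, not the estimates behind Table 5):

```python
import numpy as np

def communalities(pattern, phi):
    """Communality of each variable under correlated factors:
    h_i = sum_j lam_ij**2 + sum_{j != k} lam_ij * lam_ik * phi_jk,
    i.e., the i-th diagonal element of (Lambda Phi Lambda')."""
    return np.diag(pattern @ phi @ pattern.T)

# Hypothetical two-factor pattern matrix and factor correlation matrix
Lam = np.array([[0.8, 0.1],
                [0.2, 0.7]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])
h = communalities(Lam, Phi)
# With Phi = I, the same function reduces to row sums of squared loadings
```

Setting the factor correlation matrix to the identity recovers the orthogonal-factor special case, so one function covers both situations.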
Components Analysis; Structural Equation Modeling

Further Readings

Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Harman, H. H. (1976). Modern factor analysis (3rd ed.). Chicago: University of Chicago Press.
Hatcher, L. (1994). A step-by-step approach to using the SAS system for factor analysis and structural equation modeling. Cary, NC: SAS Institute.
Holzinger, K. J., & Swineford, F. A. (1939). A study in factor analysis: The stability of a bifactor solution (Supplementary Educational Monograph 48). Chicago: University of Chicago Press.
Hoyle, R. H., & Duvall, J. L. (2004). Determining the number of factors in exploratory and confirmatory factor analysis. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 301–315). Thousand Oaks, CA: Sage.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis. Thousand Oaks, CA: Sage.
Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201–293.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Yanai, H., & Ichikawa, M. (2007). Factor analysis. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 257–296). Amsterdam: Elsevier.
Yuan, K.-H., Marshall, L. L., & Bentler, P. M. (2002). A unified approach to exploratory factor analysis with missing data, nonnormal data, and in the presence of outliers. Psychometrika, 67, 95–122.

EX POST FACTO STUDY

Ex post facto study, or after-the-fact research, is a category of research design in which the investigation starts after the fact has occurred, without interference from the researcher. The majority of social research, in contexts in which it is not possible or acceptable to manipulate the characteristics of human participants, is based on ex post facto research designs. It is also often applied as a substitute for true experimental research to test hypotheses about cause-and-effect relationships, or in situations in which it is not practical or ethically acceptable to apply the full protocol of a true experimental design. Despite studying facts that have already occurred, ex post facto research shares with experimental research design some of its basic logic of inquiry.

Ex post facto research design does not include any form of manipulation or measurement before the fact occurs, as is the case in true experimental designs. It starts with the observation and examination of facts that took place naturally, in the sense that the researcher did not interfere, followed afterward by the exploration of the causes behind the evidence selected for analysis. The researcher takes the dependent variable (the fact or effect) and examines it retrospectively in order to identify possible causes and relationships between the dependent variable and one or more independent variables. After the deconstruction of the causal process responsible for the facts observed and selected for analysis, the researcher can eventually adopt a prospective approach, monitoring what happens after that.

Contrary to true experimental research, ex post facto research design looks first to the effects (dependent variable) and tries afterward to determine the causes (independent variable). In other words, unlike experimental research designs, the independent variable has already been applied when the study is carried out and, for that reason, is not manipulated by the researcher. In ex post facto research, the control of the independent variables is achieved through statistical analysis rather than by control and experimental groups, as is the case in experimental designs. This lack of direct control of the independent variable and the nonrandom selection of participants are the most important differences between ex post facto research and the true experimental research design.

Ex post facto research design has strengths that make it the most appropriate research plan in numerous circumstances; for instance, when it is not possible to apply a more robust and rigorous research design because the phenomenon occurred naturally; or it is not practical to
manipulate the independent variables; or the control of independent variables is unrealistic; or when such manipulation of human participants is ethically unacceptable (e.g., delinquency, illnesses, road accidents, suicide). Instead of exposing human subjects to certain experiments or treatments, it is more reasonable to explore the possible causes after the fact or event has occurred, as is the case in most issues researched in anthropology, geography, sociology, and other social sciences. It is also a suitable research design for an exploratory investigation of cause-effect relationships or for the identification of hypotheses that can later be tested through true experimental research designs.

It has a number of weaknesses or shortcomings as well. From the point of view of its internal validity, the two main weak points are the lack of control of the independent variables and the nonrandom selection of participants or subjects. For example, its capacity to assess confounding errors (e.g., errors due to history, social interaction, maturation, instrumentation, selection bias, mortality) is unsatisfactory in numerous cases. As a consequence, the researcher may not be sure that all independent variables that caused the facts observed were included in the analysis, or whether the facts observed would not have resulted from other causes in different circumstances, or whether that particular situation is or is not a case of reverse causation. It is also open to discussion whether the researcher will be able to find out if the independent variable made a significant difference in the facts observed, contrary to the true experimental research design, in which it is possible to establish whether the independent variable is the cause of a given fact or event. Therefore, from the point of view of its internal validity, ex post facto research design is less persuasive in determining causality than true experimental research designs. Nevertheless, if there is empirical evidence flowing from numerous case studies pointing to the existence of a causal relationship, statistically tested, between the independent and dependent variables selected by the researcher, it can be considered sound evidence in support of the existence of a causal relationship between these variables. It also has a number of weaknesses from the point of view of its external validity, when samples are not randomly selected (e.g., nonprobabilistic samples: convenience samples, snowball samples), which limits the possibility of statistical inference. For that reason, findings in ex post facto research design cannot, in numerous cases, be generalized or looked upon as being statistically representative of the population.

In sum, ex post facto research design is widely used in social as well as behavioral and biomedical sciences. It has strong points that make it the most appropriate research design in a number of circumstances, as well as limitations that make it weak from the point of view of its internal and external validity. It is often the best research design that can be used in a specific context, but it should be applied only when a more powerful research design cannot be employed.

Carlos Nunes Silva

See also Cause and Effect; Control Group; Experimental Design; External Validity; Internal Validity; Nonexperimental Designs; Pre-Experimental Design; Quasi-Experimental Design; Research Design Principles

Further Readings

Bernard, H. R. (1994). Research methods in anthropology: Qualitative and quantitative approaches (2nd ed.). Thousand Oaks, CA: Sage.
Black, T. R. (1999). Doing quantitative research in the social sciences: An integrated approach to research design, measurement and statistics. Thousand Oaks, CA: Sage.
Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education. London: Routledge.
Engel, R. J., & Schutt, R. K. (2005). The practice of research in social work. Thousand Oaks, CA: Sage.
Ethridge, M. E. (2002). The political research experience: Readings and analysis (3rd ed.). Armonk, NY: M. E. Sharpe.

EXTERNAL VALIDITY

When an investigator wants to generalize results from a research study to a wide group of people (or a population), he or she is concerned with external validity. A set of results or conclusions
from a research study that possesses external validity can be generalized to a broader group of individuals than those originally included in the study. External validity is relevant to the topic of research methods because scientific and scholarly investigations are normally conducted with an interest in generalizing findings to a larger population of individuals, so that the findings can be of benefit to many and not just a few. In the next three sections, the kinds of generalizations associated with external validity are introduced, the threats to external validity are outlined, and the methods to increase the external validity of a research investigation are discussed.

Two Kinds of Generalizations

Two kinds of generalizations are often of interest to researchers of scientific and scholarly investigations: (a) generalizing research findings to a specific or target population, setting, and time frame; and (b) generalizing findings across populations, settings, and time frames. An example is provided to illustrate the difference between the two kinds.

Imagine a new herbal supplement is introduced that is aimed at reducing anxiety in 25-year-old women in the United States. Suppose that a random sample of all 25-year-old women has been drawn that provides a nationally representative sample within known limits of sampling error. Imagine now that the women are randomly assigned to two conditions: one where the women consume the herbal supplement as prescribed, and the other a control group where the women unknowingly consume a sugar pill. The two conditions or groups are equivalent in terms of their representativeness of 25-year-old women. Suppose that after data analysis, the group that consumed the herbal supplement demonstrated lower anxiety than the control group, as measured by a paper-and-pencil questionnaire. The investigator can generalize this finding to the average 25-year-old woman in the United States, that is, the target population of the study. Note that this finding can be generalized to the average 25-year-old woman despite possible variations in how differently women in the experimental group reacted to the supplement. For example, a closer analysis of the data might reveal that women in the experimental group who exercised regularly reduced their anxiety more than women who did not; in fact, a closer analysis might reveal that only those women who exercised regularly in addition to taking the supplement reduced their anxiety. In other words, closer data analysis could reveal that the findings do not generalize across all subpopulations of 25-year-old women (e.g., those who do not exercise) even though they do generalize to the overall target population of 25-year-old women.

The distinction between these two kinds of generalizations is useful because generalizing to specific populations is surprisingly more difficult than generalizing across populations: the former typically requires large-scale studies where participants have been selected using formal random sampling procedures. This is rarely achieved in field research, where large-scale studies pose challenges for administering treatment interventions and for high-quality measurement, and participant attrition is liable to occur systematically. Instead, the more common practice is to generalize findings from smaller studies, each with its own sample of convenience or accidental sampling (i.e., a sample that is accrued expediently for the purpose of the research but provides no guarantee that it formally represents a specific target population), across the populations, settings, and time frames associated with the smaller studies. It needs to be noted that individuals in samples of convenience may belong to the target population to which one wishes to generalize findings; however, without formal random sampling, the representativeness of the sample is questionable. According to Thomas Cook and Donald Campbell, an argument can be made for strengthening external validity by means of a greater number of smaller studies with samples of convenience than by a single large study with an initially representative sample. Given the frequency of generalizations across populations, settings, and time frames in relation to target populations, the next section reviews the threats to external validity claims associated with this type of generalization.

Threats to External Validity

To be able to generalize research findings across populations, settings, and time frames, the investigator needs to have evidence that the research findings are not unique to a single population,
but rather apply to more than one population. One source for this type of evidence comes from examining statistical interactions between variables of interest. For example, in the course of data analysis, an investigator might find that consuming an herbal supplement (the experimental treatment) statistically interacts with the activity level of the women participating in the study, such that women who exercise regularly benefit more from the anxiety-reducing effects of the supplement relative to women who do not exercise regularly. What this interaction indicates is that the positive effects of the herbal supplement cannot be generalized equally to all subpopulations of 25-year-old women. The presence of a statistical interaction means that the effect of the variable of interest (i.e., consuming the herbal supplement) changes across levels of another variable (i.e., activity levels of 25-year-old women). In order to generalize the effects of the herbal supplement across subpopulations of 25-year-old women, a statistical interaction cannot be observed between the two variables of interest. Many interactions can threaten the external validity of a study. These are outlined as follows.

Participant Selection and Treatment Interaction

To generalize research findings across populations of interest, it is necessary to recruit participants in an unbiased manner. For example, when recruiting female participants to take part in an herbal supplement study, if the investigator advertises the study predominantly in health food stores and obtains the bulk of participants from this location, then the research findings may not generalize to women who do not visit health food stores. In other words, there may be something unique to those women who visit health food stores and decide to volunteer in the study that may make them more disposed to the effects of a health supplement. To counteract this potential bias, the investigator could systematically advertise the study in other kinds of food stores to test whether the selection of participants from different locations interacts with the treatment. If the statistical interaction is absent, then the investigator can be confident that the research findings are not exclusive to those women who visit health food stores and, possibly, are more susceptible to the effects of an herbal supplement than other women. Thus, recruiting participants from a variety of locations and making participation as convenient as possible should be undertaken.

Setting and Treatment Interaction

Just as the selection of participants can interact with the treatment, so can the setting in which the study takes place. This type of interaction is more applicable to research studies where participants experience an intervention that could plausibly change in effect depending on the context, such as in educational research or organizational psychological investigations. However, to continue with the health supplement example, suppose the investigator requires the participants to consume the health supplement in a laboratory and not in their homes. Imagine that the health supplement produces better results when the participant ingests it at home and worse results when the participant ingests it in a laboratory setting. If the investigator varies the settings in the study, it is possible to test the statistical interaction between the setting in which the supplement is ingested and the herbal supplement treatment. Again, the absence of a statistical interaction between the setting and the treatment variable would indicate that the research findings can be generalized across the two settings; the presence of an interaction would indicate that the findings cannot be generalized across the settings.

History and Treatment Interaction

In some cases, the historical time in which the treatment occurs is unique and could contribute to either the presence or absence of a treatment effect. This is a potential problem because it means that whatever effect was observed cannot be generalized to other time frames. For example, suppose that the herbal supplement is taken by women during a week in which the media covers several high-profile optimistic stories about women. It is reasonable for an investigator to inquire whether the positive results of taking an herbal supplement would have been obtained during a less eventful week. One way to test for the interaction between
historical occurrences and treatment is to administer the study at different time frames and to replicate the results of the study.

Methods to Increase External Validity

If one wishes to generalize research findings to target populations, it is appropriate to outline a sampling frame and select instances so that the sample is representative of the population to which one wishes to generalize, within known limits of sampling error. Procedures for how to do this can be found in textbooks on sampling theory. Often, the most representative samples will be those that have been selected randomly from the population of interest. This method of random sampling for representativeness requires considerable resources and is often associated with large-scale studies. After participants have been randomly selected from the population, participants can then be randomly assigned to experimental groups.

Another method for increasing external validity involves sampling for heterogeneity. This method requires explicitly defining target categories of persons, settings, and time frames to ensure that a broad range of instances from within each category is represented in the design of the study. For example, an educational researcher interested in testing the effects of a mathematics intervention might design the study to include boys and girls from both public and private schools located in small rural towns and large metropolitan cities. The objective would then be to test whether the intervention has the same effect in all categories (e.g., whether the mathematics intervention leads to the same effect in boys and girls, public and private schools, and rural and metropolitan areas). Testing for the effect in each of the categories requires a sufficiently large sample size in each of the categories. Deliberate sampling for heterogeneity does not require random sampling at any stage in the design, so it is usually viable to implement in cases where investigators are limited by resources and in their access to participants. However, deliberate sampling does not allow one to generalize from the sample to any formally specified population. What deliberate sampling does allow one to conclude is that an effect has or has not been obtained within a specific range of categories of persons, settings, and times. In other words, one can claim that "in at least one sample of boys and girls, the mathematics intervention had the effect of increasing test scores."

There are other methods to increase external validity, such as the impressionistic modal instance model, where the investigator samples purposively for specific types of instances. Using this method, the investigator specifies the category of person, setting, or time to which he or she wants to generalize and then selects an instance of each category that is impressionistically similar to the category mode. This method of selecting instances is most often used in consulting or project evaluation work where broad generalizations are not required. The most powerful method for generalizing research findings, especially if the generalization is to a target population, is random sampling for representativeness. The next most powerful method is deliberate sampling for heterogeneity, with the method of impressionistic modal instance being the least powerful. The power of the method decreases as the natural assortment of individuals in the sample dwindles. However, practical concerns may prevent an investigator from using the most powerful method.

Jacqueline P. Leighton

See also Inference: Deductive and Inductive; Interaction; Research Design Principles; Sampling; Theory

Further Readings

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Tebes, J. K. (2000). External validity and scientific psychology. American Psychologist, 55(12), 1508–1509.
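The checks discussed in this entry can be sketched in code: a treatment-by-moderator interaction is the coefficient on a product term in a regression model, and deliberate sampling for heterogeneity leads to estimating the treatment effect within every predefined category. The sketch below uses plain least squares with hypothetical variable names; it illustrates the logic only and is not a substitute for a proper significance test of the interaction.

```python
import numpy as np

def interaction_coef(y, treatment, moderator):
    """OLS fit of y ~ 1 + treatment + moderator + treatment:moderator;
    returns the interaction coefficient. A value near zero is consistent
    with generalizing the treatment effect across moderator levels."""
    X = np.column_stack([np.ones(len(y)), treatment, moderator,
                         treatment * moderator])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]

def effects_by_category(y, treated, categories):
    """Mean treatment-control difference within each category cell, as
    when testing an effect separately in each deliberately sampled
    category (treated must be a boolean array)."""
    return {c: y[(categories == c) & treated].mean()
               - y[(categories == c) & ~treated].mean()
            for c in np.unique(categories)}
```

In the supplement example, `interaction_coef(anxiety_drop, supplement, exercises)` would estimate how much the supplement's effect differs between exercisers and non-exercisers, while `effects_by_category` mirrors the per-cell comparisons (boys/girls, public/private, rural/metropolitan) of the mathematics-intervention example.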
F
FACE VALIDITY

Face validity is a test of internal validity. As the name implies, it asks a very simple question: "On the face of things, do the investigators reach the correct conclusions?" It requires investigators to step outside of their current research context and assess their observations from a commonsense perspective. A typical application of face validity occurs when researchers obtain assessments from current or future individuals who will be directly affected by programs premised on their research findings. An example of testing for face validity is the assessment of a proposed new patient tracking system by obtaining observations from local community health care providers who will be responsible for implementing the program, and getting feedback on how they think the new program may work in their centers.

What follows is a brief discussion of how face validity fits within the overall context of validity tests. Afterward, the history of face validity is reviewed: early criticisms of face validity are addressed that set the stage for how and why the test returned as a valued assessment. The discussion concludes with some recent applications of the test.

The Validity of Face Validity

To better understand the value and application of face validity, it is necessary to first set the stage for what validity is. Validity is commonly defined as a question: "To what extent do the research conclusions provide the correct answer?" In testing the validity of research conclusions, one looks at the relationship of the purpose and context of the research project to the research conclusions. Validity is determined by testing research observations against what is already known in the world, giving the phenomenon that researchers are analyzing the chance to prove them wrong. All tests of validity are context specific and are not an absolute assessment. Tests of validity are divided into two broad realms: external validity and internal validity. Questions of external validity look at the generalizability of research conclusions; in this case, observations generated in a research project are assessed on their relevance to other, similar situations. Face validity falls within the realm of internal validity assessments. A test of internal validity asks if the researcher draws the correct conclusion based on the available data. These types of assessments look into the nuts and bolts of an investigation (for example, looking for sampling error or researcher bias) to see if the research project was legitimate.

History of Face Validity

For all of its simplicity, the test for face validity has had a dramatic past; only recently has it re-emerged as a valued and respected test of validity. In its early applications, face validity was used by researchers as
a first-step assessment, in concert with other tests, to assess the validity of an analysis. During the 1940s and 1950s, face validity was used by psychologists in the early stages of developing tests for selecting industrial and military personnel. It was soon widely used by many different types of researchers in different types of investigations, resulting in confusion about what actually constituted face validity. Quickly, the confusion over the relevance of face validity gave way to its rejection by researchers in the 1960s, who took to new and more complex tests of validity.

Early Debate Surrounding Face Validity

Discussions surrounding face validity were revived in 1985 by Baruch Nevo's seminal article "Face Validity Revisited," which focused on clearing up some of the confusion surrounding the test and challenging researchers to take another, more serious look at face validity's applications. Building on Nevo's research, three questions can be distinguished in the research validity literature that have temporarily prevented face validity from becoming established as a legitimate test of validity (see Table 1).

The first question regarding face validity concerns the legitimacy of the test itself. Detractors argue that face validity is insignificant because its observations are not based on any verifiable testing procedure and yield only rudimentary observations about a study; face validity does not require a systematic method for obtaining its observations. They conclude that the only use for face validity observations is for public relations statements.

Advocates counter that face validity provides researchers with the opportunity for commonsense testing of research results: "After the investigation is completed and all the tests of validity and reliability are done, does this study make sense?" Here, tests of face validity allow investigators a new way to look at their conclusions to make sure they see the forest for the trees, with the forest being common sense and the trees being all of the different tests of validity used in documenting the veracity of their study.
The second question confuses the value of face validity by blurring the applications of face validity with content validity. The logic here is that both tests of validity are concerned with content and the representativeness of the study. Content validity is the extent to which the items identified in the study reflect the domain of the concept being measured. Because content validity and face validity both look at the degree to which the intended range of meanings in the concepts of the study appear to be covered, once a study has content validity, it will automatically have face validity. After testing for content validity, there is no real need to test for face validity.

The other side to this observation is that content validity should not be confused with face validity because they are completely different tests. The two tests of validity are looking at different parts of the research project. Content validity is concerned with the relevance of the identified research variables within a proposed research project, whereas face validity is concerned with the relevance of the overall completed study. Face validity looks at the overall commonsense assessment of a study. In addition to the differences between the two tests of validity in terms of what they assess, other researchers have identified a sequential distinction between content validity and face validity. Content validity is a test that should be conducted before the data-gathering stage of the research project is started, whereas face validity should be applied after the investigation is carried out. The sequential application of the two tests is intuitively logical because content validity focuses on the appropriateness of the identified research items before the investigation has started, whereas face validity is concerned with the overall relevance of the research findings after the study has been completed.

The third question surrounding face validity is a procedural one: Who is qualified to provide face validity observations—experts or laypersons? Proponents of the "experts-only" approach to face validity believe that experts who have substantive knowledge about a research topic and a good technical understanding of tests of validity provide constructive insights from outside of the research project. In this application of face validity, experts provide observations that can help in the development and/or fine-tuning of research projects. Laypersons lack technical research skills and can provide only impressionistic face validity observations, which are of little use to investigators.

Most researchers now see that the use of experts in face validity assessments is more accurately understood as a test of content validity, because experts provide their observations at the start or middle of a research project, whereas face validity focuses on assessing the relevance of research conclusions. Again, content validity should be understood sequentially in relation to face validity: the former is used to garner expert observations on the relevance of research variables in the earlier parts of the investigation, and face validity observations should come from laypersons for their commonsense assessment at the completion of the research project.

The large-scale vista that defines face validity, and that defines the contribution this assessment provides to the research community, also provides its Achilles' heel. Face validity lacks the depth, precision, and rigor of inquiry that comes with both internal and external validity tests. For example, in assessing the external validity of a survey research project, one can precisely look at the study's sample size to determine whether it has a representative sample of the population. The only question face validity has for a survey research project is a simple one: "Does the study make sense?" For this reason, face validity can never be a stand-alone test of validity.

The Re-Emergence of Face Validity

The renewed interest in face validity is part of the growing research practice of integrating laypersons' nontechnical, one-of-a-kind insights into the evaluation of applied research projects. Commonly known as obtaining an emic viewpoint, testing for face validity provides the investigator the opportunity to learn what many different people affected by a proposed program already know about a particular topic. The goal in this application of face validity is to include the experiential perspectives of people affected by research projects in their assessment of what causes events to happen, what the effects of the study in the community may be, and what specific words or events mean in the community.
474 Factorial Design

… a researcher will use a four-factor design, but these situations are extremely rare. When a study incorporates a large number of factors, other designs are considered, such as regression.

Another way to identify factorial designs is by the number of levels for each factor. The simplest design is a 2 × 2, which represents two factors, both of them having two levels. A 3 × 4 design also has two factors, but one factor has three levels (e.g., type of reward: none, food, money) and the other factor has four levels (e.g., age: 6–8 years, 9–11 years, 12–14 years, 15–16 years). A 2 × 2 × 3 design has three factors; for example, gender (2 levels: male, female), instructional method (2 levels: traditional, computer-based), and ability (3 levels: low, average, high).

In a factorial design, each level of a factor is paired with each level of another factor. As such, the design includes all combinations of the factors' levels, and a unique subset of participants is in each combination. Using the 3 × 4 example in the previous paragraph, there are 12 cells or subsets of participants. If a total of 360 participants were included in the study and group sample sizes were equal, then 30 young children (ages 6 to 8) would receive no reward for completing a task, a different set of 30 young children (ages 6 to 8) would receive food for completing a task, and yet a different set of 30 young children (ages 6 to 8) would receive money for completing a task. Similarly, unique sets of 30 children would be found in the 9–11, 12–14, and 15–16 age ranges.

This characteristic separates factorial designs from other designs that also involve categorical independent variables and continuous dependent variables. For instance, a repeated measures design requires the same participant to be included in more than one level of an independent variable. If the 3 × 4 example was changed to a repeated measures design, then each participant would be exposed to tasks involving the three different types of rewards: none, food, and money.

Advantages

Factorial designs have several advantages. First, they allow for a broader interpretation of results. If a single-factor design was used to examine treatments, the researcher could generalize results only to the characteristics of the particular group of participants chosen, whereas if one or two additional factors, such as gender or age, are included in the design, then the researcher can examine differences between these specific subsets of participants. Another advantage is that the simultaneous effect of the factors operating together can be tested. By examining the interaction between treatment and age, the researcher can determine whether the effect of treatment is dependent on age. The youngest participant group may show higher scores when receiving Treatment A, whereas the oldest participant group may show higher scores when receiving Treatment B.

A third advantage of factorial designs is that they are more parsimonious, efficient, and powerful than an examination of each factor in a separate analysis. The principle of parsimony refers to conducting one analysis to answer all questions rather than multiple analyses. Efficiency is a related principle. Using the most efficient design is desirable, meaning the one that produces the most precise estimate of the parameters with the least amount of sampling error. When additional factors are added to a design, the error term can be greatly reduced. A reduction of error also leads to more powerful statistical tests. A factorial design requires fewer participants in order to achieve the same degree of power as in a single-factor design.

Analysis and Interpretation

The statistical technique used for answering questions from a factorial design is the analysis of variance (ANOVA). A factorial ANOVA is an extension of a one-factor ANOVA. A one-factor ANOVA involves one independent variable and one dependent variable. The F-test statistic is used to test the null hypothesis of equality of group means. If the dependent variable is reaction time and the independent variable has three groups, then the null hypothesis states that the mean reaction times for Groups 1, 2, and 3 are equal. If the F test leads to rejection of the null hypothesis, then the alternative hypothesis is that at least one pair of the group mean reaction times is not equal. Follow-up analyses are necessary to determine which pair or pairs of means are unequal.

Additional null hypotheses are tested in a factorial ANOVA. For a two-factor ANOVA, there are three null hypotheses. Two of them assess main
effects, that is, the independent effect of each independent variable on the dependent variable. A third hypothesis assesses the interaction effect of the two independent variables on the dependent variable. For a three-factor ANOVA, there are seven null hypotheses: (a) three main-effect hypotheses, one for each independent variable; (b) three two-factor interaction hypotheses, one for each unique pair of independent variables; and (c) one three-factor interaction hypothesis that examines whether a two-factor interaction is generalizable across levels of the third factor. It is important to note that each factorial ANOVA examines only one dependent variable. Therefore, it is called a univariate ANOVA. When more than one dependent variable is included in a single procedure, a multivariate ANOVA is used.

Model Assumptions

Similar to one-factor designs, there are three model assumptions for a factorial analysis: normality, homogeneity of variance, and independence. First, values on the dependent variable within each population group must be normally distributed around the mean. Second, the population variances associated with each group in the study are assumed to be equal. Third, one participant's value on the dependent variable should not be influenced by any other participant in the study. Although not an assumption per se, another requirement of factorial designs is that each subsample should be a random subset from the population. Prior to conducting statistical analysis, researchers should evaluate each assumption. If assumptions are violated, the researcher can either (a) give evidence that the inferential tests are robust and the probability statements remain valid or (b) account for the violation by transforming variables, using statistics that adjust for the violation, or using nonparametric alternatives.

Example of Math Instruction Study

The following several paragraphs illustrate the application of factorial ANOVA for an experiment in which a researcher wants to compare the effects of two methods of math instruction on math comprehension. Method A involves a problem-solving and reasoning approach, and Method B involves a more traditional approach that focuses on computation and procedures. The researcher also wants to determine whether the methods lead to different levels of math comprehension for male versus female students. This is an example of a 2 × 2 factorial design. There are two independent variables: method and gender. Each independent variable has two groups or levels. The dependent variable is represented by scores on a mathematics comprehension assessment. Total sample size for the study is 120, and there are 30 students in each combination of gender and method.

Matrix of Sample Means

Before conducting the ANOVA, the researcher examined a matrix of sample means. There are three types of means in a factorial design: cell means, marginal means, and an overall (or grand) mean. In a 2 × 2 design, there are four cell means, one for each unique subset of participants. Table 1 shows that the 30 male students who received Method A had an average math comprehension score of 55. Males in Method B had an average score of 40. Females' scores were lower but had the same pattern across methods as males' scores. The 30 female students in Method A had an average score of 50, whereas the females in Method B had a score of 35.

The second set of means is called marginal means. These means represent the means for all students in one group of one independent variable. Gender marginal means represent the mean of all 60 males (47.5) regardless of which method they
received, and likewise the mean of all 60 females (42.5). Method marginal means represent the mean of all 60 students who received Method A (52.5) regardless of gender, and the mean of the 60 students who received Method B (37.5). Finally, the overall mean is the average score for all 120 students (45.0) regardless of gender or method.

The F Statistic

An F-test statistic determines whether each of the three null hypotheses in the two-factor ANOVA should be rejected or not rejected. The concept of the F statistic is similar to that of the t statistic for testing the significance of two group means. It is a ratio of two values. The numerator of the F ratio is the variance that can be attributed to the observed differences between the group means. The denominator is the amount of variance that is "left over," that is, the amount of variance due to differences among participants within groups (or error). Therefore, the F statistic is a ratio between two variances—variance attributable "between" groups and variance attributable "within" groups. Is the between-groups variance larger than the within-groups variance? The larger it is, the larger the F statistic. The larger the F statistic, the more likely it is that the null hypothesis will be rejected. The observed F (calculated from the data) is compared to a critical F at a certain set of degrees of freedom and significance level. If the observed F is larger than the critical F, then the null hypothesis is rejected.

Partitioning of Variance

Table 2 shows results from the two-factor ANOVA conducted on the math instruction study. An ANOVA summary table is produced by statistical software programs and is often presented in research reports. Each row identifies a portion of the variation. Within rows, there are several elements: sum of squares (SS), degrees of freedom (df), mean square (MS), F statistic, and significance level (p).

Table 2   ANOVA Summary Table for the Mathematics Instruction Study

  Source                 SS     df       MS       F        p
  Method             6157.9      1   6157.9   235.8   < .001
  Gender              879.0      1    879.0    33.7   < .001
  Method × Gender       1.9      1      1.9     0.1     .790
  Within (error)     3029.3    116     26.1
  Total             10068.2    119

The last row in the table represents the total variation in the data set. SS(total) is obtained by determining the deviation between each individual raw score and the overall mean, squaring the deviations, and obtaining the sum. The other rows partition this total variation into four components. Three rows represent between variation, and one represents error variation. The first row in Table 2 shows the between source of variation due to method. To obtain SS(method), each method mean is subtracted from the overall mean, the deviations are squared, multiplied by the group sample size, and then summed. The degrees of freedom for method is the number of groups minus 1. For the between source of variation due to gender, the sum of squares is found in a similar way by subtracting the gender group means from the overall mean. The third row is the between source of variation accounted for by the interaction between method and gender. Its sum of squares is based on the individual cell effects: each cell mean minus its method and gender marginal means plus the overall mean. The degrees of freedom are the product of the method and gender degrees of freedom. The fourth row represents the remaining unexplained variation not accounted for by the two main effects and the interaction effect. The sum of squares for this error variation is obtained by finding the deviation between each individual raw score and the mean of the subgroup to which it belongs, squaring that deviation, and then summing all squared deviations. Degrees of freedom for the between and within sources of variation add up to df(total), which is the total number of individuals minus 1.

As mentioned earlier, a mean square represents variance. The mean square is calculated in the same way as the variance for any set of data. Therefore, the mean square in each row of Table 2 is the ratio between SS and df. Next, in order to make the decision about rejecting or not rejecting each null hypothesis, the F ratio is calculated. Because the F statistic is the ratio of between to within variance, it is simply obtained by dividing the mean square for each between source by the mean square for error. Finally, the p values in Table 2 represent the significance level for each null hypothesis tested.
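The partitioning described above can be checked numerically. The sketch below uses plain Python and a tiny hypothetical data set (invented for illustration, since the study's raw scores are not reported); it computes each sum of squares exactly as the verbal recipes state and confirms that the between and within components add up to SS(total):

```python
from itertools import product
from statistics import mean

# Hypothetical raw scores for a 2 x 2 design (factor 0: method, factor 1: gender),
# n = 3 per cell; tiny numbers chosen only to make the arithmetic visible.
data = {
    ("A", "male"):   [54, 56, 55],
    ("A", "female"): [49, 51, 50],
    ("B", "male"):   [39, 41, 40],
    ("B", "female"): [34, 36, 35],
}

scores = [x for cell in data.values() for x in cell]
grand = mean(scores)

def level_mean(factor_index, level):
    # Marginal mean: average of every score in a cell matching `level`.
    return mean(x for key, cell in data.items()
                  if key[factor_index] == level for x in cell)

n_cell = 3
methods, genders = ("A", "B"), ("male", "female")

# Between sums of squares: square each (mean - grand mean) deviation and
# weight it by the number of scores contributing to that mean.
ss_method = sum(2 * n_cell * (level_mean(0, m) - grand) ** 2 for m in methods)
ss_gender = sum(2 * n_cell * (level_mean(1, g) - grand) ** 2 for g in genders)
# Interaction: each cell mean minus its two marginal means plus the grand mean.
ss_inter = sum(
    n_cell * (mean(data[m, g]) - level_mean(0, m) - level_mean(1, g) + grand) ** 2
    for m, g in product(methods, genders))
# Within (error): deviations of raw scores from their own cell mean.
ss_within = sum((x - mean(cell)) ** 2 for cell in data.values() for x in cell)
ss_total = sum((x - grand) ** 2 for x in scores)

# The four components partition the total variation exactly.
assert abs(ss_total - (ss_method + ss_gender + ss_inter + ss_within)) < 1e-9

df_within = len(scores) - 4          # N minus the number of cells
f_method = (ss_method / 1) / (ss_within / df_within)
```

With these numbers, SS(method) = 675, SS(gender) = 75, SS(interaction) = 0, SS(within) = 8, and SS(total) = 758 = 675 + 75 + 0 + 8, mirroring the structure of Table 2.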
… the null hypothesis is rejected, F(1, 116) = 33.7, p < .001. The mean score for all males (47.5) is higher than the mean score for all females (42.5), regardless of method. Finally, results show no significant interaction between method and gender, F(1, 116) = 0.1, p = .790, meaning that the difference in mean math scores for Method A versus Method B is the same for males and females. For both genders, the problem-solving approach produced a higher mean score than the traditional approach. Figure 1 shows the four cell means plotted on a graph. The lines are parallel, indicating no interaction between method and gender.

[Figure 1: The four cell means (math comprehension score, vertical axis from 30 to 60) plotted by gender (males, females), with separate lines for Method A and Method B; the two lines are parallel.]
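Because the design is balanced (30 students per cell), the marginal and grand means reported above follow directly from the four cell means in Table 1, and the parallel lines in Figure 1 correspond to a zero interaction contrast. A quick standard-library check:

```python
from statistics import mean

# Cell means from Table 1 of the entry: (gender, method) -> mean score.
cell = {("male", "A"): 55, ("male", "B"): 40,
        ("female", "A"): 50, ("female", "B"): 35}

# Marginal means (valid as simple averages because cell sizes are equal).
male = mean([cell["male", "A"], cell["male", "B"]])        # 47.5
female = mean([cell["female", "A"], cell["female", "B"]])  # 42.5
method_a = mean([cell["male", "A"], cell["female", "A"]])  # 52.5
method_b = mean([cell["male", "B"], cell["female", "B"]])  # 37.5
grand = mean(cell.values())                                # 45.0

# Parallel lines in Figure 1 mean the method effect is the same for both
# genders, i.e., the interaction contrast is zero.
interaction = (cell["male", "A"] - cell["male", "B"]) - \
              (cell["female", "A"] - cell["female", "B"])
assert interaction == 0
```

Here the contrast (55 − 40) − (50 − 35) is exactly zero; the small nonzero interaction SS in Table 2 presumably reflects rounding of the reported means.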
… means for younger and older participants. Additional tests would compare the mean age differences for Conditions A and B, Conditions A and C, and so on.

Effect Sizes

A final note about factorial designs concerns the practical significance of the results. Every research study involving factorial analysis of variance should include a measure of effect size. Tables are available for Cohen's f effect size. Descriptive labels have been attached to the f values (.1 is small, .25 is medium, and greater than .40 is large), although the magnitude of the effect size obtained in a study should be interpreted relative to other research in the particular field of study. Some widely available software programs report partial eta-squared values, but they overestimate actual effect sizes. Because of this positive bias, some researchers prefer to calculate omega-squared effect sizes.

Carol S. Parke

See also Dependent Variable; Effect Size, Measures of; Independent Variable; Main Effects; Post Hoc Analysis; Repeated Measures Design

Further Readings

Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Boston: Allyn & Bacon.
Green, S. B., & Salkind, N. J. (2005). Using SPSS for Windows: Analyzing and understanding data (4th ed.). Upper Saddle River, NJ: Prentice Hall. [see examples of conducting simple main effects and interaction comparisons for significant two-factor interactions]
Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Pacific Grove, CA: Wadsworth.
Stevens, J. (2007). Intermediate statistics: A modern approach (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Warner, R. M. (2008). Applied statistics: From bivariate through multivariate techniques. Thousand Oaks, CA: Sage.

FACTOR LOADINGS

Factor loadings are part of the outcome from factor analysis, which serves as a data reduction method designed to explain the correlations between observed variables using a smaller number of factors. Because factor analysis is a widely used method in social and behavioral research, an in-depth examination of factor loadings and the related factor-loading matrix will facilitate a better understanding and use of the technique.

Factor Analysis and Factor Loadings

Factor loadings are coefficients found in either a factor pattern matrix or a factor structure matrix. The former matrix consists of regression coefficients that multiply common factors to predict observed variables, also known as manifest variables, whereas the latter matrix is made up of product-moment correlation coefficients between common factors and observed variables.

The pattern matrix and the structure matrix are identical in orthogonal factor analysis, where common factors are uncorrelated. This entry primarily examines factor loadings in this modeling situation, which is most commonly seen in applied research. Therefore, the majority of the entry content is devoted to factor loadings that are both regression coefficients in the pattern matrix and correlation coefficients in the structure matrix. Factor loadings in oblique factor analysis, where common factors are correlated and the two matrices differ, are briefly discussed at the end of the entry. In addition, factor analysis can be exploratory (EFA) or confirmatory (CFA). EFA does not assume any model a priori, whereas CFA is designed to confirm a theoretically established factor model. Factor loadings play similar roles in these two modeling situations. Therefore, in this entry on factor loadings, the term factor analysis refers to both EFA and CFA, unless stated otherwise.

Overview

Factor analysis, primarily EFA, assumes that common factors do exist that are indirectly measured by observed variables, and that each observed variable is a weighted sum of common factors plus a unique component. Common factors are latent, and they influence one or more observed variables. The unique component represents all those independent things, both systematic and random, that are specific to a particular observed
variable. In other words, a common factor is loaded by at least one observed variable, whereas each unique component corresponds to one and only one observed variable. Factor loadings are correlation coefficients between observed variables and latent common factors.

Factor loadings can also be viewed as standardized regression coefficients, or regression weights. Because an observed variable is a linear combination of latent common factors plus a unique component, such a structure is analogous to a multiple linear regression model where each observed variable is a response and common factors are predictors. From this perspective, factor loadings are viewed as standardized regression coefficients when all observed variables and common factors are standardized to have unit variance. Stated differently, factor loadings can be thought of as an optimal set of regression weights that predicts an observed variable using latent common factors.

Factor loadings usually take the form of a matrix, and this matrix is a standard output of almost all statistical software packages when factor analysis is performed. The factor loading matrix is usually denoted by the capital Greek letter Λ, or lambda, whereas its matrix entries, or factor loadings, are denoted by λij, with i being the row number and j the column number. The number of rows of the matrix equals the number of observed variables, and the number of columns equals the number of common factors.

Factor Loadings in a Hypothetical Example

A typical example involving factor analysis is to use personality questionnaires to measure underlying psychological constructs. Item scores are observed data, and common factors correspond to latent personality attributes.

Suppose a psychologist is developing a theory that hypothesizes there are two personality attributes that are of interest, introversion and extroversion. To measure these two latent constructs, the psychologist develops a 5-item personality instrument and administers it to a randomly selected sample of 1,000 participants.

Thus, each participant is measured on five variables, and each variable can be modeled as a linear combination of the two latent factors plus a unique component. Stated differently, the score on an item for one participant, say, Participant A, consists of two parts. Part 1 is the average score on this item for all participants having identical levels of introversion and extroversion as Participant A, and this average item score is denoted by a constant times this participant's level of introversion plus a second constant times his or her level of extroversion. Part 2 is the unique component that indicates the amount of difference between the item score from Participant A and the said average item score. Obviously, such a two-part scenario is highly similar to a description of regression analysis.

In the above example, factor loadings for this item, or this observed variable, are nothing but the two constants that are used to multiply introversion and extroversion. There is a set of factor loadings for each item or each observed variable.

Factor Loadings in a Mathematical Form

The mathematical form of a factor analysis model is

    x = Λf + ε,

where x is a p-variate vector of standardized, observed data; Λ is a p × m matrix of factor loadings; f is an m-variate vector of standardized common factors; and ε is a p-variate vector of standardized unique components.

Back to the previous example, in which p = 5 and m = 2. So the factor loading matrix Λ is a 5 × 2 matrix consisting of correlation coefficients. The above factor analysis model can be written in another form for each item:

    x1 = λ11 f1 + λ12 f2 + ε1,
    x2 = λ21 f1 + λ22 f2 + ε2,
    x3 = λ31 f1 + λ32 f2 + ε3,
    x4 = λ41 f1 + λ42 f2 + ε4,
    x5 = λ51 f1 + λ52 f2 + ε5.

Each of the five equations corresponds to one item. In other words, each observed variable is represented by a weighted linear combination of common factors plus a unique component. And for each item, two factor loading constants bridge observed data and common factors. These constants are standardized regression weights because the observed data, common factors, and unique components are all standardized to have zero mean
and unit variance. For example, in determining standardized x1, f1 is given the weight λ11 and f2 is given the weight λ12, whereas in determining standardized x2, f1 is given the weight λ21 and f2 is given the weight λ22.

The factor loading matrix can be used to define an alternative form of the factor analysis model. Suppose the observed correlation matrix and the factor model correlation matrix are RX and RF, respectively. The following alternative factor model can be defined using the factor loading matrix:

    RF = ΛΛᵀ + ψ,

where ψ is a diagonal matrix of unique variances. Some factor analysis algorithms iteratively solve for Λ and ψ so that the difference between RX and RF is minimized.

Communality and Unique Variance

Based on factor loadings, communality and unique variance can be defined. These two concepts relate to each observed variable.

The communality for an observed variable refers to the amount of variance in that variable that is explained by common factors. If the communality value is high, at least one of the common factors has a substantial impact on the observed variable. The sum of squared factor loadings is the communality value for that observed variable. Most statisticians use h²ᵢ to denote the communality value for the ith observed variable.

The unique variance for an observed variable is computed as 1 minus that variable's communality value. The unique variance represents the amount of variance in that variable that is not explained by common factors.

Issues Regarding Factor Loadings

Significance of Factor Loadings

There are usually three approaches to the determination of whether or not a factor loading is significant: cutoff value, t test, and confidence interval. However, it should be noted that the latter two are not commonly seen in applied research.

A factor loading that falls outside of the interval bounded by (± cutoff value) is considered to be large and is thus retained. On the other hand, a factor loading that does not meet the criterion indicates that the corresponding observed variable should not load on the corresponding common factor. The cutoff value is arbitrarily selected depending on the field of study, but (± 0.4) seems to be preferred by many researchers.

A factor loading can also be t tested, and the null hypothesis for this test is that the loading is not significantly different from zero. The computed t statistic is compared with the threshold chosen for statistical significance. If the computed value is larger than the threshold, the null is rejected in favor of the alternative hypothesis, which states that the factor loading differs significantly from zero.

A confidence interval (CI) can be constructed for a factor loading, too. If the CI does not cover zero, the corresponding factor loading is significantly different from zero. If the CI does cover zero, no conclusion can be made regarding the significance status of the factor loading.

Rotated Factor Loadings

The need to rotate a factor solution relates to the factorial complexity of an observed variable, which refers to the number of common factors that have a significant loading for this variable. In applied research, it is desirable for an observed variable to load significantly on one and only one common factor, which is known as a simple structure. For example, a psychologist prefers to be able to place a questionnaire item into one and only one subscale.

When an observed variable loads on two or more factors, a factor rotation is usually performed to achieve a simple structure, which is a common practice in EFA. Of all rotation techniques, varimax is most commonly used. Applied researchers usually count on rotated factor loadings to interpret the meaning of each common factor.

Labeling Common Factors

In EFA, applied researchers usually use factor loadings to label common factors. EFA assumes that common factors exist, and efforts are made to
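The relations defined above, communality as a row sum of squared loadings, unique variance as its complement, and RF = ΛΛᵀ + ψ, can be illustrated numerically for the 5-item, 2-factor instrument described earlier. The loading values below are invented for illustration (the entry gives none):

```python
# Hypothetical orthogonal loading matrix for a 5-item, 2-factor instrument
# (rows are items, columns are factors; values invented for illustration).
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.3],
    [0.1, 0.9],
    [0.2, 0.7],
]

# Communality for item i: sum of squared loadings in row i.
communality = [sum(l * l for l in row) for row in loadings]
# Unique variance: 1 minus the communality.
unique_var = [1 - h for h in communality]

# Model-implied correlation matrix RF = Lambda Lambda^T + Psi,
# with Psi the diagonal matrix of unique variances.
p = len(loadings)
r_f = [[sum(loadings[i][k] * loadings[j][k] for k in range(2))
        + (unique_var[i] if i == j else 0.0)
        for j in range(p)] for i in range(p)]

# For standardized variables every diagonal element must equal 1,
# because communality and unique variance sum to the total variance.
assert all(abs(r_f[i][i] - 1.0) < 1e-12 for i in range(p))
```

For the first item, for instance, the communality is 0.8² + 0.1² = 0.65 and the unique variance is 0.35, so the model reproduces a diagonal entry of exactly 1.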
False Positive 483
polygraph identifies as honest and is, in fact, honest. A false positive is a person whom the polygraph identifies as dishonest but is, in fact, honest. A false negative is a person whom the polygraph identifies as honest but is, in fact, dishonest.

The final example involves the use of an integrity test, which is commonly used in business and industry in assessing the suitability of a candidate for a job. Although some people believe that integrity tests are able to identify individuals who will steal from an employer, the more general consensus is that such tests are more likely to identify individuals who will not be conscientious employees. In the case of an integrity test, the condition that the test is employed to identify is the unsuitability of a job candidate. With regard to a person's performance on an integrity test, a true positive is a person whom the test identifies as an unsuitable employee and, in fact, will be an unsuitable employee. A true negative is a person whom the test identifies as a suitable employee and, in fact, will be a suitable employee. A false positive is a person whom the test identifies as an unsuitable employee but, in fact, will be a suitable employee. A false negative is a person whom the test identifies as a suitable employee but, in fact, will be an unsuitable employee.

Relative Seriousness of False Positive Versus False Negative

It is often the case that the determination of a cutoff score on a test (or criterion of performance on a polygraph) for deciding to which category a person will be assigned will be a function of the perceived seriousness of incorrectly categorizing a person a false positive versus a false negative. Although it is not possible to state that one type of error will always be more serious than the other, a number of observations can be made regarding the seriousness of the two types of errors. The criterion for determining the seriousness of an error will always be a function of the consequences associated with the error. In medicine, physicians tend to view a false negative as a more serious error than a false positive, the latter being consistent with the philosophy that it is better to treat a nonexistent illness than to neglect to treat a potentially serious illness. Yet things are not always that clear-cut. As an example, although the consequence of failure to diagnose breast cancer (a false negative) could cost a woman her life, the consequences associated with a woman being a false positive could range from minimal (e.g., the woman is administered a relatively benign form of chemotherapy) to severe (e.g., the woman has an unnecessary mastectomy).

In contrast to medicine, the American legal system tends to view a false positive as a more serious error than a false negative. The latter is reflected in the use of the "beyond a reasonable doubt" standard in criminal courts, which reflects the belief that it is far more serious to find an innocent person guilty than to find a guilty person innocent. Once again, however, the consequences associated with the relative seriousness of both types of errors may vary considerably depending upon the nature of the crime involved. For example, one could argue that finding a serial killer innocent (a false negative) constitutes a far more serious error than wrongly convicting an innocent person of a minor felony (a false positive) that results in a suspended sentence.

The Low Base Rate Problem

The base rate of a behavior or medical condition is the frequency with which it occurs in a population. The low base rate problem occurs when a diagnostic test that is employed to identify a low base rate behavior or condition tends to yield a disproportionately large number of false positives. Thus, when a diagnostic test is employed in medicine to detect a rare disease, it may, in fact, identify virtually all of the people who are afflicted with the disease, but in the process erroneously identify a disproportionately large number of healthy people as having the disease, and because of the latter, the majority of people labeled positive will, in fact, not have the disease.

The relevance of the low base rate problem to polygraph and integrity testing is that such instruments may correctly identify most guilty individuals and potentially unsuitable employees, yet, in the process, erroneously identify a large number of innocent people as guilty or, in the case of an integrity test, a large number of potentially suitable employees as unsuitable. Estimates of false positive rates associated with the polygraph and integrity tests vary substantially, but critics of the
latter instruments argue that error rates are unacceptably high. In the case of integrity tests, people who utilize them may concede that although such tests may yield a large number of false positives, at the same time they have a relatively low rate of false negatives. Because of the latter, companies that administer integrity tests cite empirical evidence that such tests are associated with a decrease in employee theft and an increase in productivity. In view of the latter, they consider the consequences associated with a false positive (not hiring a suitable person) to be far less damaging to the company than the consequences associated with a false negative (hiring an unsuitable person).

Use of Bayes's Theorem for Computing a False Positive Rate

In instances where the false positive rate for a test cannot be determined from empirical data, Bayes's theorem can be employed to estimate the latter. Bayes's theorem is a rule for computing conditional probabilities that was stated by an 18th-century English clergyman, the Reverend Thomas Bayes. A conditional probability is the probability of Event A given the fact that Event B has already occurred. Bayes's theorem assumes there are two sets of events. In the first set, there are n events to be identified as A1, A2, . . . , An, and in the second set, there are two events to be identified as B+ and B−. Bayes's theorem allows for the computation of the probability that Aj (where 1 ≤ j ≤ n) will occur, given it is known that B+ has occurred. As an example, the conditional probability P(A2/B+) represents the probability that Event A2 will occur, given the fact that Event B+ has already occurred.

An equation illustrating the application of Bayes's theorem is presented below. In the latter equation, it is assumed that Set 1 is comprised of the two events A1 and A2 and Set 2 is comprised of the two events B+ and B−. If it is assumed A1 represents a person who is, in fact, sick; A2 represents a person who is, in fact, healthy; B+ indicates a person who received a positive diagnostic test result for the illness in question; and B− indicates a person who received a negative diagnostic test result for the illness in question, then the conditional probability P(A2/B+) computed with Bayes's theorem represents the probability that a person will be healthy given that his or her diagnostic test result was positive.

    P(A2/B+) = P(B+/A2)P(A2) / [P(B+/A1)P(A1) + P(B+/A2)P(A2)]

When the above noted conditional probability P(A2/B+) is multiplied by P(B+) (the proportion of individuals in the population who obtain a positive result on the diagnostic test), the resulting value represents the proportion of false positives in the population. In order to compute P(A2/B+), it is necessary to know the population base rates A1 and A2 as well as the conditional probabilities P(B+/A1) and P(B+/A2). Obviously, if one or more of the aforementioned probabilities is not known or cannot be estimated accurately, computing a false positive rate will be problematical.
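The computation just described can be sketched numerically. The base rates and conditional probabilities below are invented for illustration; they are not taken from the entry.

```python
# Numerical sketch of the Bayes's theorem computation described above.
# A1 = sick, A2 = healthy, B+ = positive diagnostic test result.
# All probabilities below are hypothetical.
p_A1 = 0.01              # base rate of the disease (a rare condition)
p_A2 = 0.99              # base rate of being healthy
p_Bpos_given_A1 = 0.95   # P(B+/A1): sensitivity of the test
p_Bpos_given_A2 = 0.05   # P(B+/A2): rate of positives among the healthy

# P(B+): proportion of the population obtaining a positive result.
p_Bpos = p_Bpos_given_A1 * p_A1 + p_Bpos_given_A2 * p_A2

# Bayes's theorem: probability of being healthy given a positive result.
p_A2_given_Bpos = (p_Bpos_given_A2 * p_A2) / p_Bpos

# Proportion of false positives in the population: P(A2/B+) x P(B+).
false_positive_proportion = p_A2_given_Bpos * p_Bpos

print(round(p_A2_given_Bpos, 4))
print(round(false_positive_proportion, 4))
```

With these invented numbers, most people who test positive are in fact healthy, which also illustrates the low base rate problem discussed earlier in the entry.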
David J. Sheskin

See also Bayes's Theorem; Sensitivity; Specificity; True Positive

Further Readings

Feinstein, A. R. (2002). Principles of medical statistics. Boca Raton, FL: Chapman & Hall/CRC.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Hoboken, NJ: Wiley-Interscience.
Meehl, P., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194–216.
Pagano, M., & Gauvreau, K. (2000). Principles of biostatistics (2nd ed.). Pacific Grove, CA: Duxbury.
Rosner, B. (2006). Fundamentals of biostatistics (6th ed.). Belmont, CA: Thomson-Brooks/Cole.
Sheskin, D. J. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). Boca Raton, FL: Chapman & Hall/CRC.
Wiggins, J. (1973). Personality and prediction: Principles of personality assessment. Menlo Park, CA: Addison Wesley.
FALSIFIABILITY

The concept of falsifiability is central to distinguishing between systems of knowledge and
understanding, specifically between scientific theories of understanding the world and those considered nonscientific. The importance of the concept of falsifiability was developed most thoroughly by the philosopher Karl Popper in the treatise Conjectures and Refutations: The Growth of Scientific Knowledge. Specifically, falsifiability refers to the notion that a theory or statement can be found to be false; for instance, as the result of an empirical test.

Popper sought to distinguish between various means of understanding the world in an effort to determine what constitutes a scientific approach. Prior to his seminal work, merely the empirical nature of scientific investigation was accepted as the criterion that differentiated it from pseudo- or nonscientific research. Popper's observation that many types of research considered nonscientific were also based upon empirical techniques led to dissatisfaction with this conventional explanation. Consequently, several empirically based methods colloquially considered scientific were contrasted in an effort to determine what distinguished science from pseudoscience. Examples chosen by Popper to illustrate the diversity of empirical approaches included physics, astrology, Marxian theories of history, and metaphysical analyses. Each of these epistemic approaches represents a meaningful system of interpreting and understanding the world around us, and has been used earnestly throughout history with varying degrees of perceived validity and success.

Popper used the term line of demarcation to distinguish the characteristics of scientific from nonscientific (pseudoscientific) systems of understanding. What Popper reasoned differentiated the two categories of understanding is that the former could be falsified (or found to be not universally true), whereas the latter was either incapable of being falsified or had been used in such a way that renders falsification unlikely. According to Popper, this usage takes the form of seeking corroboratory evidence to verify the verisimilitude of a particular pseudoscientific theory. For example, with respect to astrology, proponents subjectively interpret events (data) in ways that corroborate their preconceived astrological theories and predictions, rather than attempting to find data that undermine the legitimacy of astrology as an epistemic enterprise.

Popper found similarity between astrologists and those who interpret and make predictions about historical events via Marxian analyses in that both have historically sought to verify rather than falsify their perspectives as a matter of practice. Where a lack of corroboration between reality and theory exists, proponents of both systems reinterpret their theoretical position so as to correspond with empirical observations, essentially undermining the extent to which the theoretical perspective can be falsified. The proponents of both pseudoscientific approaches tacitly accept the manifest truth of their epistemic orientations irrespective of the fact that apparent verisimilitude is contingent upon subjective interpretations of historical events.

Popper rejected the notion that scientific theories were those thought most universally true, given the notion that verifying theories in terms of their correspondence to the truth is a quixotic task requiring omniscience. According to Popper, one cannot predict the extent to which future findings could falsify a theory, and searching for verification of the truth of a given theory ignores this potentiality. Instead of locating the essence of science within a correspondence with truth, Popper found that the theories most scientific were those capable of being falsified. This renders all scientific theories tenable at best, in the sense that the most plausible scientific theories are merely those that have yet to be falsified.

Every empirical test of a theory is an attempt to falsify it, and there are degrees of testability with respect to theories as a whole. Focusing on falsification relocates power from the extent to which a theory corresponds with a given reality or set of circumstances to the extent to which it logically can be proven false given an infinite range of empirical possibilities. Contrarily, a hypothetical theory that is capable of perfectly and completely explaining a given phenomenon is inherently unscientific because it cannot be falsified logically. Where theories are reinterpreted to make them more compatible with potentially falsifying empirical information, this is done to the benefit of their correspondence with the data, but to the detriment of the original theory's claim to scientific status.

As an addendum, Popper rejected the notion that only tenable theories are most useful, because those that have been falsified may illuminate
constructive directions for subsequent research. Thus, the principle of falsificationism does not undermine the inherent meaning behind statements that fall short of achieving its standard of scientific status.

Some competing lines of demarcation in distinguishing scientific from pseudoscientific research include the verificationist and anarchistic epistemological perspectives. As previously noted, the simple standard imposed by verificationism states that a theory is considered scientific merely if it can be verified through the use of empirical evidence. A competing line involves Paul Feyerabend's anarchistic epistemological perspective, which holds that any and all statements and theories can be considered scientific because history shows that "whatever works" has been labeled scientific regardless of any additional distinguishing criteria.

Douglas J. Dallier

See also External Validity; Hypothesis; Internal Validity; Logic of Scientific Discovery, The; Research Design Principles; Test; Theory

Further Readings

Feyerabend, P. (1975). Against method: Outline of an anarchistic theory of knowledge. London: New Left Books.
Kuhn, T. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Lakatos, I., Feyerabend, P., & Motterlini, M. (1999). For and against method: Including Lakatos's lectures on scientific method and the Lakatos-Feyerabend correspondence. Chicago: University of Chicago Press.
Mace, C. A. (Ed.). (1957). Philosophy of science: A personal report: British philosophy in mid-century. London: Allen and Unwin.
Popper, K. (1962). Conjectures and refutations: The growth of scientific knowledge. New York: Basic Books.

FIELD STUDY

A field study refers to research that is undertaken in the real world, where the confines of a laboratory setting are abandoned in favor of a natural setting. This form of research generally prohibits the direct manipulation of the environment by the researcher. However, sometimes, independent and dependent variables already exist within the social structure under study, and inferences can then be drawn about behaviors, social attitudes, values, and beliefs. It must be noted that a field study is separate from the concept of a field experiment. Overall, field studies belong to the category of nonexperimental designs where the researcher uses what already exists in the environment. Alternatively, field experiments refer to the category of experimental designs where the researcher follows the scientific process of formulating and testing hypotheses by invariably manipulating some aspect of the environment. It is important that prospective researchers understand the types, aims, and issues; the factors that need to be considered; and the advantages and concerns raised when conducting the field study type of research.

Field studies belong to the category of nonexperimental design. These studies include the case study—an in-depth observation of one organization, individual, or animal; naturalistic observation—observation of an environment without any attempt to interfere with variables; participant observer study—observation through the researcher's submergence into the group under study; and phenomenology—observation derived from the researcher's personal experiences. The two specific aims of field studies are exploratory research and hypothesis testing. Exploratory research seeks to examine what exists in order to have a better idea about the dynamics that operate within the natural setting. Here, the acquisition of knowledge is the main objective. With hypothesis testing, the field study seeks to determine whether the null hypothesis or the alternative hypothesis best predicts the relationship of variables in the specific context; assumptions can then be used to inform future research.

Real-Life Research and Applications

Field studies have often provided information and reference points that otherwise may not have been available to researchers. For example, the famous obedience laboratory experiment by Stanley Milgram was criticized on the grounds that persons in real-life situations would not unquestioningly carry out unusual requests by persons perceived to
be authority figures as they did in the laboratory experiment. Leonard Bickman then decided to test the obedience hypothesis using a real-life application. He found that his participants were indeed more willing to obey the stooge who was dressed as a guard than the one who dressed as a sportsman or a milkman. Another example of field research usage is Robert Cialdini's investigation of how some professionals, such as con men, sales representatives, politicians, and the like, are able to gain compliance from others. In reality, he worked in such professions and observed the methods that these persons used to gain compliance from others. From his actual experiences, he was able to offer six principles that cover the compliance techniques used by others. Some field studies take place in the workplace to test attitudes and efficiency. Therefore, field studies can be conducted to examine a multitude of issues that include playground attitudes of children, gang behaviors, how people respond to disasters, efficiency of organization protocol, and even behavior of animals in their natural environment. Information derived from field studies results in correlational interpretations.

Strengths and Weaknesses

Field studies are employed in order to increase ecological and external validity. Because variables are not directly manipulated, the conclusions drawn are deemed to be true to life and generalizable. Also, such studies are conducted when there is absolutely no way of even creating mundane realism in the laboratory. For example, if there is a need to investigate looting behavior and the impact of persons on each other to propel this behavior, then a laboratory study cannot suffice for the investigation because of the complexity of the variables that may be involved. Field research is therefore necessary.

Although field studies are nonexperimental, this does not imply that such studies are not empirical. Scientific rigor is promoted by various means, including the methods of data collection used in the study. Data can be reliably obtained through direct observation, coding, note-taking, the use of interview questions—preferably structured—and audiovisual equipment to garner information. Even variables such as the independent variable, dependent variable, and other specific variables of interest that already operate in the natural setting may be identified and, to a lesser extent, controlled by the researcher because those variables would become the focus of the study. Overall, field studies tend to capture the essence of human behavior, particularly when the persons under observation are unaware that they are being observed, so that authentic behaviors are reflected without the influence of demand characteristics (reactivity) or social desirability answers. Furthermore, when observation is unobtrusive, the study's integrity is increased.

However, because field studies, by their very nature, do not control extraneous variables, it is exceedingly difficult to ascertain which factor or factors are more influential in any particular context. Bias can also be an issue if the researcher is testing a hypothesis. There is also the problem of replication. Any original field study sample will not be accurately reflective of any other replication of that sample. Furthermore, there is the issue of ethics. Many times, to avoid reactivity, researchers do not ask permission from their sample to observe them, and this may cause invasion-of-privacy issues even though such participants are in the public eye. For example, if research is being carried out about the types of kissing that take place in a park, even though the persons engaged in kissing are doing so in public, had they known that their actions were being videotaped, they may have strongly objected. Other problems associated with field studies include the fact that they can be quite time-consuming and expensive, especially if a number of researchers are required as well as audiovisual technology.

Indeira Persaud

See also Ecological Validity; Nonexperimental Design; Reactive Arrangements

Further Readings

Allen, M. J. (1995). Introduction to psychological research. Itasca, IL: F. E. Peacock.
Babbie, E. (2004). The practice of social research (10th ed.). Belmont, CA: Thomson Wadsworth.
Bickman, L. (1974). Clothes make the person. Psychology Today, 8(4), 48–51.
Cialdini, R. B. (2006). Influence: The psychology of persuasion. New York: HarperCollins.
Robson, C. (2003). Real world research. Oxford, UK: Blackwell.
Solso, R. L., Johnson, H. H., & Beal, M. K. (1998). Experimental psychology: A case approach. New York: Longman.

FILE DRAWER PROBLEM

The file drawer problem is the threat that the empirical literature is biased because nonsignificant research results are not disseminated. The consequence of this problem is that the results available provide a biased portrayal of what is actually found, so literature reviews (including meta-analyses) will conclude stronger effects than actually exist. The term arose from the image that these nonsignificant results are placed in researchers' file drawers, never to be seen by others. This file drawer problem also has several similar names, including publication or dissemination bias. Although all literature reviews are vulnerable to this problem, meta-analysis provides methods of detecting and correcting for this bias. This entry first discusses the sources of publication bias and then the detection and correction of such bias.

Sources

The first source of publication bias is that researchers may be less likely to submit null than significant results. This tendency may arise in several ways. Researchers engaging in "data snooping" (cursory data analyses to determine whether more complete pursuit is warranted) simply may not pursue investigation of null results. Even when complete analyses are conducted, researchers may be less motivated—due to expectations that the results will not be published, professional pride, or financial interest in finding supportive results—to submit results for publication.

The other source is that null results are less likely to be accepted for publication than are significant results. This tendency is partly due to reliance on decision making from a null hypothesis significance testing (versus effect size) framework; statistically significant results lead to conclusions, whereas null results are inconclusive. Reviewers who have a professional or financial interest in certain results may also be less accepting of and more critical toward null results than those that confirm their expectations.

Detection

Three methods are commonly used to evaluate whether publication bias exists within a literature review. Although one of these methods can be performed using vote-counting approaches to research synthesis, these approaches are typically conducted within a meta-analysis focusing on effect sizes.

The first method is to compare results of published versus unpublished studies, if the reviewer has obtained at least some of the unpublished studies. In a vote-counting approach, the reviewer can evaluate whether a higher proportion of published studies finds a significant effect than does the proportion of unpublished studies. In a meta-analysis, one performs moderator analyses that statistically compare whether effect sizes are greater in published versus unpublished studies. An absence of differences is evidence against a file drawer problem.

A second approach is through the visual examination of funnel plots, which are scatterplots of each study's effect size against its sample size. Greater variability of effect sizes is expected in smaller versus larger studies, given their greater sampling variability. Thus, funnel plots are expected to look like an isosceles triangle, with a symmetric distribution of effect sizes around the mean across all levels of sample size. However, small studies that happen to find small effects will not be able to conclude statistical significance and therefore may be less likely to be published. The resultant funnel plot will be asymmetric, with an absence of studies in the small sample size/small effect size corner of the triangle.

A third, related approach is to compute the correlation between effect sizes and sample sizes across studies. In the absence of publication bias, one expects no correlation; small and large studies should find similar effect sizes. However, if nonsignificant results are more likely relegated to the file drawer, then one would find that only the small studies finding large effects are published. This would result in a correlation between sample size and effect size (a negative correlation if the average
effect size is positive and a positive correlation if the average effect size is negative).
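The mechanism behind this sample size/effect size correlation can be sketched in a small simulation. Every number below (the true effect, the study sizes, the censoring rule) is invented for illustration, and the significance filter is deliberately crude.

```python
import numpy as np

rng = np.random.default_rng(0)
true_d = 0.2       # assumed true standardized mean difference
n_studies = 2000

# Simulate per-study sample sizes and observed effect sizes; for a
# two-group comparison, d has sampling variance of roughly 4/n.
n = rng.integers(20, 400, size=n_studies)
d = rng.normal(true_d, np.sqrt(4.0 / n))

# Crude publication filter: only studies with z > 1.96 are "published".
z = d / np.sqrt(4.0 / n)
published = z > 1.96

# Across all simulated studies, effect size and sample size are
# uncorrelated; among published studies, small samples need large
# effects to pass the filter, so the correlation turns negative
# (the asymmetric funnel).
r_all = np.corrcoef(n, d)[0, 1]
r_pub = np.corrcoef(n[published], d[published])[0, 1]
print(round(r_all, 3), round(r_pub, 3))
```

The first correlation hovers near zero while the second is clearly negative, which is exactly the diagnostic pattern this detection method looks for.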
Correction

There are four common ways of correcting for the file drawer problem. The first is not actually a correction, but an attempt to demonstrate that the results of a meta-analysis are robust to this problem. This approach involves computing a failsafe number, which represents the number of studies with an average effect size of zero that could be added to a meta-analysis before the average effect becomes nonsignificant. If the number is large, one concludes that it is not realistic that so many excluded studies could exist so as to invalidate the conclusions, so the review is robust to the file drawer problem.
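The failsafe number has several formulations; a minimal sketch of the best-known one, Rosenthal's fail-safe N (see the Rosenthal, 1979, reference below), is shown here. It combines one-tailed z values with Stouffer's method; the z values themselves are invented for illustration.

```python
import math

def failsafe_n(z_values, z_crit=1.645):
    """Rosenthal's fail-safe N: the number of unretrieved studies
    averaging z = 0 needed to pull the combined (Stouffer) z value
    down to the one-tailed .05 critical value z_crit."""
    k = len(z_values)
    z_sum = sum(z_values)
    return (z_sum / z_crit) ** 2 - k

# Hypothetical one-tailed z values from k = 5 retrieved studies.
zs = [2.1, 1.8, 2.5, 1.2, 2.0]
n_fs = failsafe_n(zs)
print(round(n_fs, 1))
```

Adding `n_fs` zero-effect studies to these five makes the combined Stouffer z exactly 1.645, which is the logic behind interpreting a large fail-safe N as robustness to the file drawer.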
A second approach is to exclude underpowered studies from a literature review. The rationale for this suggestion is that if the review includes only studies of a sample size large enough to detect a predefined effect size, then nonsignificant results should not result in publication bias among this defined set of studies. This suggestion assumes that statistical nonsignificance is the primary source of unpublished research. This approach has the disadvantage of excluding a potentially large number of studies with smaller sample sizes, and therefore might often be an inefficient solution.

A third way to correct for this problem is through trim-and-fill methods. Although several variants exist, the premise of these methods is a two-step process based on restoring symmetry to a funnel plot. First, one "trims" studies that are in the represented corner of the triangle until a symmetric distribution is obtained; the mean effect size is then computed from this subset of studies. Second, one restores the trimmed studies and "fills" the missing portion of the funnel plot by imputing studies to create symmetry; the heterogeneity of effect sizes is then estimated from this filled set.

A final method of management is through a family of selection (weighted distribution) models. These approaches use a distribution of publication likelihood at various levels of statistical significance to weight the observed distribution of effect sizes for publication bias. These models are statistically complex, and the field has not reached agreement on best practices in their use. One challenge is that the user typically must specify a selection model, often with little information.

Noel A. Card

See also Effect Size, Measures of; Literature Review; Meta-Analysis

Further Readings

Begg, C. B. (1994). Publication bias. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 399–409). New York: Russell Sage Foundation.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638–641.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Hoboken, NJ: Wiley.

FISHER'S LEAST SIGNIFICANT DIFFERENCE TEST

When an analysis of variance (ANOVA) gives a significant result, this indicates that at least one group differs from the other groups. Yet the omnibus test does not indicate which group differs. In order to analyze the pattern of difference between means, the ANOVA is often followed by specific comparisons, and the most commonly used involves comparing two means (the so-called pairwise comparisons).

The first pairwise comparison technique was developed by Ronald Fisher in 1935 and is called the least significant difference (LSD) test. This technique can be used only if the ANOVA F omnibus is significant. The main idea of the LSD is to compute the smallest significant difference (i.e., the LSD) between two means as if these means had been the only means to be compared (i.e., with a t test) and to declare significant any difference larger than the LSD.

Notations

The data to be analyzed comprise A groups, and a given group is denoted a. The number of observations of the ath group is denoted Sa. If all groups have the same size, the notation S is used. The
total number of observations is denoted N. The mean of Group a is denoted Ma+. From the ANOVA, the mean square of error (i.e., within group) is denoted MSS(A) and the mean square of effect (i.e., between group) is denoted MSA.

Least Significant Difference

The rationale behind the LSD technique comes from the observation that when the null hypothesis is true, the value of the t statistic evaluating the difference between Groups a and a′ is equal to

    t = (Ma+ − Ma′+) / sqrt[MSS(A) × (1/Sa + 1/Sa′)]    (1)

Note that LSD has more power compared to other post hoc comparison methods (e.g., the honestly significant difference test, or Tukey test) because the α level for each comparison is not corrected for multiple comparisons. And, because LSD does not correct for multiple comparisons, it severely inflates Type I error (i.e., finding a difference when it does not actually exist). As a consequence, a revised version of the LSD test has been proposed by Anthony J. Hayter (and is known as the Fisher-Hayter procedure) where the modified LSD (MLSD) is used instead of the LSD. The MLSD is computed using the Studentized range distribution q as

    MLSD = q(α, A−1) × sqrt[MSS(A) / S]    (5)
The data from a fictitious replication of Loftus' experiment are shown in Table 1. We have A = 5 groups and S = 10 participants per group. The ANOVA found an effect of the verb used on participants' responses. The ANOVA table is shown in Table 2.

Table 1  Results for a Fictitious Replication of Loftus and Palmer (1974) in Miles per Hour

Contact  Hit  Bump  Collide  Smash
  21      23    35     44      39
  20      30    35     40      44
  26      34    52     33      51
  46      51    29     45      47
  35      20    54     45      50
  13      38    32     30      45
  41      34    30     46      39
  30      44    42     34      51
  42      41    50     49      39
  26      35    21     44      55
M.+ 30    35    38     41      46

Table 2  ANOVA Results for the Replication of Loftus and Palmer (1974)

Source        df     SS     MS     F     Pr(F)
Between: A     4   1,460   365   4.56   .0036
Error: S(A)   45   3,600    80
Total         49   5,060

Least Significant Difference

For an α level of .05, the LSD for these data is computed as, with ν = 45 (the error degrees of freedom from Table 2),

LSD = t_{ν,.05} × √(2 × MS_S(A) / n)
    = 2.01 × √(2 × 80.00 / 10)
    = 2.01 × √(160 / 10)    (6)
    = 2.01 × 4
    = 8.04
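The LSD computation in Equation 6 can be reproduced in a few lines of code. This is a sketch, not part of the original entry; the critical values 2.01 and 2.69 are the t-table values for ν = 45 degrees of freedom quoted in the text.

```python
from math import sqrt

def lsd(t_crit, ms_error, n):
    """Least significant difference for equal group sizes n (Equation 6)."""
    return t_crit * sqrt(2 * ms_error / n)

ms_error = 80.0   # MS_S(A) from Table 2
n = 10            # participants per group

lsd_05 = lsd(2.01, ms_error, n)  # t_{45, .05} = 2.01
lsd_01 = lsd(2.69, ms_error, n)  # t_{45, .01} = 2.69
print(round(lsd_05, 2), round(lsd_01, 2))  # 8.04 10.76
```

Any pairwise difference between group means exceeding 8.04 is then declared significant at the .05 level, as in the notes to Table 3.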
Table 3  LSD: Differences Between Means and Significance of Pairwise Comparisons From the (Fictitious) Replication of Loftus and Palmer (1974)

Experimental Group
                      Contact   Hit       Bump      Collide   Smash
M1.+ = 30 (Contact)   0.00      5.00 ns   8.00 ns   11.00**   16.00**
M2.+ = 35 (Hit)                 0.00      3.00 ns    6.00 ns  11.00**
M3.+ = 38 (Bump)                          0.00       3.00 ns   8.00 ns
M4.+ = 41 (Collide)                                  0.00      5.00 ns
M5.+ = 46 (Smash)                                              0.00

Notes: Differences larger than 8.04 are significant at the α = .05 level and are indicated with *, and differences larger than 10.76 are significant at the α = .01 level and are indicated with **.

Table 4  MLSD: Differences Between Means and Significance of Pairwise Comparisons From the (Fictitious) Replication of Loftus and Palmer (1974)

Experimental Group
                      Contact   Hit       Bump      Collide   Smash
M1.+ = 30 (Contact)   0.00      5.00 ns   8.00 ns   11.00*    16.00**
M2.+ = 35 (Hit)                 0.00      3.00 ns    6.00 ns  11.00*
M3.+ = 38 (Bump)                          0.00       3.00 ns   8.00 ns
M4.+ = 41 (Collide)                                  0.00      5.00 ns
M5.+ = 46 (Smash)                                              0.00

Notes: Differences larger than 10.66 are significant at the α = .05 level and are indicated with *, and differences larger than 13.21 are significant at the α = .01 level and are indicated with **.
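The MLSD thresholds in the notes to Table 4 can be checked against Equation 5 in code. The Studentized range critical values used below, q_{.05, 4, 45} ≈ 3.77 and q_{.01, 4, 45} ≈ 4.67, are read from a q table and are my assumption; the entry does not list them.

```python
from itertools import combinations
from math import sqrt

means = {"Contact": 30, "Hit": 35, "Bump": 38, "Collide": 41, "Smash": 46}

# MLSD = q_{alpha, A-1} * sqrt(MS_S(A) / S), Equation 5, with A - 1 = 4 and 45 df
ms_error, s = 80.0, 10
mlsd_05 = 3.77 * sqrt(ms_error / s)  # q_{.05, 4, 45} ~ 3.77 (assumed from a q table)
mlsd_01 = 4.67 * sqrt(ms_error / s)  # q_{.01, 4, 45} ~ 4.67 (assumed from a q table)

# Flag each pairwise difference as in Table 4
flags = {}
for (g1, m1), (g2, m2) in combinations(means.items(), 2):
    d = abs(m1 - m2)
    flags[(g1, g2)] = "**" if d > mlsd_01 else "*" if d > mlsd_05 else "ns"

print(round(mlsd_05, 2), round(mlsd_01, 2))  # 10.66 13.21
print(flags[("Contact", "Smash")], flags[("Contact", "Collide")])  # ** *
```

Note that under the MLSD, the differences of 11.00 clear the .05 threshold (10.66) but not the .01 threshold (13.21), so they earn a single asterisk.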
A similar computation will show that, for these data, the LSD for an α level of .01 is equal to LSD = 2.69 × 4 = 10.76.

For example, the difference between M_contact+ and M_hit+ is declared nonsignificant because

|M_contact+ − M_hit+| = |30 − 35| = 5 < 8.04.    (7)

Further Readings

Hayter, A. J. (1986). The maximum familywise error rate of Fisher's least significant difference test. Journal of the American Statistical Association, 81, 1001–1004.

Seaman, M. A., Levin, J. R., & Serlin, R. C. (1991). New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110, 577–586.
variable to the specific levels of independent variables that are employed in a study. If the study is to be repeated, the same levels of independent variables would be used again. As such, the inference space of the study, or studies, is the specific set of levels of independent variables. Results are valid only at the levels that are explicitly studied, and no extrapolation is to be made to levels of independent variables that are not explicitly investigated in the study.

In practice, researchers often arbitrarily and systematically choose some specific levels of independent variables to investigate their effects according to a hypothesis and/or some prior knowledge about the relationship between the dependent and independent variables. These levels of independent variables are either of interest to the researcher or thought to be representative of the independent variables. During the experiment, these levels are held constant. Measurements taken at each fixed level of an independent variable, or at each combination of levels of independent variables, therefore constitute a known population of responses to that level (or combination of levels). Analyses then draw information from the mean variation of the study to make inference about the effect of those specific independent variables, at the specified levels, on the mean response of the dependent variable. A key advantage of a fixed-effects model design is that important levels of an independent variable can be purposefully investigated. As such, both human and financial resource utilization efficiency may be maximized. Examples of such purposeful investigations may be some specific dosages of a new medicine in a laboratory test for efficacy, some specific chemical compositions in metallurgical research on the strength of alloy steel, or some particular wheat varieties in an agricultural study on yields.

The simplest example of a fixed-effects model design for comparing the difference in population means is the paired t test model in a paired comparison design. This design is a variation of the more general randomized block design in that each experimental unit serves as a block. Two treatments are applied to each experimental unit, with the order varying randomly from one experimental unit to the next. The null hypothesis of the paired t test is μ1 − μ2 = 0. Because there is no sampling variability between treatments, the precision of estimates in this design is considerably improved compared with a two-sample t test model.

Fixed-effects model experiment designs contrast sharply with random-effects model designs, in which the levels of independent variables are randomly selected from a large population of all possible levels. Either this population of levels is infinite in size, or the size is sufficiently large that it can practically be considered infinite. The levels of independent variables are therefore believed to be random variables, and those chosen for a specific experiment are a random draw. If the experiment is repeated, these same levels are unlikely to be reused. Hence, it is meaningless to compare the means of the dependent variable at those specific levels in one particular experiment. Instead, the experiment seeks inference about the effect of the entire population of all possible levels, which is much broader than the specific ones used in the experiment, whether or not they are explicitly studied. In doing so, a random-effects model analysis draws conclusions from both within- and between-variable variation. Compared with fixed-effects model designs, the advantages of random-effects model designs are a more efficient use of statistical information and the ability to extrapolate results from one experiment to levels that are not explicitly used in that experiment. A key disadvantage is that some important levels of independent variables may be left out of an experiment, which could have an adverse effect on the generality of conclusions if those omitted levels turn out to be critical.

To illustrate the differences between a fixed- and a random-effects model analysis, consider a controlled, two-factor factorial design experiment. Assume that Factor A has a levels, Factor B has b levels, and n measurements are taken at each combination of the levels of the two factors. Table 1 illustrates the ANOVA table comparing a fixed- and a random-effects model. Notice in Table 1 that the initial steps for calculating the mean squares are similar in both analyses. The differences are the expected mean squares and the construction of hypothesis tests. Suppose that the conditions of normality, linearity, and equal variance are all met. The hypothesis tests on both the main and the interactive effects of the fixed-effects model ANOVA are then simply concerned with the error variance, which is the expected mean square of the experimental error. In comparison, the hypothesis tests on the main effects of the random-effects model ANOVA draw information from both the experimental error variance and the variance due to the interactive effect of the main experimental factors. In other words, hypothesis tests in a random-effects model ANOVA must be determined according to the expected mean squares. Finding an appropriate error term for a test is not as straightforward in a random-effects model analysis as in a fixed-effects model analysis, particularly when sophisticated designs such as the split-plot design are used. In these designs, one often needs to consult an authoritative statistical textbook rather than rely too heavily on commercial statistical software if one is not familiar with the relevant analytical procedure.

Table 1  Analysis of Variance Table for the Two-Factor Factorial Design Comparing a Fixed- With a Random-Effects Model, Where Factor A Has a Levels, Factor B Has b Levels, and n Replicates Are Measured at Each A × B Level

Source   SS     df           MS                        E(MS), Fixed    E(MS), Random
Error    SSE    ab(n − 1)    MSE = SSE/[ab(n − 1)]     E(MSE) = σ²     E(MSE) = σ²

Observational Studies

In observational studies, researchers are often unable to manipulate the levels of an independent variable as they frequently can in controlled experiments. By the nature of an observational study, there may be many influential independent variables. Of them, some may be correlated with each other, whereas others are independent. Some may be observable, but others are not. Of the unobservable variables, researchers may have knowledge of some but may be unaware of others. Unobservable variables are generally problematic and can complicate data analyses. Those that are hidden from the knowledge of the researchers are probably the worst offenders: They can lead to erroneous conclusions by obscuring the main results of a study. If a study takes repeated measures (panel data), some of those variables may change values over the course of the study, whereas others may not. All of these add complexity to data analyses.

If an observational study does have panel data, the choice of statistical models depends on whether or not the variables in question are correlated with the main independent variables. The fixed-effects model is an effective tool if variables are correlated, whether they are measured or unmeasured. Otherwise, a random-effects model should be employed. The size of observation units (e.g., the number of students in an education study) or of groupings of such units (e.g., the number of schools) is generally not a good criterion for choosing one particular statistical model over the other.

Take as an example a hypothetical ecological study in 10 cities of a country on the association between lung cancer prevalence rates and average cigarette consumption per capita in populations 45 years of age and older. Here, cigarette consumption is the primary independent variable, and the lung cancer prevalence rate is the dependent variable. Suppose that two surveys are done at two different times and that noticeable differences are observed in both cigarette consumption and lung cancer prevalence rates at each survey time, both across the 10 cities (i.e., intercity variation) and within each of the 10 cities (i.e., intracity variation over time). Both fixed and random regression models can be used to analyze the data, depending on the assumptions that one makes.

A fixed-effects regression model can be used if one assumes no significant changes over time in each city in the demographic characteristics, in the cigarette supply-and-demand relationship, in the air pollution level and pollutant chemical composition, or in other covariates that might be conducive to lung cancer, and if one further assumes that any unobservable variable that might simultaneously affect the lung cancer prevalence rate and the average per capita cigarette consumption does not change over time. Such a model can be written as

y_it = β0 + β1 x_it + α_i + ε_it    (1)

where y_it and x_it are, respectively, the lung cancer prevalence rate and the average per capita cigarette consumption in the ith city at time t; α_i is a fixed parameter for the ith city; and ε_it is the error term for the ith city at time t. In this model, α_i captures the effects of all observed and unobserved time-invariant variables, such as demographic characteristics (including age, gender, and ethnicity), socioeconomic characteristics, air pollution, and other variables that could vary from city to city but are constant within the ith city (this is why the above model is called a fixed-effects model). By treating α_i as fixed, the model focuses only on the within-city variation while ignoring the between-city variation.

The estimation of Equation 1 becomes inefficient if many dummy variables are included in the model to accommodate a large number of observational units (α_i) in panel data, because this sacrifices many degrees of freedom. Furthermore, a large number of observational units coupled with only a few time points may result in the intercepts of the model containing substantial random error, making them inconsistent. Not much, if any, information can be gained from those noisy parameters. To circumvent these problems, one may convert the values of both the dependent and the independent variables of each observational unit into differences from their respective means for that unit. The differences in the dependent variable are then regressed on the differences in the independent variables, without an intercept term. The estimator then looks only at how changes in the independent variables cause the dependent variable to vary around a mean within an observational unit. As such, the unit effects are removed from the model by differencing.

It is clear from the above discussion that the key technique in using a fixed-effects model with panel data is to allow each observational unit ("city" in the earlier example) to serve as its own control, so that the data are grouped. Consequently, a great strength of the fixed-effects model is that it simultaneously controls for both observable and unobservable variables that are associated with each specific observational unit. The fixed-effect coefficients (α_i) absorb all of the across-unit influences, leaving only the within-unit effect for the analysis. The result then simply shows how much the dependent variable changes, on average, in response to variation in the independent variables within the observational units; that is, in the earlier example, how much, on average, the lung cancer prevalence rate will go up or down in response to each unit change in average cigarette consumption per capita.

Because fixed-effects regression model analyses depend on each observational unit serving as its own control, the key requirements in applying them in research are as follows: (a) There must be two or more measurements on the same dependent variable in an observational unit; otherwise, the unit effect cannot be properly controlled; and (b) the independent variables of interest must change values on at least two of the measurement occasions in some of the observational units. In other words, the effect of any independent variable that does not have much within-unit variation cannot be estimated. Observational units with little within-unit variation in some independent variables contribute less information to the overall analysis with respect to those variables. The second point is easy to understand if one treats a variable that does not change values as a constant: A constant subtracted from a constant is zero, that is, a zero effect of such variables on the dependent variable. In this regard, fixed-effects models are mostly useful for studying the effects of independent variables that show within-observational-unit variation.

If, on the other hand, there is reasonable doubt regarding the assumptions made about a fixed-effects model, particularly if some independent variables are not correlated with the major independent variable(s), a fixed-effects model will not be able to remove the bias caused by those variables. For instance, in the above hypothetical study, a shortage in the cigarette supply may have caused a decrease in its consumption in some cities, or a successful promotion by cigarette makers or retailers may have persuaded more people to smoke in other cities. Such random changes from city to city make a fixed-effects model unable to control effectively for the between-city variation in some of the independent variables. If this happens, a random-effects model analysis would be more appropriate because it is able to accommodate the variation by incorporating two sources of error in the model. One source is specific to each individual observational unit, and the other source captures variation both within and between individual observational units.

Alternate Applications

After discussing the application of fixed-effects model analyses in designed and observational research, it may also be helpful to mention the utility of fixed-effects models in meta-analysis (a study of studies). This is a popular technique widely used for summarizing knowledge from individual studies in the social sciences, health research, and other scientific areas that rely mostly on observational studies to gather evidence. Meta-analysis is needed because both the magnitude and the direction of the effect size can vary considerably among observational studies that address the same question. Public policies, health practices, or products developed on the basis of the results of any individual study therefore may not achieve their desired effects as designed or as believed. Through meta-analysis, individual studies are
brought together and appraised systematically. Common knowledge is then explicitly generated to guide public policies, health practices, or product developments.

In meta-analysis, each individual study is treated as a single analysis unit and plugged into a suitable statistical model according to some assumptions. Fixed-effects models have long been used in meta-analysis, with the following assumptions: (a) Individual studies are merely samples of the same population, and the true effect for each of them is therefore the same; and (b) there is no heterogeneity among study results. Under these assumptions, only the sampling error (i.e., the within-study variation) is responsible for the differences (as reflected in the confidence interval) in the observed effect among studies; the between-study variation in the estimated effects has no consequence on the confidence interval in a fixed-effects model analysis. These assumptions may not be realistic in many instances and are frequently hotly debated. An important difficulty in applying a fixed-effects model in meta-analysis is that each individual study is conducted on different study units (individual persons, for instance) under a different set of conditions by different researchers. Any (or all) of these differences could introduce its (or their) effects into the studies and cause variation in their results. Therefore, one needs to consider not only within- but also between-study variation in a model in order to generalize knowledge properly across studies. Because the objective of meta-analysis is to seek validity generalization, and because heterogeneity tests are not always sufficiently sensitive, a random-effects model is thus believed to be more appropriate than a fixed-effects model. Unless there is truly no heterogeneity, confirmed through proper investigation, fixed-effects model analyses tend to overestimate the true effect by producing a smaller confidence interval. On the other hand, critics argue that random-effects models make assumptions about distributions that may or may not be realistic or justified; such models give more weight to small studies and are more sensitive to publication bias. Readers interested in meta-analysis should consult the relevant literature before embarking on a meta-analysis mission.

The arguments for and against fixed- and random-effects models seem so strong, at least on the surface, that a practitioner may be bewildered in the task of choosing the right model for specific research. In ANOVA, after a model is chosen, there is no easy way to identify the correct variance components for the computation of standard errors and for hypothesis tests (see Table 1, for example). This leads Gelman to advocate abolishing the terminology of fixed- and random-effects models. Instead, a unified approach is taken within a hierarchical (multilevel) model framework, regardless of whether one is interested in the effects of the specific treatments used in a particular experiment (fixed-effects model analyses in the traditional sense) or in the effects of the underlying population of treatments (random-effects model analyses otherwise). In meta-analysis, Bayesian model averaging is another alternative to fixed- and random-effects model analyses.

Final Thoughts

Fixed-effects models concern mostly the response of dependent variables at the fixed levels of independent variables in a designed experiment. Results thus obtained generally are not extrapolated to other levels that are not explicitly investigated in the experiment. In observational studies with repeated measures, fixed-effects models are used principally for controlling the effects of unmeasured variables if these variables are correlated with the independent variables of primary interest. If this assumption does not hold, a fixed-effects model cannot adequately control for inter-unit variation in some of the independent variables, and a random-effects model would be more appropriate.

Shihe Fan

See also Analysis of Variance (ANOVA); Bivariate Regression; Random-Effects Models

Further Readings

Gelman, A. (2005). Analysis of variance: Why it is more important than ever. Annals of Statistics, 33, 1–53.

Hocking, R. R. (2003). Methods and applications of linear models: Regression and the analysis of variance (2nd ed.). Hoboken, NJ: Wiley.

Kuehl, R. O. (1994). Statistical principles of research design and analysis. Belmont, CA: Duxbury.

Montgomery, D. C. (2001). Design and analysis of experiments (5th ed.). Toronto: Wiley.
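For the meta-analytic use of fixed-effects models discussed under Alternate Applications, a common estimator (not derived in the entry) is the inverse-variance-weighted mean, in which only within-study variation enters the weights. The effect sizes and variances below are fabricated for illustration.

```python
from math import sqrt

def fixed_effect_pool(effects, variances):
    """Inverse-variance (fixed-effect) pooling: each study is weighted by
    the reciprocal of its within-study variance; between-study variation
    plays no role, which is exactly the assumption the entry questions."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)

# Hypothetical effect sizes and within-study variances (fabricated numbers)
pooled, pooled_var = fixed_effect_pool([0.30, 0.10, 0.50], [0.02, 0.02, 0.02])
half_width = 1.96 * sqrt(pooled_var)  # 95% CI half-width
print(round(pooled, 3), round(half_width, 3))
```

With heterogeneous true effects, this confidence interval is too narrow, which is the overestimation-of-precision problem the entry attributes to fixed-effects meta-analysis.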
changed the lives of the study participants. Regardless of its purpose, follow-up always has cost implications.

Typical Follow-Up Activities

Participants

In the conduct of survey research, interviewers often have to make multiple attempts to schedule face-to-face and telephone interviews. When face-to-face interviews are being administered, appointments generally need to be scheduled in advance. However, participants' schedules may make this simple task difficult. In some cases, multiple telephone calls and/or letters may be required in order to set up a single interview. In other cases (e.g., a national census), follow-up may be required because participants were either not at home or were busy at the time of the interviewer's visits. Likewise, in the case of telephone interviews, interviewers may need to call potential participants several times before they are actually successful in getting participants on the phone.

With mail surveys, properly timed (i.e., predefined follow-up dates, usually every 2 weeks) follow-up reminders are an effective strategy to improve overall response rates. Without such reminders, mail response rates are likely to be less than 50%. Follow-up reminders generally take one of two forms: a letter or postcard reminding potential participants about the survey and encouraging them to participate, or a new survey package (i.e., a copy of the survey, a return envelope, and a reminder letter). The latter technique generally proves to be more effective because many potential participants either discard mail surveys as soon as they are received or are likely to misplace the survey if it is not completed soon after receipt.

New Developments

During a particular research study, any number of new developments can occur that would require follow-up action to correct. For example, a pilot study may reveal that certain questions were worded in such an ambiguous manner that most participants skipped the questions. To correct this problem, the questions would need to be reworded and a follow-up pilot study would need to be administered to ascertain the clarity of the reworded questions. Likewise, a supervisor may discover that one or more telephone interviewers are not administering their telephone surveys according to protocol. This would require that some follow-up training be conducted for those interviewers.

Project Milestones

Research activities require careful monitoring and follow-up to ensure that things are progressing smoothly. Major deviations from project milestones generally require quick follow-up action to get the activity back on schedule and to avoid schedule slippage and cost overruns.

Incentives

In research, incentives are often offered to encourage participation. Researchers and research organizations therefore need to follow up on their promises and mail the promised incentive to all persons who participated in the research.

Thank-You Letters

Information for research is collected using a number of techniques (e.g., focus groups, informants, face-to-face interviews). Follow-up thank-you letters should be a normal part of good research protocol to thank individuals for their time and contributions.

Stakeholder Debriefing

Following the completion of the research, one or more follow-up meetings may be held with stakeholders to discuss the research findings, as well as any follow-up studies that may be required.

Compliance With Institutional Review Boards

The U.S. Department of Health and Human Services (Office for Human Research Protections) Regulation 45 CFR 46.109(e) requires that institutional review boards conduct follow-up reviews at least annually on a number of specific issues when research studies exceed one year.
digits, and the leaf consists of the last digit. Whereas the stem can have any number of digits, the leaf will always have only one. Table 1 shows a stem-and-leaf plot of the ages of the participants at a city hall meeting.

Table 1  Stem-and-Leaf Plot of the Ages of the Participants at a City Hall Meeting

Stem   Leaf
3      34457
4
5      23568889
6      23778
7      14

The plot shows that 20 people participated in the city hall meeting: five in their 30s, none in his or her 40s, eight in their 50s, five in their 60s, and two in their 70s.

Stem-and-leaf plots have the advantage of being easily constructed from the raw data. Whereas the construction of cumulative frequency distributions and histograms often requires the use of computers, stem-and-leaf plots are a simple paper-and-pencil method for analyzing data sets. Moreover, no information is lost in the process of building a stem-and-leaf plot, as is the case in, for example, grouped frequency distributions.

Frequency Tables

A table that shows the distribution of the frequency of occurrence of the scores a variable may take in a data set is called a frequency table. Frequency tables are generally univariate, because it is more difficult to build multivariate tables. They can be drawn up for both ungrouped and grouped scores. Frequency tables with ungrouped scores are typically used for discrete variables and when the number of different scores the variable may take is relatively low. When the variable to be analyzed is continuous and/or the number of scores it may take is high, the scores are usually grouped into classes.

Two steps must be followed to build a frequency table out of a set of data. First, the scores or classes are arranged in an array (in ascending or descending order). Then, the number of observations corresponding to each score or falling within each class is counted. Table 2 presents a frequency distribution table for the age of the participants at the city hall meeting from the earlier example.

Table 2  Frequency Table of the Age of the Participants at a City Hall Meeting

Age y    Frequency f    Relative Frequency rf = f/n    Percentage Frequency p = 100 rf
30–39    5              0.25                            25.00
40–49    0              0.00                            00.00
50–59    8              0.40                            40.00
60–69    5              0.25                            25.00
70–79    2              0.10                            10.00
         n = 20         1.00                           100.00%

Apart from a list of the scores or classes and their corresponding frequencies, frequency tables may also contain relative frequencies or proportions (obtained by dividing the simple frequencies by the number of cases) and percentage frequencies (obtained by multiplying the relative frequencies by 100).

Frequency tables may also include cumulative frequencies, proportions, or percentages. Cumulative frequencies are obtained by adding the frequency of each observation to the sum of the frequencies of all previous observations. Cumulative proportions and cumulative percentages are calculated similarly; the only difference is that cumulative frequencies, instead of simple frequencies, are divided by the total number of cases to obtain cumulative proportions.

Frequency tables look similar for nominal or categorical variables, except that the first column contains categories instead of scores or classes. In some frequency tables, the missing scores for nominal variables are not counted, and thus proportions and percentages are computed based on the number of nonmissing scores. In other frequency tables, the missing scores may be included as a category so that proportions and percentages can be computed based on the full sample size of nonmissing and missing scores. Either approach has analytical value, but authors must be clear about which base number is used in calculating any proportions or percentages.
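The two steps just described (array the scores, then count) map directly onto code. Below is a sketch using the 20 ages recovered from the stem-and-leaf plot in Table 1 (reading each stem-leaf pair back as an age is my reconstruction); it reproduces the frequencies, proportions, percentages, and cumulative frequencies of Table 2.

```python
from collections import Counter

# Ages read off the stem-and-leaf plot (stem = tens digit, leaf = units digit)
ages = [33, 34, 34, 35, 37,
        52, 53, 55, 56, 58, 58, 58, 59,
        62, 63, 67, 67, 68,
        71, 74]

n = len(ages)
counts = Counter((age // 10) * 10 for age in ages)  # group into decade classes

cum_f = 0
for lower in range(30, 80, 10):
    f = counts.get(lower, 0)
    cum_f += f
    rf = f / n               # relative frequency (proportion)
    p = 100 * rf             # percentage frequency
    print(f"{lower}-{lower + 9}: f={f} rf={rf:.2f} p={p:.2f} cum f={cum_f}")
```

The proportions printed sum to 1.00 and the final cumulative frequency equals n, the checks the entry notes any correct frequency table must pass.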
Frequency Graphs

Frequency graphs can take the form of bar charts and histograms, or polygons.

A frequency distribution can be displayed graphically through the use of a bar chart or a histogram. Bar charts are used for categorical variables, whereas histograms are used for scalable variables. Bar charts resemble histograms in that bar heights correspond to frequencies, proportions, or percentages. Unlike the bars in a histogram, bars in bar charts are separated by spaces, thus indicating that the categories are in arbitrary order and that the variable is categorical. In contrast, spaces in a histogram signify zero scores.

Both bar charts and histograms are represented in an upper-right quadrant delimited by a horizontal x-axis and a vertical y-axis. The vertical axis typically begins with zero at the intersection of the two axes; the horizontal scale need not begin with zero if this leads to a better graphic representation. Scores are represented on the horizontal axis, and frequencies, proportions, or percentages are represented on the vertical axis. When working with classes, either the limits or the midpoints of the class intervals are measured on the x-axis. Each bar is centered on the midpoint of its corresponding class interval; its vertical sides are drawn at the real limits of the respective interval. The base of each bar represents the width of the class interval.

A frequency distribution is graphically displayed on the basis of the frequency table that summarizes the sample data. The frequency distribution in Table 2 is graphically displayed in the histogram depicted in Figure 1.

Figure 1  Histogram of the Age of the Participants at a City Hall Meeting

Frequency Polygons

Frequency polygons are drawn by joining the points formed by the midpoint of each class interval and the frequency corresponding to that class interval. However, it may be easier to derive the frequency polygon from the histogram. In this case, the frequency polygon is drawn by joining the midpoints of the upper bases of adjacent bars of the histogram by straight lines. Frequency polygons are typically closed at each end. To close them, lines are drawn from the point given by the midpoint of the upper base of each of the histogram's end columns to the midpoints of the adjacent intervals (on the x-axis). Figure 2 presents a frequency polygon based on the histogram in Figure 1.

Histograms and frequency polygons may also be constructed for relative frequencies and percentages in a similar way. The advantage of using graphs of relative frequencies is that they can be used to directly compare samples of different sizes. Frequency polygons are especially useful for graphically depicting cumulative distributions.

Distribution Shape and Modality

Frequency distributions can be described by their skewness, kurtosis, and modality.

Skewness

Frequency distributions may be symmetrical or skewed. Symmetrical distributions imply equal proportions of cases at any given distance above and below the midpoint on the score range scale. Consequently, each half of a symmetrical distribution looks like a mirror image of the other half. Symmetrical distributions may be uniform (rectangular) or bell-shaped, and they may have one, two, or more peaks.

Perfectly symmetrical distributions are seldom encountered in practice. Skewed or asymmetrical
506 Frequency Distribution
5
Figure 3 Example of Symmetrical, Positively Skewed,
2
and Negatively Skewed Distributions
0
35 45 55 65 75 85
Age Groups
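The construction of a frequency polygon from a histogram, as described above, can be sketched in a few lines of Python (an illustration, not part of the entry; only the standard library is used). The data are the number-of-children scores from this entry's Table 2, for which each score value serves as its own class of width 1:

```python
from collections import Counter

# Number-of-children scores from Table 2 (ungrouped data, so each
# score value acts as its own class with a width of 1)
scores = [0, 2, 2, 3, 4, 4, 5, 5, 5, 5]
freq = Counter(scores)

# Histogram: one bar per score value, height = frequency
xs = list(range(min(scores), max(scores) + 1))   # 0..5
heights = [freq.get(x, 0) for x in xs]           # [1, 0, 2, 1, 2, 4]

# Frequency polygon: join the midpoints of the bar tops, then close the
# polygon by dropping to zero one interval beyond each end of the histogram
polygon = [(min(scores) - 1, 0)] + list(zip(xs, heights)) + [(max(scores) + 1, 0)]
print(polygon)
```

Plotting these vertices with any charting library reproduces the shape of a frequency polygon such as the one in Figure 2.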
Table 1   Raw Data of the Number of Children Families Have in a Small Community

  5  2  3  4  5  5  4  5  2  0

Table 2   An Ascending Array of the Number of Children Families Have in a Small Community

  0  2  2  3  4  4  5  5  5  5

Table 3   Frequency Table of the Number of Children Families Have in a Small Community

  Number of     Frequency   Relative Frequency   Percentage
  Children y        f            rf = f/n         p = 100·rf
      0             1             0.10              10.00
      1             0             0.00               0.00
      2             2             0.20              20.00
      3             1             0.10              10.00
      4             2             0.20              20.00
      5             4             0.40              40.00
                  n = 10          1.00             100.00%

Relative frequencies, also called proportions, are computed as frequencies divided by the sample size: rf = f/n. In this equation, rf represents the relative frequency corresponding to a particular score, f represents the frequency corresponding to the same score, and n represents the total number of cases in the analyzed sample. They indicate the proportion of observations corresponding to each score. For example, the proportion of families with two children in the analyzed community is 0.20.

Percentages are computed as proportions multiplied by 100: p = rf(100), where p represents the percentage and rf represents the relative frequency corresponding to a particular score. They indicate what percentage of observations corresponds to each score. For example, 20% of the families in the observed sample have four children each. Proportions in a frequency table must sum to 1.00, whereas percentages must sum to 100.00. Due to rounding off, some imprecision may sometimes occur, and the total proportion and percentage may be just short of or a little more than 1.00 or 100.00%, respectively. However, this issue is largely avoided when computer programs for such calculations.
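These computations can be illustrated in Python (a sketch using only the standard library; the variable names are mine, not the entry's). The raw data are those of Table 1, and the resulting triples match the rows of Table 3:

```python
from collections import Counter

# Raw data from Table 1: number of children per family
children = [5, 2, 3, 4, 5, 5, 4, 5, 2, 0]
n = len(children)
freq = Counter(children)

table = {}
for score in range(0, max(children) + 1):
    f = freq.get(score, 0)              # frequency f
    rf = f / n                          # relative frequency (proportion), rf = f/n
    table[score] = (f, rf, 100 * rf)    # percentage, p = 100*rf

# Proportions sum to 1.00 and percentages to 100.00 (up to rounding)
assert abs(sum(rf for _, rf, _ in table.values()) - 1.0) < 1e-9
print(table[2])  # families with two children -> (2, 0.2, 20.0)
```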
classes. Four steps must be followed to build a grouped frequency table: (a) Arrange the scores into an array, (b) determine the number of classes, (c) determine the size or width of the classes, and (d) determine the number of observations that fall into each class.

Table 4 displays the scores obtained by a sample of students at an exam.

Table 4   Raw Data on Students' Test Scores

  86.5  87.7  78.8
  88.1  86.2  87.3
  99.6  96.3  92.1
  92.5  79.1  89.8
  76.1  98.8  98.1

A series of rules applies to class selection: (a) All observations must be included; (b) each observation must be assigned to only one class; (c) no scores can fall between two intervals; and (d) whenever possible, class intervals (the width of each class) must be equal. Typically, researchers choose a manageable number of classes. Too few classes lead to a loss of information from the data, whereas too many classes lead to difficulty in analyzing and understanding the data. It is sometimes recommended that the width of the class intervals be determined by dividing the difference between the largest and the smallest score (the range of scores) by the number of class intervals to be used. Referring to the above example, students' lowest and highest test scores are 76.1 and 99.6, respectively. The width of the class interval, i, would then be found by computing

i = (99.6 − 76.1)/5 = 4.7,

if the desired number of classes is five. The five classes would then be: 76.1–80.8; 80.9–85.5; 85.6–90.2; 90.3–94.9; 95.0–99.6. However, this is not a very convenient grouping. It would be easier to use intervals of 5 or 10, and limits that are multiples of 5 or 10. There are many situations in which midpoints are used for analysis, and midpoints of 5 and 10 intervals are easier to calculate than midpoints of 4.7 intervals. Based on this reasoning, the observations in the example data set are grouped into five classes, as in Table 5. Finally, the number of observations that fall within each interval is counted.

Frequencies measure the number of cases that fall within each class. This means that three students have scored between 75.1 and 80.0, and four students have scored between 95.1 and the maximum score of 100.0. Again, there may be some classes for which the frequency is zero, meaning that no case falls within that class. However, these classes must also be listed (for example, no students have obtained between 80.1 and 85.0 points; nevertheless, this case is listed in Table 5).

Table 5   Grouped Frequency Table of Students' Test Scores

  Students'      Frequency   Relative Frequency   Percentage
  Test Scores        f            rf = f/n         p = 100·rf
  75.1–80.0         3             0.20              20.00
  80.1–85.0         0             0.00               0.00
  85.1–90.0         6             0.40              40.00
  90.1–95.0         2             0.13              13.33
  95.1–100.0        4             0.27              26.67
                  n = 15          1.00             100.00%

In general, grouped frequency tables include a column displaying the classes and a column showing their corresponding frequencies, but they may also include relative frequencies (proportions) and percentages. Proportions and percentages are computed in the same way as for ungrouped frequency tables. Their meaning changes, though. For example, Table 5 shows that the proportion of students who have scored between 85.1 and 90.0 is 0.40, and that 13.33% of the students have scored between 90.1 and 95.0.

Stated Limits and Real Limits of a Class Interval

It is relevant when working with continuous variables to define both stated and real class limits. The lower and upper stated limits, also known as apparent limits of a class, are the lowest and highest scores that could fall into that class. For example, for the class 75.1–80.0, the lower stated limit is 75.1 and the upper stated limit is 80.0.

The lower real limit is defined as the point that is midway between the stated lower limit of a class and the stated upper limit of the next lower class. The upper real limit is defined as the point that is midway between the stated upper limit of a class
and the stated lower limit of the next higher class. For example, the lower real limit of the class 80.1–85.0 is 80.05, and the upper real limit is 85.05.

Real limits may be determined not only for classes, but also for numbers. In the case of numbers, real limits are the points midway between a particular number and the next lower and higher numbers on the scale used in the respective research. For example, the lower real limit of the number 4 on a 1-unit scale is 3.5, and its upper real limit is 4.5. However, real limits are not always calculated as midpoints. For example, most individuals identify their age using their most recent birthday. Thus, it is considered that a person 39 years old is at least 39 years old and has not reached his 40th birthday, and not that he is older than 38 years and 6 months and younger than 39 years and 6 months.

For discrete numbers, there are no such things as stated and real limits. When counting the number of people present at a meeting, limits do not extend below and above the respective number reported. If there are 120 people, all limits are equal to 120.

Each class has a midpoint defined as the point midway between the real limits of the class. Midpoints are calculated by adding the values of the stated or real limits of a class and dividing the sum by two. For example, the midpoint for the class 80.1–85.0 is

m = (80.1 + 85.0)/2 = (80.05 + 85.05)/2 = 82.55.

Advantages and Drawbacks of Using Frequency Tables

The main advantage of using frequency tables is that data are grouped and thus easier to read. Frequency tables allow the reader to immediately notice a series of characteristics of the analyzed data set that could probably not have been easily seen when looking at the raw data: the lowest score (i.e., 0, in Table 3); the highest score (i.e., 5, in Table 3); the most frequently occurring score (i.e., 5, in Table 3); and how many observations fall between two given scores (i.e., five families have between two and four children, in Table 3). Frequency tables also represent the first step in drawing histograms and calculating means from grouped data.

Using relative frequency distributions or percentage frequency tables is important when comparing the frequency distributions of samples with different sample sizes. Whereas simple frequencies depend on the total number of observations, relative frequencies and percentage frequencies do not and thus may be used for comparisons.

The main drawback of using frequency tables is the loss of detailed information. Especially when data are grouped into classes, the information for individual cases is no longer available. This means that all scores in a class are dealt with as if they were identical. For example, the reader of Table 5 learns that six students have scored between 85.1 and 90.0, but the reader does not learn any more details about the individual test results.

Oana Pusa Mihaescu

See also Cumulative Frequency Distribution; Descriptive Statistics; Distribution; Frequency Distribution; Histogram

Further Readings

Downie, N. M., & Heath, R. W. (1974). Basic statistical methods. New York: Harper & Row.
Fielding, J. L., & Gilbert, G. N. (2000). Understanding social statistics. Thousand Oaks, CA: Sage.
Fried, R. (1969). Introduction to statistics. New York: Oxford University Press.
Hamilton, L. (1996). Data analysis for social sciences: A first course in applied statistics. Belmont, CA: Wadsworth.
Kiess, H. O. (2002). Statistical concepts for the behavioral sciences. Boston: Allyn & Bacon.
Kolstoe, R. H. (1973). Introduction to statistics for the behavioral sciences. Homewood, IL: Dorsey.
Lindquist, E. F. (1942). A first course in statistics: Their use and interpretation in education and psychology. Cambridge, MA: Riverside Press.

FRIEDMAN TEST

In an attempt to control for unwanted variability, researchers often implement designs that pair or group participants into subsets based on common
characteristics (e.g., randomized block design) or implement designs that observe the same participant across a series of conditions (e.g., repeated-measures design). The analysis of variance (ANOVA) is a common statistical method used to analyze data from a randomized block or repeated-measures design. However, the assumption of normality that underlies ANOVA is often violated, or the scale of measurement for the dependent variable is ordinal-level, hindering the use of ANOVA. To address this situation, economist Milton Friedman developed a statistical test based on ranks that may be applied to data from randomized block or repeated measures designs where the purpose is to detect differences across two or more conditions. This entry describes this statistical test, named the Friedman Test, which may be used in lieu of ANOVA. The Friedman test is classified as a nonparametric test because it does not require a specific distributional assumption. A primary advantage of the Friedman test is that it can be applied more widely as compared to ANOVA.

Procedure

The Friedman test is used to analyze several related (i.e., dependent) samples. Friedman referred to his procedure as the method of ranks in that it is based on replacing the original scores with rank-ordered values. Consider a study in which data are collected within a randomized block design where N blocks are observed over K treatment conditions on a dependent measure that is at least ordinal-level. The first step in the Friedman test is to replace the original scores with ranks, denoted Rjk, within each block; that is, the scores for block j are compared with each other, and a rank of 1 is assigned to the smallest observed score, a rank of 2 is assigned to the second smallest, and so on until the largest value is replaced by a rank of K. In the situation where there are ties within a block (i.e., two or more of the values are identical), the midrank is used. The midrank is the average of the ranks that would have been assigned if there were no ties. Note that this procedure generalizes to a repeated measures design in that the ranks are based on within-participant observations (or, one can think of the participants as defining the blocks). Table 1 presents the ranked data in tabular form.

Table 1   Ranks for Randomized Block Design

                      Treatment Conditions
               1      2     ...     K       Row Means
  Blocks  1   R11    R12    ...    R1K    R̄1· = (K+1)/2
          2   R21    R22    ...    R2K    R̄2· = (K+1)/2
          ..  ...    ...    ...    ...        ...
          N   RN1    RN2    ...    RNK    R̄N· = (K+1)/2
  Column
  Means       R̄·1    R̄·2    ...    R̄·K    R̄·· = (K+1)/2

It is apparent that row means in Table 1 (i.e., mean of ranks for each block) are the same across blocks; however, the column means (i.e., mean of ranks within a treatment condition) will be affected by differences across treatment conditions. Under the null hypothesis that there is no difference due to treatment, the ranks are assigned at random, and thus, an equal frequency of ranks would be expected for each treatment condition. Therefore, if there is no treatment effect, then the column means are expected to be the same for each treatment condition. The null hypothesis may be specified as follows:

H0: μR·1 = μR·2 = ··· = μR·K = (K + 1)/2.    (1)

To test the null hypothesis that there is no treatment effect, the following test statistic may be computed:

(TS) = N Σk=1..K (R̄·k − R̄··)² / [Σj=1..N Σk=1..K (Rjk − R̄··)² / (N(K − 1))],    (2)

where R̄·k represents the mean value for treatment k; R̄·· represents the grand mean (i.e., mean of all rank values); and Rjk represents the rank for block j and treatment k. Interestingly, the numerator and denominator of (TS) can be obtained using repeated measures ANOVA on the ranks. The numerator is the sum of squares for the treatment effect (SSeffect). The denominator is the sum of squares total (which equals the sum of squares within-subjects because there is no between-subjects variability) divided by the degrees of freedom for the treatment effect plus the degrees of freedom for the error term. Furthermore, the test statistic provided in Equation 2 does not need to be adjusted when ties exist.
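Equation 2 can be computed directly from the matrix of within-block ranks. The following Python sketch assumes NumPy and SciPy (not part of the entry); scipy.stats.rankdata's default "average" method assigns exactly the midranks described above:

```python
import numpy as np
from scipy.stats import rankdata

def friedman_ts(scores):
    """(TS) of Equation 2 for an N-blocks-by-K-treatments array of scores."""
    # Rank within each block (row); ties receive midranks
    ranks = np.apply_along_axis(rankdata, 1, np.asarray(scores, dtype=float))
    n_blocks, k = ranks.shape
    grand_mean = ranks.mean()          # equals (K + 1)/2
    col_means = ranks.mean(axis=0)     # mean rank per treatment condition
    numerator = n_blocks * ((col_means - grand_mean) ** 2).sum()
    denominator = ((ranks - grand_mean) ** 2).sum() / (n_blocks * (k - 1))
    return numerator / denominator

# A small hypothetical data set: 4 blocks observed under 3 conditions
ts = friedman_ts([[1.2, 2.4, 3.1],
                  [0.8, 1.9, 2.7],
                  [1.1, 1.0, 2.2],
                  [2.0, 2.1, 3.5]])
print(ts)  # 6.5
```

When no block contains ties, (TS) coincides with the familiar Friedman chi-square 12N Σ(R̄·k − R̄··)²/(K(K + 1)); with ties, the denominator of Equation 2 supplies the correction automatically.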
An exact distribution for the test statistic may be obtained using permutation in which all possible values of (TS) are computed by distributing the rank values within and across blocks in all possible combinations. For an exact distribution, the p value is determined by the proportion of values of (TS) in the exact distribution that are greater than the observed (TS) value. In the recent past, the use of the exact distribution in obtaining the p value was not feasible due to the immense computing power required to implement the permutation. However, modern-day computers can easily construct the exact distribution for even a moderately large number of blocks. Nonetheless, for a sufficient number of blocks, the test statistic is distributed as a chi-square with degrees of freedom equal to the number of treatment conditions minus 1 (i.e., K − 1). Therefore, the chi-square distribution may be used to obtain the p value for (TS) when the number of blocks is sufficient.

The Friedman test may be viewed as an extension of the sign test. In fact, in the context of two treatment conditions, the Friedman test provides the same result as the sign test. As a result, multiple comparisons may be conducted either by using the sign test or by implementing the procedure for the Friedman test on the two treatment conditions of interest. The familywise error rate can be controlled using typical methods such as Dunn-Bonferroni or Holm's Sequential Rejective Procedure. For example, when the degrees of freedom equals 2 (i.e., K = 3), then the Fisher least significant difference (LSD) procedure may be implemented in which the omnibus hypothesis is tested first; if the omnibus hypothesis is rejected, then each multiple comparison may be conducted using either the sign or the Friedman test on the specific treatment conditions using a full α level.

Example

Suppose a researcher was interested in examining the effect of three types of exercises (weightlifting, bicycling, and running) on resting heart rate as measured by beats per minute. The researcher implemented a randomized block design in which initial resting heart rate and body weight (variables that are considered important for response to exercise) were used to assign participants into relevant blocks. Participants within each block were randomly assigned to one of the three exercise modes (i.e., treatment condition). After one month of exercising, the resting heart rate of each participant was recorded and is shown in Table 2.

Table 2   Resting Heart Rate as Measured by Beats per Minute

             Weight-Lifting   Bicycling   Running
  Block  1         72            65         66
         2         65            67         67
         3         69            65         68
         4         65            61         60
         5         71            62         63
         6         65            60         61
         7         82            72         73
         8         83            71         70
         9         77            73         72
        10         78            74         73

The first step in the Friedman test is to replace the original scores with ranks within each block. For example, for the first block, the smallest original score of 65, which was associated with the participant in the bicycling group, was replaced by a rank of 1; the original score of 66 associated with the running group was replaced by a rank of 2; and the original score of 72, associated with weightlifting, was replaced by a rank of 3. Furthermore, note that for Block 2, the original values of resting heart rate were the same for the bicycling and running conditions (i.e., beats per minute equaled 67 for both conditions as shown in Table 2). Therefore, the midrank value of 2.5 was used, which was based on the average of the ranks they would have received if they were not tied (i.e., [2 + 3]/2 = 2.5). Table 3 reports the rank values for each block.

The mean of the ranked values for each block (R̄j·) is identical because the ranks were assigned within blocks. Therefore, there is no variability across blocks once the original scores have been replaced by the ranks. However, the mean of the ranks varies across treatment conditions (R̄·k). If the treatment conditions are identical in the
population, R̄·k s are expected to be similar across the three conditions (i.e., R̄·k = 2).

Table 3   Rank Values of Resting Heart Rate Within Each Block

             Weight-Lifting   Bicycling   Running    R̄j·
  Block  1         3              1          2        2
         2         1             2.5        2.5       2
         3         3              1          2        2
         4         3              2          1        2
         5         3              1          2        2
         6         3              1          2        2
         7         3              1          2        2
         8         3              2          1        2
         9         3              2          1        2
        10         3              2          1        2
  R̄·k             2.8            1.55       1.65      2

The omnibus test statistic, (TS), is computed for the data shown in Table 3 as follows:

Table 4   p Values for the Pairwise Comparisons

  Comparison                     p value (Exact, Two-tailed)
  Weight-lifting vs. Bicycling            0.021
  Weight-lifting vs. Running              0.021
  Bicycling vs. Running                   1.000

procedure can be used to control the familywise error rate. The omnibus test was significant at α = 0.05; therefore, each of the pairwise comparisons can be tested using α = 0.05. Table 4 reports the exact p values (two-tailed) for the three pairwise comparisons. From the analyses, the researcher can conclude that the weightlifting condition differed in its effect on resting heart rate compared to running and bicycling; however, it cannot be concluded that the running and bicycling conditions differed.

Craig Stephen Wells

See also Analysis of Variance
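The worked example can be checked against SciPy (an assumption of this sketch; the entry itself does not reference software). scipy.stats.friedmanchisquare uses the same midrank-based statistic, and the exact pairwise p values of Table 4 correspond to two-tailed sign tests, that is, exact binomial tests on the signs of the within-block differences with ties dropped:

```python
from scipy.stats import friedmanchisquare, binomtest

# Resting heart rate (beats per minute) from Table 2; one list per condition
weight_lifting = [72, 65, 69, 65, 71, 65, 82, 83, 77, 78]
bicycling      = [65, 67, 65, 61, 62, 60, 72, 71, 73, 74]
running        = [66, 67, 68, 60, 63, 61, 73, 70, 72, 73]

omnibus = friedmanchisquare(weight_lifting, bicycling, running)
print(round(omnibus.statistic, 2))  # about 9.9, significant at alpha = .05

def sign_test(a, b):
    """Exact two-tailed sign test: count positive differences, drop ties."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    wins = sum(d > 0 for d in diffs)
    return binomtest(wins, n=len(diffs), p=0.5).pvalue

print(round(sign_test(weight_lifting, bicycling), 3))  # 0.021, as in Table 4
print(round(sign_test(bicycling, running), 3))         # 1.0
```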
Given a null hypothesis H0 and a significance level α, the corresponding F test rejects H0 if the value of the F statistic is large; more precisely, if F > Fm,n;α, the upper αth quantile of the Fm,n distribution. The values of m and n depend upon the particular problem (comparing variances, ANOVA, multiple regression). The achieved (descriptive) level of significance (p value) of the test is the probability that a variable with the Fm,n distribution exceeds the observed value of the statistic F. The null hypothesis is rejected if p < α.

Many tables are available for the quantiles, but they can be obtained in Excel and in statistical computer packages, and p values are given in the output for various procedures.

F Tests

F tests are used in ANOVA. The total sum of squared deviations is decomposed into parts corresponding to different factors. In the normal case, these parts have distributions related to chi-square. The F statistics are ratios of these parts and hence have F distributions in the normal case.

One-Way ANOVA

One-way ANOVA is for comparison of the means of several groups. The data are Ygi, g = 1, 2, ..., k groups, i = 1, 2, ..., Ng cases in the gth group. The model is

Ygi = μg + εgi,
where the errors εgi have mean zero and constant variance σ², and are uncorrelated. The null hypothesis (hypothesis of no differences between group means) is H0: μ1 = μ2 = ··· = μk. It is convenient to reparametrize as μg = μ + αg, where αg is the deviation of the true mean μg for group g from the true overall mean μ. The deviations satisfy a constraint such as Σg=1..k Ng αg = 0. In terms of the αg, the model is

Ygi = μ + αg + εgi,

and H0 is α1 = α2 = ··· = αk = 0. The observations can be written in terms of the estimates as

Ygi = μ̂ + α̂g + ε̂gi = Y·· + (Yg· − Y··) + (Ygi − Yg·).

This is

(Ygi − Y··) = (Yg· − Y··) + (Ygi − Yg·).

Squaring both sides and summing gives the analogous decomposition of the sum of squares:

Σg=1..k Σi=1..Ng (Ygi − Y··)² = Σg Σi (Yg· − Y··)² + Σg Σi (Ygi − Yg·)²
                             = Σg Ng (Yg· − Y··)² + Σg Σi (Ygi − Yg·)²,

or SSTot = SSB + SSW. Here, SSTot denotes the total sum of squares; SSB, between-group sum of squares; and SSW, within-group sum of squares. The decomposition of d.f. is

DFTot = DFB + DFW,

that is,

(N − 1) = (k − 1) + (N − k).

Each mean square is the corresponding sum of squares, divided by its d.f.: MSTot = SSTot/DFTot = SSTot/(N − 1) is just the sample variance of Y; MSB = SSB/DFB = SSB/(k − 1), and MSW = SSW/DFW = SSW/(N − k). The relevant F statistic is F = MSB/MSW. For F to have an F distribution, the errors must be normally distributed. The residuals can be examined to see if their histogram looks bell-shaped and not too heavy-tailed, and a normal quantile plot can be used.

The noncentrality parameter can be interpreted as a measure of dispersion of the true means μg. The noncentral F distribution is related to the noncentral chi-square distribution. The noncentral chi-square distribution with m degrees of freedom and noncentrality parameter δ² is the distribution of the sum of squares of m independent normal variables with variances equal to 1 and means whose sum of squares is δ². If, in the ratio (U/m)/(V/n), the variable U has a noncentral chi-square distribution, then the ratio has a noncentral F distribution. When the null hypothesis of equality of means is false, the test statistic has a noncentral F distribution. The noncentrality parameter depends upon the group means and the sample sizes. Power computations involve the noncentral F distribution. It is via the noncentrality parameter that one specifies what constitutes a reasonably large departure from the null hypothesis. Ideally, the level α and the sample sizes are chosen so that the power is sufficiently large (say, .8 or .9) for large departures from the null hypothesis.

Randomized Blocks Design

This is two-way ANOVA with no replication. There are two factors A and B with a and b levels,
thus, ab observations. The decomposition of the sum of squares is

SSTot = SSA + SSB + SSRes.

The decomposition of d.f. is

DFTot = DFA + DFB + DFRes,

that is,

(ab − 1) = (a − 1) + (b − 1) + (a − 1)(b − 1).

Each mean square is the corresponding sum of squares, divided by its d.f.: MSA = SSA/DFA = SSA/(a − 1), MSB = SSB/DFB = SSB/(b − 1), MSRes = SSRes/DFRes = SSRes/[(a − 1)(b − 1)]. The test statistics are FA = MSA/MSRes with a − 1 and (a − 1)(b − 1) d.f. and FB = MSB/MSRes with b − 1 and (a − 1)(b − 1) d.f.

Two-Way ANOVA

When there is replication, with r replicates for each combination of levels of A and B, the decomposition of SSTot is

SSTot = SSA + SSB + SSA×B + SSRes,

understand the F tests for more complicated designs.

Multiple Regression

Given a data set of observations on explanatory variables X1, X2, ..., Xp and a dependent variable Y for each of N cases, the multiple linear regression model takes the expected value of Y to be a linear function of X1, X2, ..., Xp. That is, the mean Ex(Y) of Y for given values x1, x2, ..., xp is of the form

Ex(Y) = β0 + β1 x1 + β2 x2 + ··· + βp xp,

where the βj are parameters to be estimated, and the error is additive,

Y = Ex(Y) + ε.

Writing this in terms of the N cases gives the observational model for i = 1, 2, ..., N,

Yi = β0 + β1 x1i + β2 x2i + ··· + βp xpi + εi.

The assumptions on the errors are that they have mean zero and common variance σ², and are uncorrelated.

The F statistic for testing the null hypothesis concerning the coefficient of a single variable is the square of the t statistic for this test. But F can be used for testing several variables at a time. It is often of interest to test a portion of a model, that is, to test whether a subset of the variables—say, the first q variables—is adequate. Let p = q + r; it is being considered whether the last r = p − q variables are needed. The null hypothesis is

H0: βj = 0, j = q + 1, ..., p.

See also Analysis of Variance (ANOVA); Coefficients of Correlation, Alienation, and Determination; Experimental Design; Factorial Design; Hypothesis; Least Squares, Methods of; Significance Level, Concept of; Significance Level, Interpretation and Construction; Stepwise Regression

Further Readings

Bennett, J. H. (Ed.). (1971). The collected papers of R. A. Fisher. Adelaide, Australia: University of Adelaide Press.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver and Boyd.
Graybill, F. A. (1976). Theory and application of the linear model. N. Scituate, MA: Duxbury Press.
Hogg, R. V., McKean, J. W., & Craig, A. T. (2005). Introduction to mathematical statistics (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Kempthorne, O. (1952). The design and analysis of experiments. New York: Wiley.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames: Iowa State University Press.
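The one-way F test described in this entry can be sketched numerically (an illustration assuming NumPy, SciPy, and hypothetical data; it is not part of the entry). The decomposition SSTot = SSB + SSW, the mean squares, and F = MSB/MSW are computed directly and checked against scipy.stats.f_oneway:

```python
import numpy as np
from scipy.stats import f, f_oneway

# Three hypothetical groups (k = 3)
groups = [np.array([4.1, 5.0, 4.7, 5.3]),
          np.array([5.9, 6.2, 5.5, 6.8]),
          np.array([4.9, 5.1, 5.6, 5.0])]

all_y = np.concatenate(groups)
grand = all_y.mean()
k, n_total = len(groups), all_y.size

ssb = sum(g.size * (g.mean() - grand) ** 2 for g in groups)  # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within groups
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_stat = msb / msw
p_value = f.sf(f_stat, k - 1, n_total - k)  # upper tail of F(k-1, N-k)

# Agrees with scipy's built-in one-way ANOVA
res = f_oneway(*groups)
assert np.isclose(f_stat, res.statistic) and np.isclose(p_value, res.pvalue)
```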
G

GAIN SCORES, ANALYSIS OF

Gain (i.e., change, difference) is defined here as the difference between test scores obtained for an individual or group of individuals from a measurement instrument, intended to measure the same attribute, trait, concept, construct, or skill, between two or more testing occasions. This difference does not necessarily mean that there is an increase in the test score(s). Thus, a negative difference is also described as a "gain score." There are a multitude of reasons for measuring gain: (a) to evaluate the effects of instruction or other treatments over time, (b) to find variables that correlate with change for developing a criterion variable in an attempt to answer questions such as "What kinds of students grow fastest on the trait of interest?," and (c) to compare individual differences in gain scores for the purpose of allocating service resources and selecting individuals for further or special study.

The typical and most intuitive approach to the calculation of change is to compute the difference between two measurement occasions. This difference is called a gain score and can be considered a composite in that it is made up of a pretest (e.g., an initial score on some trait) and a posttest (e.g., a score on the same trait after a treatment has been implemented) score where a weight of 1 is assigned to the posttest and a weight of −1 is assigned to the pretest. Therefore, the computation of the gain score is simply the difference between posttest and pretest scores (i.e., gain = posttest − pretest). However, both the pretest and the posttest scores for any individual contain some amount of measurement error such that it is impossible to know a person's true score on any given assessment. Thus, in classical test theory (CTT), a person's observed score (X) is composed of two parts, some true score (T) and some amount of measurement error (E) as defined in Equation 1:

X = T + E.    (1)

In a gain score analysis, it is the change in the true scores (ΔT) that is of real interest. However, the researcher's best estimate of the true score is the person's observed score, thus making the gain score (i.e., the difference between observed scores) an unbiased estimator of ΔT for any given individual or subject. What follows is a description of methods for analyzing gain scores, a discussion of the reliability of gain scores, alternatives to the analysis of gain scores, and a brief overview of designs that measure change using more than two waves of data collection.

Methods for the Analysis of Gain Scores

The gain score can be used as a dependent variable in a t test (i.e., used to determine whether the mean difference is statistically significant for a group or whether the mean differences between two groups are statistically significantly different) or an analysis of variance (ANOVA) (i.e., used when the means of more than two groups or more
than two measurement occasions are compared) with the treatment, intervention, instructional mode (i.e., as with educational research) or naturally occurring group (e.g., sex) serving as the between-subjects factor. (For simplicity, throughout this entry, levels of the between-groups factors are referred to as treatment groups. However, the information provided also applies to other types of groups as well, such as intervention, instructional modes, and naturally occurring.) If the t test or the treatment main effect in an ANOVA is significant, the null hypothesis of no significant gain or difference in improvement between groups (e.g., treatment and control groups) can be rejected.

t Test

Depending on the research question and design, either a one-sample t test or an independent samples t test can be conducted using the gain score as the dependent variable. A one-sample t test can be used when the goal is to determine whether the mean gain score is significantly different from zero or some other specified value. When two groups (e.g., control and treatment) are included in the research design and the aim is to determine whether more gain is observed in the treatment group, for example, an independent t test can be implemented to determine whether the mean gain scores between groups are significantly different from each other. In this context, the gain score is entered as the dependent variable and more than two groups would be examined (e.g., a control and two different treatment groups).

Analysis of Variance

Like the goal of an independent t test, the aim of an ANOVA is to determine whether the mean gain scores between groups are significantly different from each other. Instead of conducting multiple t tests, an ANOVA is performed when more than two groups are present in order to control the type I error rate (i.e., rate of rejecting a true null hypothesis). However, differences in pretest scores between groups are not controlled for when conducting an ANOVA using the gain scores, which can result in misleading conclusions as discussed later.

Reliability of the Gain Score

Frederic M. Lord and Melvin R. Novick introduced the reliability of the gain score as the ratio of the variance of the difference score (σ²D) to the sum of the variance of the difference score and the variance of the error associated with that difference (σ²D + σ²errD):

ρD = σ²D / (σ²D + σ²errD).    (2)

Hence, the variance of the difference score is the systematic difference between subjects in their gain score. In other words, the reliability of the gain score is really a way to determine whether the assessment or treatment discriminates between those who change a great deal and those who change little, and to what degree. The reliability of the gain score can be further described in terms of the pretest and posttest variances along with their respective reliabilities and the correlation of the pretest with the posttest. Equation 3 describes this relationship from the CTT perspective, where observations are considered independent:

ρD = (σ²pre ρpre + σ²post ρpost − 2 σpre σpost ρpre,post) / (σ²pre + σ²post − 2 σpre σpost ρpre,post),    (3)

where ρD represents the reliability of the gain score (D) and σ²pre and σ²post designate the variance of the pretest and posttest scores, respectively. Likewise, σpre and σpost designate the standard deviations of the pretest and posttest scores, respectively, and ρpre and ρpost represent the reliabilities of the pretest and posttest scores, respectively. Lastly, ρpre,post designates the correlation between the pretest and posttest scores. Equation 3 further reduces to Equation 4:

ρD = (ρpre + ρpost − 2 ρpre,post) / (2 − 2 ρpre,post)    (4)

when the variances of the pretests and posttests are equal (i.e., σ²pre = σ²post). However, it is rare that equal variances are observed when a treatment is studied that is intended to show growth between pretesting and posttesting occasions. When growth is the main criterion, this equality should not be considered an indicator of construct validity, as it
has been in the past. In this case, it is merely an indication of whether rank order is maintained over time. If differing growth rates are observed, this equality will not hold. For example, effective instruction tends to increase the variability within a treatment group, especially when the measure used to assess performance has an ample number of score points to detect growth adequately (i.e., the scoring range is high enough to prevent ceiling effects). If ceiling effects are present or many students achieve mastery such that scores are concentrated near the top of the scoring scale, the variability of the scores declines.

The correlation between pretest and posttest scores for the treatment group provides an estimate of the reliability (i.e., consistency) of the treatment effect across individuals. When the correlation between the pretest and posttest is one, the reliability of the difference score is zero. This is because uniform responses are observed, and therefore, there is no ability to discriminate between those who change a great deal and those who change little. However, some researchers, Gideon J. Mellenbergh and Wulfert P. van den Brink, for example, suggest that this does not mean that the difference score should not be trusted. In this specific instance, a different measure (e.g., a measure of sensitivity) is needed to assess the utility of the assessment or the productivity of the treatment in question. Such measures may include, but are not limited to, Cohen's effect size or an investigation of information (i.e., precision) at the subject level.

Additionally, experimental independence (i.e., the pretest and posttest error scores are uncorrelated) is assumed by using the CTT formulation of reliability of the difference score. This is hardly the case with educational research, and it is likely that the errors are positively correlated; thus, the reliability of gain scores is often underestimated. As a result, in cases such as these, the additivity of error variances does not hold and leads to an inflated estimate of error variance for the gain score. Additionally, David R. Rogosa and John B. Willett contend that it is not the positive correlation of errors of measurement that inflates the reliability of the gain score, but rather individual differences in true change.

Contrary to historical findings, Donald W. Zimmerman and Richard H. Williams, among others, have shown that gain scores can be reliable under certain circumstances that depend upon the experimental procedure and the use of appropriate instruments. Williams, Zimmerman, and Roy D. Mazzagatti further discovered that for simple gains to be reliable, it is necessary that the intervention or treatment be strong and the measuring device or assessment be sensitive enough to detect changes due to the intervention or treatment. The question remains, ''How often does this occur in practice?'' Zimmerman and Williams show, by example, that with a pretest assessment that has a 0.9 reliability, if the intervention increases the variability of true scores, the reliability of the gain scores will be at least as high as that of the pretest scores. Conversely, if the intervention reduces the variability of true scores, the reliability of the gain scores decreases, thus placing its value between the reliabilities of the pretest and posttest scores. Given these findings, it seems that the outlook for the use of gain scores in research is not as bleak as it was once thought. In fact, only when there is no change or a reduction in the variance of the true scores as a result of the intervention(s) is the reliability of the gain score significantly lowered. Thus, when pretest scores are reliable, gain scores are reliable for research purposes.

Although the efficacy of using gain scores has historically been fraught with much controversy, as the main arguments against their use are that they are unreliable and negatively correlated with pretest scores, gain scores are currently gaining in application and appeal because of the resolution of misconceptions found in the literature on the reliability of gain scores. Moreover, depending on the research question, precision may be a better way to judge the utility of the gain score than reliability alone.

Alternative Analyses

Alternative statistical tests of significance can also be performed that do not include a direct analysis of the gain scores. An analysis of covariance (ANCOVA), residualized gain scores, and the Johnson-Neyman technique are examples. Many other examples also exist but are not presented here (see Further Readings for references to these alternatives).
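The reliability formulas discussed in this entry are easy to check numerically. The following Python sketch (our illustration; the function names are not from any standard library) implements Equations 3 and 4 and reproduces the classic result that two reliable, highly correlated tests can yield a gain score of much lower reliability:

```python
# Sketch of the CTT gain-score reliability formulas (Equations 3 and 4).
# Function names are illustrative, not from any standard library.

def gain_reliability(var_pre, var_post, rel_pre, rel_post, r_prepost):
    """Equation 3: reliability of the gain score D = posttest - pretest."""
    sd_pre, sd_post = var_pre ** 0.5, var_post ** 0.5
    num = var_pre * rel_pre + var_post * rel_post - 2 * sd_pre * sd_post * r_prepost
    den = var_pre + var_post - 2 * sd_pre * sd_post * r_prepost
    return num / den

def gain_reliability_equal_var(rel_pre, rel_post, r_prepost):
    """Equation 4: the special case of equal pretest and posttest variances."""
    return (rel_pre + rel_post - 2 * r_prepost) / (2 - 2 * r_prepost)

# Two reliable tests (0.80) that correlate 0.70 give a gain score that is
# far less reliable than either test: (0.8 + 0.8 - 1.4) / (2 - 1.4) = 1/3.
print(gain_reliability_equal_var(0.80, 0.80, 0.70))    # about 1/3
# Equation 3 reduces to Equation 4 when the two variances are equal.
print(gain_reliability(100.0, 100.0, 0.80, 0.80, 0.70))  # about 1/3
```

Raising the pre-post correlation toward one drives the computed reliability toward zero, matching the point made above about uniform change.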
estimation of individual gain scores. Methods for analyzing gain scores include, but are not limited to, t tests and ANOVA models. These models answer the question ''What is the effect of the treatment on change from pretest to posttest?'' Gain scores focus on the difference between measurements taken at two points in time and thus represent an incremental model of change. Ultimately, multiple waves of data should be considered for the analysis of individual change over time because it is unrealistic to view the process of change as following a linear and incremental pattern.

Tia Sukin

See also Analysis of Covariance (ANCOVA); Analysis of Variance (ANOVA); Growth Curves

Further Readings

Cronbach, L. J., & Furby, L. (1970). How we should measure ''change''—or should we? Psychological Bulletin, 74(1), 68–80.
Haertel, E. H. (2006). Reliability. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: Praeger.
Johnson, P. O., & Fay, L. C. (1950). The Johnson-Neyman technique, its theory and application. Psychometrika, 15(4), 349–367.
Knapp, T. R., & Schafer, W. D. (2009). From gain score t to ANCOVA F (and vice versa). Practical Assessment, Research & Evaluation, 14(6), 1–7.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mellenbergh, G. J., & van den Brink, W. P. (1998). The measurement of individual change. Psychological Methods, 3(4), 470–485.
Rogosa, D., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92(3), 726–748.
Rogosa, D. R., & Willett, J. B. (1985). Understanding correlates of change by modeling individual differences in growth. Psychometrika, 50, 203–228.
Williams, R. H., & Zimmerman, D. W. (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20(1), 59–69.
Williams, R. H., Zimmerman, D. W., & Mazzagatti, R. D. (1987). Large sample estimates of the reliability of simple, residualized, and base-free gain scores. Journal of Experimental Education, 55(2), 116–118.
Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Journal of Educational Measurement, 19(2), 149–154.
Zimmerman, D. W., & Williams, R. H. (1998). Reliability of gain scores under realistic assumptions about properties of pre-test and post-test scores. British Journal of Mathematical and Statistical Psychology, 51(2), 343–351.

GAME THEORY

Game theory is a model of decision making and strategy under differing conditions of uncertainty. Games are defined as strategic interactions between players, where strategy refers to a complete plan of action including all prospective play options as well as the player's associated outcome preferences. The formal predicted strategy for solving a game is referred to as a solution. The purpose of game theory is to explore differing solutions (i.e., tactics) among players within games of strategy that obtain a maximum of utility. In game theory parlance, ''utility'' refers to preferred outcomes that may vary among individual players. John von Neumann and Oskar Morgenstern seeded game theory as an economic explanatory construct for all endeavors of the individual to achieve maximum utility or, in economic terms, profit; this is referred to as a maximum. Since its inception in 1944, game theory has become an accepted multidisciplinary model for social exchange in decision making within the spheres of biology, sociology, political science, business, and psychology. The discipline of psychology has embraced applied game theory as a model for conflict resolution between couples, within families, and between hostile countries; as such, it is also referred to as the theory of social situations.

Classic game theory as proposed by von Neumann and Morgenstern is a mathematical model founded in utility theory, wherein the game player's imagined outcome preferences can be combined and weighted by their probabilities. These outcome preferences can be quantified and are therefore labeled utilities. A fundamental assumption of von Neumann and Morgenstern's game theory is that the game player or decision maker has clear preferences and expectations. Each player is presumed rational in his or her choice behavior, applying logical heuristics in weighing all choice options, thereby formulating
his or her game strategy in an attempt to optimize the outcome by solving for the maximum. These game strategies may or may not be effective in solving for the maximum; however, the reasoning in finding a solution must be sound.

Game Context

Three important contextual qualities of a game involve whether the game is competitive or noncompetitive, the number of players involved in the game, and the degree to which all prior actions are known.

Games may be either among individuals, wherein they are referred to as competitive games, or between groups of individuals, typically characterized as noncompetitive games. The bulk of game theory focuses on competitive games of conflict. The second contextual quality of a game involves player or group number. Although there are situations in which a single decision maker must choose an optimal solution without reference to other game players (i.e., human against nature), generally, games are between two or more players. Games of two players or groups of players are referred to as two-person or two-player (where ''player'' may reflect a single individual or a single group of individuals) games; these kinds of games are models of social exchange. In a single-person model, the decision maker controls all variables in a given problem; the challenge in finding an optimal outcome (i.e., maximum) is in the number of variables and the nature of the function to be maximized. In contrast, in two-person, two-player or n-person, n-player games (where ''n'' is the actual number of persons or groups greater than two), the challenge of optimization hinges on the fact that each participant is part of a social exchange, where each player's outcome is interdependent on the actions of all other players. The variables in a social exchange economy are the weighted actions of all other game players.

The third important quality of a game is the degree to which players are aware of other players' previous actions or moves within a game. This awareness is referred to as information, and there are two kinds of game information: perfect and imperfect. Games of perfect information are those in which all players are aware of all actions during the game; in other words, if Player A moves a pawn along a chess board, Player B can track that pawn throughout the game. There are no unknown moves. This is referred to as perfect information because there is no uncertainty, and thus, games of perfect information have few conceptual problems; by and large, they are considered technical problems. In contrast, games of imperfect information involve previously unknown game plans; consequently, players are not privy to all previously employed competitive strategies. Games of imperfect information require players to use Bayesian interpretations of others' actions.

The Role of Equilibrium

Equilibrium refers to a stable outcome of a game associated with two or more strategies and, by extension, two or more players. In an equilibrium state, player solutions are balanced: the resources demanded and the resources available are equal, which means that one of the two parties will not optimize. John Forbes Nash provided a significant contribution to game theory by proposing a conceptual solution to analyze strategic interactions, and consequently the strategic options for each game player, in what has come to be called Nash equilibrium. This equilibrium is a static state, such that all players are solving for optimization and none of the players benefit from a unilateral strategy change. In other words, in a two-person game, if player A changes strategy and player B does not, player A has departed from optimization; the same would be true if player B changed strategy in the absence of a strategy change by player A. Nash equilibrium of a strategic game is considered stable because all players are deadlocked: their interests are evenly balanced, and in the absence of some external force, like a compromise, they are unlikely to change their tactical plan. Nash equilibrium among differing strategic games has become a heavily published area of inquiry within game theory.

Types of Games

Games of strategy are typically categorized as zero-sum games (also known as constant-sum
games), nonzero-sum competitive games, and nonzero-sum cooperative games; within this latter category are also bargaining games and coalitional games.

Zero-Sum Games

A defining feature of zero-sum games is that they are inherently win-lose games. Games of strategy are characterized as zero-sum or constant-sum games if the additive gain of all players is equal to zero. Two examples of zero-sum games are a coin toss or a game of chicken. Coin tosses are strictly competitive zero-sum games, where a player calls the coin while it is still aloft. The probability of a head or a tail is exactly 50:50. In a coin toss, there is an absence of Nash equilibrium; given that there is no way to anticipate accurately what the opposing player will choose, nor is it possible to predict the outcome of the toss, there exists only one strategic option—choose. Consequently, the payoff matrix in a coin toss contains only two variables: win or lose. If Player A called the toss inaccurately, then his or her net gain is −1, and Player B's gain was +1. There is no draw. Player A is the clear loser, and Player B is the clear winner. However, not all zero-sum outcomes are mutually exclusive; draws can occur, for example, in a two-player vehicular game of chicken, where there are two car drivers racing toward each other. The goal for both drivers is to avoid yielding to the other driver; the first to swerve away from the impending collision has lost the game. In this game, there are four possible outcomes: Driver A yields, Driver B yields, neither Driver A nor B yields, or both Driver A and Driver B simultaneously swerve. However, for Driver A, there are only two strategic options: optimize his or her outcome (+1) or optimize the outcome for Driver B (−1); Player B has these same diametrically opposed options. Note that optimizing for the maximum is defined as winning. Yet surviving in the absence of a win is losing, and dying results in a forfeited win, so there is no Nash equilibrium in this zero-sum game, either. Table 1 reflects the outcome matrix for each driver in a game of chicken, where the numeric values represent wins (+) and losses (−).

Table 1   Outcome Matrix for a Standard Game of Chicken

                                 Driver B
                      Yield                    Maintain
Driver A   Yield      0, 0                     −1, +1 (Driver B wins)
           Maintain   +1, −1 (Driver A wins)   0, 0

NonZero-Sum Competitive Games

Nonzero-sum games of strategy are characterized as situations in which the additive gain of all players is either more than or less than zero. Nonzero-sum games may yield situations in which players are compelled by probability of failure to depart from their preferred strategy in favor of another strategy that does the least violence to their outcome preference. This kind of decisional strategy is referred to as a minimax approach, as the player's goal is to minimize his or her maximum loss. However, the player may also select a decisional strategy that yields a small gain. In this instance, he or she selects a strategic solution that maximizes the minimum gain; this is referred to as a maximin. In two-person, nonzero-sum games, if the maximin for one player and the minimax for another player are equal, then the two players have reached Nash equilibrium. In truly competitive zero-sum games, there can be no Nash equilibrium. However, in nonzero-sum games, Nash equilibria are frequently achieved; in fact, an analogous nonzero-sum game of zero-sum chicken results in two Nash equilibria. Biologists and animal behaviorists generally refer to this game as Hawk-Dove in reference to the aggressive strategies employed by the two differing species of birds.

Hawk-Dove games are played by a host of animal taxa, including humans. Generally, the context for Hawk-Dove involves conspecific species competing for indivisible resources such as access to mating partners, such that an animal must employ an aggressive Hawk display and attack strategy or a noncommittal aggressive display reflective of a Dove strategy. The Hawk tactic is a show of aggressive force in conjunction with a commitment
to follow through on the aggressive display with an attack. The Dove strategy employs a display of aggressive force without a commitment to follow up the show, thereby fleeing in response to a competitive challenge. The goal for any one animal is to employ a Hawk strategy while the competitor uses a Dove strategy. Two Hawk strategies result in combat, although in theory, the escalated aggression will result in disproportionate injury because the animals will have unequal combative skills; hence, this is not truly a zero-sum game. Despite this, it is assumed that the value of the disputed resource is less than the cost of combat. Therefore, two Hawk strategies result in a minimax and two Dove strategies result in a maximin. In this situation, the pure strategy of Hawk-Dove will be preferred for each player, thereby resulting in two Nash equilibria for each conspecific (Hawk, Dove and Dove, Hawk). Table 2 reflects the outcome matrix for the Hawk-Dove strategies, where a 4 represents the greatest risk-reward payoff and a 1 reflects the lowest payoff.

Table 2   Outcome Matrix for the Hawk-Dove Game

                              Conspecific B
                        Hawk        Dove
Conspecific A   Hawk    1, 1        4, 2
                Dove    2, 4        3, 3

Although game theorists are not necessarily interested in the outcome of the game as much as the strategy employed to solve for the maximum, it should be noted that pure strategies (i.e., always Dove or always Hawk) are not necessarily the most favorable approach to achieving optimization. In the case of Hawk-Dove, a mixed approach (i.e., randomization of the different strategies) is the most evolutionarily stable strategy in the long run.

Nonzero-sum games are frequently social dilemmas wherein private or individual interests are at odds with those of the collective. A classic two-person, nonzero-sum social dilemma is Prisoner's Dilemma. Prisoner's Dilemma is a strategic game in which the police concurrently interrogate two criminal suspects in separate rooms. In an attempt to collect more evidence supporting their case, the police strategically set each suspect in opposition. The tactic is to independently offer each prisoner the same deal. The Prisoner's Dilemma prisoner has two options: cooperate (i.e., remain silent) or defect (i.e., confess). Each prisoner's outcome is dependent not only on his or her behavior but also on the actions of his or her accomplice. If Prisoner A defects (confesses) while Prisoner B cooperates (remains silent), Prisoner A is freed and turns state's evidence, and Prisoner B receives a full 10-year prison sentence. In this scenario, the police have sufficient evidence to convict both prisoners on a lighter sentence without their shared confessions, so if Prisoners A and B both fail to confess, they both receive a 5-year sentence. However, if both suspects confess, they each receive the full prison sentence of 10 years. Table 3 reflects the Prisoner's Dilemma outcome matrix, where the numeric values represent years in prison.

Table 3   Outcome Matrix for the Prisoner's Dilemma

                              Prisoner B
                          Defect       Cooperate
Prisoner A   Defect       5, 5         0, 10
             Cooperate    10, 0        10, 10

This example of Prisoner's Dilemma does contain a single Nash equilibrium (defect, defect), where both suspects optimize by betraying their accomplice, providing the police with a clear advantage in extracting confessions.

NonZero-Sum Cooperative Games

Although game theory typically addresses situations in which players have conflicting interests, one way to maximize may be to modify one's strategy to compromise or cooperate to resolve the conflict. Nonzero-sum games within this category broadly include Tit-for-Tat, bargaining games, and coalition games. In nonzero-sum cooperative games, the emphasis is no longer on individual optimization; the maximum includes the optimization interests of other players or groups of players. This equalizes the distribution of resources among two or more players. In cooperative games, the focus shifts from more individualistic concept solutions to group solutions. In game theory parlance, players attempt to maximize their minimum loss, thereby selecting the
maximin and distributing the resources to one or more other players. Generally, these kinds of cooperative games occur between two or more groups that have a high probability of repeated interaction and social exchange. Broadly, nonzero-sum cooperative games use the principle of reciprocity to optimize the maximum for all players. If one player uses the maximin, the counter response should follow the principle of reciprocity and respond in kind. An example of this kind of nonzero-sum cooperative game is referred to as Tit-for-Tat.

Anatol Rapoport submitted Tit-for-Tat as a solution to a computer challenge posed by University of Michigan political science professor Robert Axelrod in 1980. Axelrod solicited the most renowned game theorists of academia to submit solutions for an iterated Prisoner's Dilemma, wherein the players (i.e., prisoners) were able to retaliate in response to the previous tactic of their opposing player (i.e., accomplice). Rapoport's Tit-for-Tat strategy succeeded in demonstrating optimization. Tit-for-Tat is a payback strategy typically between two players and founded on the principle of reciprocity. It begins with an initial cooperative action by the first player; henceforth, all subsequent actions reflect the last move of the second player. Thus, if the second player responds cooperatively (Tat), then the first player responds in kind (Tit) ad infinitum. Tit-for-Tat was not necessarily a new conflict resolution approach when submitted by Rapoport, but one favored historically as a militaristic strategy under a different name, equivalent retaliation, which reflected a Tit-for-Tat approach. Each approach is highly vulnerable; it is effective only as long as all players are infallible in their decisions. Tit-for-Tat fails as an optimum strategy in the event of an error. If a player makes a mistake and accidentally defects from the cooperative concept solution to a competitive action, then conflict ensues and the nonzero-sum cooperative game becomes one of competition. A variant of Tit-for-Tat, Tit-for-Two-Tats, is more effective in optimization as it reflects a magnanimous approach in the event of an accidental escalation. In this strategy, if one player errs through a competitive action, the second player responds with a cooperative counter, thereby inviting remediation to the first player. If, after the second Tat, the first player has not corrected his or her strategy back to one of cooperation, then the second player responds with a retaliatory counter.

Another type of nonzero-sum cooperative game falls within the class of games of negotiation. Although these are still games of conflict and strategy, as a point of disagreement exists, the game is the negotiation itself. Once the bargain is proffered and accepted, the game is over. The most simplistic of these games is the Ultimatum Game, wherein two players discuss the division of a resource. The first player proposes the apportionment, and the second player can either accept or reject the offer. The players have one turn at negotiation; consequently, if the first player values the resource, he or she must make an offer that is perceived, in theory, to be reasonable by the second player. If the offer is refused, there is no second trial of negotiations and neither player receives any of the resource. The proposal is actually an ultimatum of ''take it or leave it.'' The Ultimatum Game is also a political game of power. The player who proposes the resource division may offer an unreasonable request, and if the second player maintains less authority or control, the lack of any resource may be so unacceptable that both players yield to the ultimatum. In contrast to the Ultimatum Game, the bargaining game of alternating offers is a game of repeated negotiation and perfect information, where all previous negotiations may be referenced and revised, and where the players enter a state of equilibrium through a series of trials consisting of offers and counteroffers until eventually an accord is reached and the game concludes.

A final example of a nonzero-sum cooperative game is a game of coalitions. This is a cooperative game between individuals within groups. Coalitional games are games of consensus, potentially through bargaining, wherein the player strategies are conceived and enacted by the coalitions. Both the individual players and their group coalitions have an interest in optimization. However, the existence of coalitions protects players from defecting individuals; the coalition maintains the power to initiate a concept solution and thus all plays of strategy.

Heide Deditius Island
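The equilibrium claims made for the Hawk-Dove and Prisoner's Dilemma outcome matrices can be verified mechanically. The following Python sketch (our own illustration, not part of the entry, using the payoffs exactly as printed in Tables 2 and 3) brute-forces the pure-strategy Nash equilibria by checking every cell for a profitable unilateral deviation:

```python
# Illustrative sketch: brute-force search for pure-strategy Nash equilibria
# in a two-player outcome matrix. payoffs[(a, b)] = (payoff to A, payoff
# to B); `better` says whether a player prefers higher values (Hawk-Dove)
# or lower ones (years in prison).

def pure_nash(payoffs, strategies, better):
    equilibria = []
    for a in strategies:
        for b in strategies:
            pa, pb = payoffs[(a, b)]
            # A's check: no alternative row strictly improves A's payoff.
            a_ok = all(not better(payoffs[(a2, b)][0], pa) for a2 in strategies)
            # B's check: no alternative column strictly improves B's payoff.
            b_ok = all(not better(payoffs[(a, b2)][1], pb) for b2 in strategies)
            if a_ok and b_ok:
                equilibria.append((a, b))
    return equilibria

# Table 2 (Hawk-Dove): higher payoffs are better.
hawk_dove = {("Hawk", "Hawk"): (1, 1), ("Hawk", "Dove"): (4, 2),
             ("Dove", "Hawk"): (2, 4), ("Dove", "Dove"): (3, 3)}
print(pure_nash(hawk_dove, ["Hawk", "Dove"], lambda new, cur: new > cur))
# -> the two equilibria (Hawk, Dove) and (Dove, Hawk)

# Table 3 (Prisoner's Dilemma; values are years in prison): lower is better.
dilemma = {("Defect", "Defect"): (5, 5), ("Defect", "Cooperate"): (0, 10),
           ("Cooperate", "Defect"): (10, 0), ("Cooperate", "Cooperate"): (10, 10)}
print(pure_nash(dilemma, ["Defect", "Cooperate"], lambda new, cur: new < cur))
# -> the single equilibrium (Defect, Defect)
```

The deviation check is exactly the definition given in the entry: a cell is an equilibrium when neither player can gain by changing strategy while the other stands pat.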
528 Gauss–Markov Theorem
of the errors be completely specified, whereas the Gauss–Markov theorem does not require full specification of the error distribution. Second, the maximum likelihood estimator offers only asymptotic (large sample) properties, whereas the properties of BLU estimators hold in finite (small) samples.

Ordinary Least Squares Estimation

The ordinary least squares estimator calculates β̂ by minimizing the sum of the squared residuals μ̂. However, without further assumptions, one cannot know how accurately OLS estimates β. These further assumptions are provided by the Gauss–Markov theorem.

The OLS estimator has several attractive qualities. First, the Gauss–Markov theorem ensures that it is the BLU estimator given that certain conditions hold, and these properties hold even in small sample sizes. The OLS estimator is easy to calculate and is guaranteed to exist if the Gauss–Markov assumptions hold. The OLS regression line can also be intuitively understood as the expected value of y for a given value of x. However, because OLS is calculated using squared residuals, it is also especially sensitive to outliers, which exert a disproportionate influence on the estimates.

The Gauss–Markov Theorem

The Gauss–Markov theorem specifies conditions under which ordinary least squares estimators are also best linear unbiased estimators. Because these conditions can be specified in many ways, there are actually many different Gauss–Markov theorems. First, there is the theoretical ideal of necessary and sufficient conditions. These necessary and sufficient conditions are usually developed by mathematical statisticians and often specify conditions that are not intuitive or practical to apply. For example, the most widely cited necessary and sufficient condition for the Gauss–Markov theorem, which Simo Puntanen and George Styan refer to as ''Zyskind's condition,'' states in matrix notation that a necessary and sufficient condition for OLS to be BLU with fixed x and nonsingular variance-covariance (dispersion) matrix Ω is the existence of a non-singular matrix Q satisfying the equation Ωx = xQ. Because such complex necessary and sufficient conditions offer little intuition for assessing when a given model satisfies them, in applied work, looser sufficient conditions, which are easier to assess, are usually employed. Because these practical conditions are typically not also necessary conditions, they are generally stricter than is theoretically required for OLS to be BLU. In other words, there may be models that do not meet the sufficient conditions for OLS to be BLU, but where OLS is nonetheless BLU.

Now, this entry turns to the sets of sufficient conditions that are most commonly employed for two different types of regression models: (a) models where x is fixed in repeated sampling, which is appropriate for experimental research, and (b) models where x is allowed to vary from sample to sample, which is more appropriate for observational (nonexperimental) data.

Gauss–Markov Conditions for Experimental Research (Fixed x)

In experimental studies, the researcher has control over the treatment administered to subjects. This means that in repeated experiments with the same size sample, the researcher would be able to ensure that the subjects in the treatment group get the same level of treatment. Because this level of treatment is essentially the value of the independent variable x in a regression model, this is equivalent to saying that the researcher is able to hold x fixed in repeated samples. This provides a much simpler data structure in experiments than is possible in observational data, where the researcher does not have complete control over the value of the independent variable x. The following conditions are sufficient to ensure that the OLS estimator is BLU when x is fixed in repeated samples:

1. Model correctly specified
2. Regressors not perfectly collinear
3. E(μ) = 0
4. Homoscedasticity
5. No serial correlation
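Under these conditions, the behavior of OLS is easy to demonstrate by simulation. The following Python sketch (our own illustration, not from the entry) holds x fixed across repeated samples, redraws mean-zero, homoscedastic, uncorrelated errors each time, and shows that the OLS estimates average out to the true coefficients:

```python
# Illustrative sketch: OLS under the fixed-x Gauss-Markov conditions.
# The design matrix x is held fixed across repeated samples; only the
# errors (mean-zero, homoscedastic, uncorrelated) are redrawn.
import numpy as np

rng = np.random.default_rng(0)
n = 50
beta = np.array([2.0, 0.5])                               # true coefficients
x = np.column_stack([np.ones(n), np.linspace(0, 10, n)])  # fixed regressors

def ols(x, y):
    """beta_hat = (x'x)^{-1} x'y, minimizing the sum of squared residuals."""
    return np.linalg.solve(x.T @ x, x.T @ y)

# Repeated sampling with the same x: only the errors change.
estimates = []
for _ in range(2000):
    y = x @ beta + rng.normal(0.0, 1.0, size=n)  # errors satisfy 3-5
    estimates.append(ols(x, y))

# Unbiasedness: the average estimate approaches the true beta.
print(np.mean(estimates, axis=0))  # close to [2.0, 0.5]
```

The same simulation with heteroscedastic or serially correlated errors would still average out to beta (unbiasedness survives), but with larger spread, illustrating the efficiency part of the BLU claim.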
heteroscedasticity because it can take many different forms. For example, in its simplest form, the error of one observation is correlated only with the error in the next observation. For such processes, Aitken's generalized least squares can be used to achieve BLU estimates if the other Gauss–Markov assumptions hold. If, however, errors are associated with the errors of more than one other observation at a time, then more sophisticated time-series models are more appropriate, and in these cases, the Gauss–Markov theorem cannot be applied. Fortunately, if the sample is drawn randomly, then the errors automatically will be uncorrelated with each other, so that there is no need to worry about serial correlation.

Gauss–Markov Assumptions for Observational Research (Arbitrary x)

A parallel but stricter set of Gauss–Markov assumptions is typically applied in practice in the case of observational data, where the researcher cannot assume that x is fixed in repeated samples.

1. Model correctly specified
2. Regressors not perfectly collinear
3. E(μ|x) = 0
4. Homoscedastic errors, E(μ²|x) = σ²
5. No serial correlation, E(μᵢμⱼ|x) = 0

Gauss–Markov theorem to observational data, perhaps to reiterate the potential specification problem of omitted confounding variables when there is not random assignment to treatment and control groups.

Homoscedastic Errors, E(μ²|x) = σ²

The restriction on heteroscedasticity of the errors is also strengthened in the case where x is not fixed. In the fixed-x case, the error of each observation was required to have the same variance. In the arbitrary-x case, the errors are also required to have the same variance across all possible values of x. This is tantamount to requiring that the variance of the errors not be a (linear or nonlinear) function of x. Again, as in the fixed-x case, violations of this assumption will still yield unbiased estimates of regression coefficients as long as the first three assumptions hold. But such heteroscedasticity will yield inefficient estimates unless the heteroscedasticity is addressed in the way discussed in the fixed-x section.

No Serial Correlation, E(μᵢμⱼ|x) = 0

Finally, the restriction on serial correlation in the errors is strengthened to prohibit serial correlation that may be a (linear or nonlinear) function of x. Violations of this assumption will still yield unbiased least-squares estimates, but these esti-
5. No serial correlation, E(μi μj |xÞ ¼ 0 mates will not have the minimum variance among
all unbiased linear estimators. In particular, it is
The first two assumptions are exactly the same as possible to reduce the variance by taking into
in the fixed-x case. The other three sufficient con- account the serial correlation in the weighting of
ditions are augmented so that they hold condi- observations in least squares. Again, if the sample
tional on the value of x: is randomly drawn, then the errors will automati-
cally be uncorrelated with each other. In time-
E(μjxÞ ¼ 0
series data in particular, it is usually inappropriate
In contrast to the fixed-x case, E(μ|xÞ ¼ 0 is to assume that the data are drawn as a random
a very strong assumption that means that in addi- sample, so special care must be taken to ensure
tion to having zero expectation, the errors are not that E(μi μj |xÞ ¼ 0 before employing least squares.
associated with any linear or nonlinear function of In most cases, it will be inapproriate to use least
^ is not only unbiased, but also
x: In this case, β squares for time-series data. However, F. W. McEl-
unbiased conditional on the value of x; a stronger roy has provided a useful set of necessary and suf-
form of unbiasedness than is strictly needed for ficient conditions for the Gauss–Markov theorem.
BLU estimation. Indeed, there is some controversy For models with a y-intercept and a very simple
about whether this assumption must be so much form of serial correlation known as exchangeabil-
stronger than in the fixed-x case, given that ity, the OLS estimator is still the best linear unbi-
the model is correctly specified. Nevertheless, ased estimator. A useful necessary and sufficient
E(μ|x) ¼ 0 is typically used in applying the condition for the Gauss–Markov theorem in the
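The practical content of these conditions can be illustrated with a small simulation. The sketch below (illustrative design values and error distribution, not taken from the entry) holds x fixed across repeated samples, draws errors that satisfy the zero-mean, homoscedastic, serially uncorrelated conditions, and confirms that the OLS estimates average out to the true coefficients, as the theorem's unbiasedness claim implies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design: x is held constant across repeated samples,
# as in a controlled experiment (hypothetical values).
x = np.linspace(0, 10, 25)
X = np.column_stack([np.ones_like(x), x])  # intercept + regressor
beta = np.array([2.0, 0.5])                # true coefficients

estimates = []
for _ in range(5000):
    # Errors satisfy the listed conditions: zero mean,
    # homoscedastic, serially uncorrelated.
    mu = rng.normal(0.0, 1.0, size=x.size)
    y = X @ beta + mu
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate
    estimates.append(b)

# Unbiasedness: the average of the OLS estimates over repeated
# samples is close to the true beta.
mean_b = np.mean(estimates, axis=0)
print(mean_b)
```

Changing the error draw to violate one of the conditions (e.g., making the error variance a function of x) leaves the average estimate unchanged but inflates its sampling variance, which is the efficiency part of the theorem.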
Generalizability Theory 533
score is accurate. The universe score is a G-theory analogue of the true score in CTT and is defined as the average score a candidate would have obtained across an infinite number of testings under measurement conditions that the investigator is willing to accept as exchangeable with one another (called randomly parallel measures). Suppose, for example, that an investigator has a large number of vocabulary test items. The investigator might feel comfortable treating these items as randomly parallel measures because trained item writers have carefully developed these items to target a specific content domain, following test specifications. The employment of randomly parallel measures is a key assumption of G theory. Note the difference of this assumption from the CTT assumption, where sets of scores that are involved in a reliability calculation must be statistically parallel measures (i.e., two sets of scores must share the same mean, the same standard deviation, and the same correlation to a third measure).

Observed test scores can vary for a number of reasons. One reason may be the true differences across candidates in terms of the ability of interest (called the object of measurement). Other reasons may be the effects of different sources of measurement error: Some are systematic (e.g., item difficulty), whereas others are unsystematic (e.g., fatigue). In estimating a candidate's universe score, one cannot test the person an infinite number of times in reality. Therefore, one always has to estimate a candidate's universe score based on a limited number of measurements available. In G theory, a systematic source of variability that may affect the accuracy of the generalization one makes is called a facet. There are two types of facets. A facet is random if the intention is to generalize beyond the conditions actually used in an assessment. In this case, measurement conditions are conceptualized as a representative sample of a much larger population of admissible observations (called the universe in G theory). Alternatively, a facet is fixed when there is no intention to generalize beyond the conditions actually used in the assessment, because either the set of measurement conditions exhausts all admissible observations in the universe or the investigator has chosen specific conditions on purpose.

Different sources of measurement error are analyzed in a two-step procedure. The first step is a generalizability study (G study), where the observed score variance is decomposed into pieces attributable to different sources of score variability called variance components associated with a facet(s) identified by the investigator.

As shown in detail in the numerical example below, G-study variance component estimates are typically obtained by fitting a random-effects ANOVA model to data. The primary purpose of this analysis is to obtain mean squares for different effects that are needed for the calculation of variance component estimates. Variance component estimates are key building blocks of G theory. A G-study variance component estimate indicates the magnitude of the effect of a given source of variability on the observed score variance for a hypothetical measurement design where only a single observation is used for testing (e.g., a test consisting of one item).

The G-study variance component estimates are then used as the baseline data in the second step of the analysis called a decision study (D study). In a D study, variance components and measurement reliability can be estimated for a variety of hypothetical measurement designs (for instance, a test consisting of multiple items) and types of score interpretations of interest.

A Numerical Example: One-Facet Crossed Study Design

Suppose that an investigator wants to analyze results of a grammar test consisting of 40 items administered to 60 students in a French language course. Because these items have been randomly selected from a large pool of items, the investigator defines items as a random facet. In this test, all candidates (persons) complete all items. In G-theory terms, this study design is called a one-facet study design because it involves only one facet (items). Moreover, persons and items are called crossed because for each person, scores for all items are available (denoted p × i, where the "×" is read "crossed with").

For this one-facet crossed study design, the observed score variance is decomposed into three variance components:

1. Person variance component [σ²(p)]: The observed score variance due to the true
[Figure: decomposition of the observed score variance into the person variance component σ²(p), the item variance component σ²(i), and the residual σ²(pi,e)]

Table 1  ANOVA Table for a One-Facet Crossed Study Design

Source       Degrees of Freedom (df)   Sum of Squares (SS)   Mean Squares (MS)
Persons (p)  59                        96.943                1.643
Items (i)    39                        173.429               4.457
pi,e         2,301                     416.251               0.181
Total        2,399                     686.623
a single observation. As can be seen in the table, the person, item, and residual variance components account for 12.8%, 24.6%, and 62.6% of the total score variance, respectively.

Based on the G-study results above, a D study can be conducted to estimate score reliability for an alternative measurement design. As in CTT, where the Spearman-Brown prophecy formula is used to estimate test reliability for different test lengths, one can estimate the measurement reliability for a test involving different numbers of items. As an example, the right panel of Table 2 shows the D-study results for 50 items. First, D-study variance component estimates for this measurement design are obtained by dividing the G-study variance component estimates associated with the facet of measurement [i.e., σ²(i) and σ²(pi,e) in this case] by the D-study sample size for the item facet (n′ᵢ = 50).

Second, a summary index of reliability, similar to what one might obtain in a CTT analysis, can be calculated for the 50-item scenario. G theory provides two types of reliability-like indexes for different score interpretations: a generalizability coefficient (denoted Eρ²) for relative decisions, and an index of dependability (denoted φ, often called phi coefficient) for absolute decisions. These coefficients are obtained in two steps. First, the error variance appropriate for the type of decision is calculated. For relative decisions, all variance components involving persons, except the object of measurement [σ²(p)], contribute to the relative error variance [σ²(Rel)]. Thus, for this one-facet crossed study example, only the residual variance component [σ²(pi,e)] contributes to the relative error variance; hence, σ²(Rel) = σ²(pi,e) = 0.004. For absolute decisions, all variance components except that for the object of measurement [σ²(p)] contribute to the absolute error variance [σ²(Abs)]. In this example, both σ²(i) and σ²(pi,e) will contribute to the absolute error variance. Thus, σ²(Abs) = σ²(i) + σ²(pi,e) = 0.001 + 0.004 = 0.005. Second, a G-coefficient or a phi-coefficient is obtained by dividing the variance component due to the object of measurement [σ²(p)], which is also called the universe-score variance, by the sum of itself and the appropriate type of error variance. Thus, for this one-facet crossed study example, the G- and phi-coefficients are calculated as follows:

Eρ² = σ²(p)/[σ²(p) + σ²(Rel)] = 0.037/(0.037 + 0.004) = 0.902

φ = σ²(p)/[σ²(p) + σ²(Abs)] = 0.037/(0.037 + 0.005) = 0.881

G theory is conceptually related to CTT. Under certain conditions, CTT and G theory analyses yield identical results. This is the case when a one-facet crossed study design is employed for relative decisions. Thus, for example, the G-coefficient obtained from the one-facet D study with 50 items above is identical to Cronbach's alpha for the same number of items.

Other Study Designs

The numerical example above is one of the simplest designs that can be implemented in a G-theory data analysis. Below are some examples of crossed study designs involving multiple random facets as well as other study designs involving a nested facet or a fixed facet.
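The G- and D-study computations described above can be reproduced directly from the mean squares in Table 1. The following sketch uses the entry's numbers; the rounding of components to three decimals mirrors the rounded values the entry carries into the coefficients:

```python
# One-facet crossed design (p x i): estimate G-study variance
# components from the ANOVA mean squares in Table 1, then run
# a D study for a 50-item test.
ms_p, ms_i, ms_res = 1.643, 4.457, 0.181   # mean squares from Table 1
n_p, n_i = 60, 40                          # G-study sample sizes

# G-study variance component estimates (random-effects ANOVA).
var_res = ms_res                           # sigma^2(pi,e)
var_p = (ms_p - ms_res) / n_i              # sigma^2(p)
var_i = (ms_i - ms_res) / n_p              # sigma^2(i)

# D study: divide the facet components by the D-study item count,
# rounding to three decimals as in the entry.
n_i_prime = 50
var_p_d = round(var_p, 3)                  # 0.037
var_i_d = round(var_i / n_i_prime, 3)      # 0.001
var_res_d = round(var_res / n_i_prime, 3)  # 0.004

var_rel = var_res_d                        # relative error variance
var_abs = var_i_d + var_res_d              # absolute error variance

g_coef = var_p_d / (var_p_d + var_rel)     # E(rho^2), relative decisions
phi = var_p_d / (var_p_d + var_abs)        # phi, absolute decisions
print(round(g_coef, 3), round(phi, 3))
```

Run as written, the script recovers the entry's values of 0.902 for the G-coefficient and 0.881 for the phi-coefficient.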
A Crossed Study Design With Two Random Facets

As mentioned above, one can take advantage of the strength of G theory when multiple sources of error are modeled simultaneously. Suppose, for example, in a speaking test each student completes three items, and two raters score each student's responses to all three items. In this case, the investigator may identify two facets: items and raters. Persons, items, and raters are crossed with one another because (a) all students complete all items, (b) all students are rated by both raters, and (c) both raters score student responses to all items. This study design is called a two-facet crossed study design (p × r × i).

A Study Design Involving a Nested Facet

Some G theory analyses may be conducted for study designs involving a nested facet. Facet A is nested within Facet B if different, multiple levels of Facet A are associated with each level of Facet B (Shavelson & Webb). Typically, a nested facet is found in two types of situations. The first is when one facet is nested within another facet by definition. A common example is a reading test consisting of groups of comprehension items based on different passages. In this case, the item facet is nested within the reading passage facet because a specific group of items is associated only with a particular passage. The second is a situation where one chooses to use a nested study design, although employing a crossed study design is possible. For instance, collecting data for the two-facet crossed study design example above can be resource intensive because all raters have to score all student responses. In this case, a decision might be made to have different rater pairs score different items to shorten the scoring time. This results in a two-facet study design where raters are nested within items [denoted p × (r : i), where the ":" is read "nested within"].

Study Designs Involving a Fixed Facet

Because G theory is essentially a measurement theory for modeling random effects, at least one facet identified in a study design must be a random effect. Multifacet study designs may involve one or more fixed facets, however. Suppose, in the speaking test described earlier in the two-facet crossed study example, that each candidate response is evaluated on three dimensions: pronunciation, grammar, and fluency. These dimensions can be best conceptualized as the levels in a fixed facet because they have been selected as the scoring criteria on purpose.

There are some alternatives to model such fixed facets in G theory. Whichever approach is employed, the decision for selecting an approach must be made based on careful considerations of various substantive issues. One approach is to conduct a two-facet crossed study (p × r × i) for each dimension separately. This approach is preferred if the investigator believes that the three dimensions are conceptually so different that study results cannot be interpreted meaningfully at the aggregated level, or if the variance component estimates vary widely across the dimensions. Alternatively, one can analyze all dimensions simultaneously by conducting a three-facet crossed study (p × r × i × d, where dimensions, or d, are treated as a fixed facet). This approach is reasonable if variance component estimates averaged across the dimensions can offer meaningful information for a particular assessment context, or if the variance component estimates obtained from separate p × r × i analyses of the dimensions are similar across the dimensions.

Another possible approach is to use multivariate G theory. Although multivariate G theory is beyond the scope of this introduction to G theory, Robert Brennan's 2001 volume in the Further Readings list provides an extensive discussion of this topic.

Computer Programs

Computer programs specifically designed for G theory analyses offer comprehensive output for both G and D studies for a variety of study designs. Brennan's GENOVA Suite offers three programs: GENOVA, urGENOVA, and mGENOVA. GENOVA and urGENOVA handle different study designs for univariate G-theory analyses, on which this entry has focused, whereas mGENOVA is designed for multivariate G-theory analyses.

Yasuyo Sawaki
See also Analysis of Variance (ANOVA); Classical Test Theory; Coefficient Alpha; Interrater Reliability; Random Effects Models; Reliability

Further Readings

Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: ACT.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Webb, N. M., & Shavelson, R. J. (1981). Multivariate generalizability of general educational development ratings. Journal of Educational Measurement, 18(1), 13–22.
Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983). Multivariate generalizability theory. In L. J. Fyans, Jr. (Ed.), Generalizability theory: Inferences and practical applications (pp. 67–81). San Francisco: Jossey-Bass.

GENERAL LINEAR MODEL

The general linear model (GLM) provides a general framework for a large set of models whose common goal is to explain or predict a quantitative dependent variable by a set of independent variables that can be categorical or quantitative. The GLM encompasses techniques such as Student's t test, simple and multiple linear regression, analysis of variance, and covariance analysis. The GLM is adequate only for fixed-effect models. In order to take into account random-effect models, the GLM needs to be extended and becomes the mixed-effect model.

Notations

Vectors are denoted with boldface lower-case letters (e.g., y), and matrices are denoted with boldface upper-case letters (e.g., X). The transpose of a matrix is denoted by the superscript T, and the inverse of a matrix is denoted by the superscript −1. There are I observations. The values of a quantitative dependent variable describing the I observations are stored in an I by 1 vector denoted y. The values of the independent variables describing the I observations are stored in an I by K matrix denoted X. K is smaller than I, and X is assumed to have rank K (i.e., X is full rank on its columns). A quantitative independent variable can be directly stored in X, but a qualitative independent variable needs to be recoded with as many columns as there are degrees of freedom for this variable. Common coding schemes include dummy coding, effect coding, and contrast coding.

Core Equation

For the GLM, the values of the dependent variable are obtained as a linear combination of the values of the independent variables. The coefficients of the linear combination are stored in a K by 1 vector denoted b. In general, the values of y cannot be perfectly obtained by a linear combination of the columns of X, and the difference between the actual and the predicted values is called the prediction error. The values of the error are stored in an I by 1 vector denoted e. Formally, the GLM is stated as

y = Xb + e. (1)

The predicted values are stored in an I by 1 vector denoted ŷ, and therefore, Equation 1 can be rewritten as

y = ŷ + e, with ŷ = Xb. (2)

Putting together Equations 1 and 2 shows that

e = y − ŷ. (3)

Additional Assumptions

The independent variables are assumed to be fixed variables (i.e., their values will not change for a replication of the experiment analyzed by the GLM, and they are measured without error). The error is interpreted as a random variable; in addition, the I components of the error are assumed to be independently and identically distributed (i.i.d.), and their distribution is assumed to be a normal distribution with a zero mean and a variance denoted σ²ₑ. The values of the dependent variable are assumed to be a random sample of a population of interest. Within this framework,
Least Square Estimate

Under the assumptions of the GLM, the population parameter vector β is estimated by b, which is computed as

b = (XᵀX)⁻¹Xᵀy. (4)

Sums of Squares

The total sum of squares of y is denoted SS_total, and it is computed as

SS_total = yᵀy. (5)

Using Equation 2, the total sum of squares can be rewritten as

SS_total = yᵀy = (ŷ + e)ᵀ(ŷ + e) = ŷᵀŷ + eᵀe + 2ŷᵀe, (6)

but it can be shown that 2ŷᵀe = 0, and therefore, Equation 6 becomes

SS_total = yᵀy = ŷᵀŷ + eᵀe. (7)

The first term of Equation 7 is called the model sum of squares and is denoted SS_model. It is equal to

SS_model = ŷᵀŷ = bᵀXᵀXb. (8)

The second term of Equation 7 is called the residual or the error sum of squares and is denoted SS_residual. It is equal to

SS_residual = eᵀe = (y − Xb)ᵀ(y − Xb). (9)

Sampling Distributions of the Sums of Squares

Under the assumptions of normality and i.i.d. for the error, the ratio of the residual sum of squares to the error variance, SS_residual/σ²ₑ, is distributed as a χ² with a number of degrees of freedom of ν = I − K − 1. This is abbreviated as

SS_residual/σ²ₑ ∼ χ²(ν). (10)

By contrast, the ratio of the model sum of squares to the error variance, SS_model/σ²ₑ, is distributed as a noncentral χ² with ν = K degrees of freedom and noncentrality parameter λ:

SS_model/σ²ₑ ∼ χ²(ν, λ). (11)

From Equations 10 and 11, it follows that the ratio

F = (SS_model/σ²ₑ)/(SS_residual/σ²ₑ) × (I − K − 1)/K = (SS_model/SS_residual) × (I − K − 1)/K (12)

is distributed as a noncentral Fisher's F with ν₁ = K and ν₂ = I − K − 1 degrees of freedom and noncentrality parameter equal to

λ = βᵀXᵀXβ/σ²ₑ.

In the specific case when the null hypothesis of interest states that H₀: β = 0, the noncentrality parameter vanishes, and then the F ratio from Equation 12 follows a standard (i.e., central) Fisher's distribution with ν₁ = K and ν₂ = I − K − 1 degrees of freedom.

Test on Subsets of the Parameters

Often, one is interested in testing only a subset of the parameters. When this is the case, the I by K matrix X can be interpreted as composed of two blocks: an I by K₁ matrix X₁ and an I by K₂ matrix X₂, with K = K₁ + K₂. This is expressed as

X = [X₁ X₂]. (13)
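Equations 4 through 9 and the F ratio of Equation 12 can be computed directly with a few lines of linear algebra. The sketch below uses illustrative data that are not from the entry; K counts the regressors, and X also carries an intercept column, so the residual degrees of freedom are I − K − 1 as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (hypothetical) data: I observations, K = 2 regressors,
# plus an intercept column stored in X.
I, K = 30, 2
X = np.column_stack([np.ones(I), rng.normal(size=(I, K))])
beta = np.array([1.0, 0.8, -0.5])
y = X @ beta + rng.normal(0.0, 1.0, size=I)

# Equation 4: b = (X^T X)^(-1) X^T y, the least squares estimate.
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b             # predicted values (Equation 2)
e = y - y_hat             # prediction error (Equation 3)

ss_total = y @ y          # Equation 5
ss_model = y_hat @ y_hat  # Equation 8
ss_residual = e @ e       # Equation 9

# The decomposition of Equation 7 holds numerically
# because y_hat and e are orthogonal.
assert np.isclose(ss_total, ss_model + ss_residual)

# Equation 12: F ratio with nu1 = K and nu2 = I - K - 1.
F = (ss_model / ss_residual) * (I - K - 1) / K
print(b, F)
```

Solving the normal equations with `np.linalg.solve` is shown here to match Equation 4; in practice `np.linalg.lstsq` is numerically safer when XᵀX is nearly singular.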
Vector b is partitioned in a similar manner as

b = [b₁; b₂]. (14)

When the null hypothesis is true, F_b₂|b₁ follows a Fisher's F distribution with ν₁ = K₂ and ν₂ = I − K − 1 degrees of freedom, and therefore, F_b₂|b₁ can be used to test the null hypothesis that β₂ = 0.
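The subset test can be sketched as follows. The statistic below is the standard incremental (partial) F, computed from the residual sums of squares of the full model and of the model containing only X₁; the data are hypothetical, and the null distribution F(K₂, I − K − 1) is the one the entry states:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: K1 regressors in X1 and K2 in X2; an intercept
# column is fitted as well, so the full model has I - K - 1 residual
# degrees of freedom with K = K1 + K2, matching the text.
I, K1, K2 = 40, 1, 1
ones = np.ones((I, 1))
X1 = rng.normal(size=(I, K1))
X2 = rng.normal(size=(I, K2))
y = 1.0 + 0.5 * X1[:, 0] + rng.normal(size=I)  # true beta2 = 0

def rss(design, y):
    """Residual sum of squares of a least squares fit."""
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    e = y - design @ b
    return float(e @ e)

K = K1 + K2
full = np.hstack([ones, X1, X2])
reduced = np.hstack([ones, X1])

# Incremental F statistic for H0: beta2 = 0; under H0 it follows
# Fisher's F with nu1 = K2 and nu2 = I - K - 1.
F_b2 = ((rss(reduced, y) - rss(full, y)) / K2) / (rss(full, y) / (I - K - 1))
print(F_b2)
```

Because the reduced model is nested in the full model, the numerator is never negative, and large values of the statistic argue against β₂ = 0.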
effects and the random effects. This is done with mixed-effects models.

Another obvious limit of the general linear model is that it models only linear relationships. In order to include some nonlinear models (such as logistic regression), the GLM needs to be extended to the class of the generalized linear models.

Hervé Abdi

GRAPHICAL DISPLAY OF DATA

Graphs and charts are also used to enhance reporting and communication. Graphical displays often provide vivid color and bring life to documents, while also simplifying complex narrative and data. This entry discusses the importance of graphs, describes common techniques for presenting data graphically, and provides information on creating effective graphical displays.
Common Graphical Displays for Reporting

Bar Chart

Bar charts are one of the most commonly used techniques for presenting data and are considered to be one of the easiest diagrams to read and interpret. They are used to display frequency distributions for categorical variables. In bar chart displays, the value of the observation is proportional to the length of the bar; each category of the variable is represented by a separate bar; and the categories of the variable are generally shown along the horizontal axis, whereas the number of each category is shown on the vertical axis. Bar charts are quite versatile; they can be adapted to incorporate displays of both negative and positive data on the same chart (e.g., profits and losses across years). They are particularly useful for comparing groups and for showing changes over time. Bar charts should generally not contain more than 8–10 categories or they will become cluttered and difficult to read. When more than 10 categories are involved in data analysis, rotated bar charts or line graphs should be considered instead.

Pie Chart

A pie chart is a circle divided into sectors or slices, where the sectors of the pie are proportional to the whole. The entire pie represents 100%. Pie charts are used to display categorical data for a single variable. They are quite popular in journalistic and business reporting. However, these charts can be difficult to interpret unless percentages and/or other numerical information for each slice are shown on the diagram. A good pie chart should have no more than eight sectors or it will become too crowded. One solution is to group several smaller slices into a category called "Other." When color is being used, red and green should not be located on adjacent slices, because some people are color-blind and cannot distinguish red from green. When patterns are used, it is important to ensure that optical illusions are not created on adjacent slices or the data may be misinterpreted.

Line Graph

A line graph shows the relationship between two variables by connecting the data points on a grid, with a line moving from left to right, on the diagram. When several time series lines are being plotted, and color is not being used, pronounced symbols along the lines can help to draw attention to the different variables. For example, a diamond (♦) can be used to represent all the data points for unemployment, a square (■) for job approval, and so on. Another option is to use solid/dotted/dashed lines to distinguish different variables.

Effective Graphical Displays

The advent of commercial, feature-rich statistical and graphical software such as Excel and IBM® SPSS® (PASW) 18.0 has made the incorporation of professional graphical displays into reports easy and inexpensive. (Note: IBM® SPSS® Statistics was formerly called PASW® Statistics.) Both Excel and SPSS have built-in features that can generate a wide array of graphical displays in mere seconds, using a few point-and-click operations. However, commercial software has also created new problems. For example, some researchers may go overboard and incorporate so many charts and graphs into their writing that the sheer volume of diagrams can make comprehension of the data torturous—rather than enlightening—for the reader. Others may use so many fancy features (e.g., glow, shadows) and design shapes (e.g., cones, doughnuts, radars, cylinders) that diagrams lose their effectiveness in conveying certain information and instead become quite tedious to read. Many readers may become so frustrated that they may never complete reading the document.

An equally problematic issue pertains to distorted and misleading charts and graphs. Some of these distortions may be quite deliberate. For example, sometimes, scales are completely omitted from a graph. In other cases, scales may be started at a number other than zero. Omitting a zero tends to magnify changes. Likewise, "starting time" can also affect the appearance of magnitude. An even worse scenario, however, is when either a "scale" or the "starting time" is adjusted and then combined with a three-dimensional or other fancy graph—this may lead to even greater distortion. Other distortions may simply result from inexperienced persons preparing the diagrams. The resultant effect is that many readers who are not knowledgeable in statistics can be easily misled by such graphs. Thus, when using graphical displays,
meticulous attention should be given to ensuring that the graphs do not emphasize unimportant differences and/or distort or mislead readers.

In order to present effective data, the researcher must be able to identify the salient information from the data. In addition, the researcher must be clear on what needs to be emphasized, as well as the targeted audience for the information. The data must then be presented in a manner that is vivid, clear, and concise. The ultimate goal of effective graphical displays should be to ensure that any data communicated are intelligible and enlightening to the targeted audience. When readers have to spend a great deal of time trying to decipher a diagram, this is a clear indication that the diagram is ineffective.

All graphical displays should include source information. When graphical displays are sourced entirely from other works, written permission is required that will specify exactly how the source should be acknowledged. If graphs are prepared using data that are not considered proprietary, copyright permission need not be sought, but the data source must still be acknowledged. When graphical displays are prepared entirely from the researcher's own data, the source information generally makes reference to the technique/population used to obtain the data (e.g., 2009 Survey of ABC College Students). Source information should be placed at the bottom of the diagram and should be sufficiently detailed to enable the reader to go directly to the source (e.g., Source: General Motors Annual Report, 2009, Page 10, Figure 6—reprinted with permission).

When using graphics, many researchers often concentrate their efforts on ensuring that the salient facts are presented, while downplaying the appearance of displays. Others emphasize appearance over content. Both are important. Eye-catching graphs are useless if they contain little or no useful information. On the other hand, a graph that contains really useful content may get limited reading because of its appearance. Therefore, researchers need to package their reports in a manner that would be appealing to a wider audience. Effective data graphics require a combination of good statistical and graphical design skills, which some researchers may not possess. However, numerous guidelines are available on the Internet and in texts that can assist even a novice to create effective graphs. In addition, the following guidelines can assist in creating informative and effective graphical displays that communicate meaningful information with clarity and precision:

1. Focus on substance—emphasize the important.

2. Ensure that data are coherent, clear, and accurate.

3. Use an appropriate scale that will not distort or mislead.

4. Label the x-axis and y-axis with appropriate labels to aid interpretation [e.g., Temperature (°C); Time (minutes)].

5. Number the graphs/charts, and give them an informative title (e.g., Figure 1: ABC College Course Enrollment 2009).

6. Include the source at the bottom of the diagram.

7. Simplicity is often best. Use three-dimensional and other fancy graphs cautiously—they often distort and/or mislead.

8. Avoid stacked bar charts unless the primary comparison is being made on the data series located on the bottom of the bar.

9. When names are displayed on a label (e.g., countries, universities, etc.), alphabetize data before charting to aid reading.

10. Use statistical and textual descriptions appropriately to aid data interpretation.

11. Use a legend when charts include more than one data series. Locate the legend carefully to avoid reducing the plot area.

12. Appearance is important. Consider using borders with curved edges and three-dimensional effects to enhance graphical displays. Use colors effectively and consistently. Do not color every graph with a different color. Bear in mind that when colored documents are photocopied in black and white, images will be difficult to interpret unless the original document had sharp color contrast. When original documents are being printed in black and white, it may be best to use shades of black and gray or textual patterns.

13. Avoid chart clutter. This confuses and distracts the reader and can often obscure the distribution's shape. For example, if you are charting data for 20 years, show every other
year on the x-axis. Angling labels may create an optical illusion of less clutter.

14. Use readable, clear fonts (e.g., Times New Roman 10 or 12) for labels, titles, scales, symbols, and legends.

15. Ensure that diagrams and legends are not so small that they require a magnifying glass in order to be read.

16. Edit and scale graphs to the desired size in the program in which they were created before transferring them into the word-processed document, to avoid image distortions with resizing.

17. Edit and format graphical displays generated directly from statistical programs before using them.

18. Use gridlines cautiously. They may overwhelm and distract if the lines are too thick. However, faded gridlines can be very effective on some types of graphs (e.g., line charts).

19. Use a specific reference format such as APA style to prepare the document to ensure correct placement of graph titles, and so on.

20. Ensure that graphical displays are self-explanatory—readers should be able to understand them with minimal or no reference to the text and tables.

Nadini Persaud

See also Bar Chart; Column Graph; Cumulative Frequency Distribution; Histogram; Line Graph; Pie Chart

Further Readings

Fink, A. (2003). How to report on surveys (2nd ed.). Thousand Oaks, CA: Sage.
Owen, F., & Jones, R. (1994). Statistics (4th ed.). London: Pitman.
Pallant, J. (2001). SPSS survival manual: A step by step guide to using SPSS. Maidenhead, Berkshire, UK: Open University Press.

GREENHOUSE–GEISSER CORRECTION

When performing an analysis of variance with a one-factor, repeated-measurement design, the effect of the independent variable is tested by computing an F statistic, which is computed as the ratio of the mean square of effect to the mean square of the interaction between the subject factor and the independent variable. For a design with S subjects and A experimental treatments, when some assumptions are met, the sampling distribution of this F ratio is a Fisher distribution with ν1 = A − 1 and ν2 = (A − 1)(S − 1) degrees of freedom.

In addition to the usual assumptions of normality of the error and homogeneity of variance, the F test for repeated-measurement designs assumes a condition called sphericity. Intuitively, this condition indicates that the ranking of the subjects does not change across experimental treatments. This is equivalent to stating that the population correlation (computed from the subjects' scores) between two treatments is the same for all pairs of treatments. This condition implies that there is no interaction between the subject factor and the treatment.

If the sphericity assumption is not valid, then the F test becomes too liberal (i.e., the proportion of rejections of the null hypothesis is larger than the α level when the null hypothesis is true). In order to minimize this problem, Seymour Greenhouse and Samuel Geisser, elaborating on early work by G. E. P. Box, suggested using an index of deviation from sphericity to correct the number of degrees of freedom of the F distribution. This entry first presents this index of nonsphericity (called the Box index, denoted ε) and then presents its estimation and its application, known as the Greenhouse–Geisser correction. This entry also presents the Huynh–Feldt correction, which is a more efficient procedure. Finally, this entry explores tests for sphericity.

Index of Sphericity

Box has suggested a measure for sphericity, denoted ε, which varies between 0 and 1 and reaches the value of 1 when the data are perfectly spherical. The computation of this index is illustrated with the fictitious example given in Table 1, with data collected from S = 5 subjects whose responses were measured for A = 4 different treatments. The
is not significant with the extreme correction but is significant with the standard number of degrees of freedom, then use the ε correction (they recommend using ε̂, but the subsequent ε̃ is currently preferred by many statisticians).

Testing for Sphericity

One incidental question about using a correction for lack of sphericity is to decide when a sample covariance matrix is not spherical. Several tests can be used to answer this question. The most well known is Mauchly's test, and the most powerful is the John, Sugiura, and Nagao test.

Mauchly's Test

J. W. Mauchly constructed a test for sphericity based on the following statistic, which uses the eigenvalues of the estimated covariance matrix:

   W = Π λℓ / [ (1/(A − 1)) Σ λℓ ]^(A − 1).   (8)

This statistic varies between 0 and 1 and reaches 1 when the matrix is spherical. For our example, we find that

   W = (288 × 36 × 12) / [ (1/3)(288 + 36 + 12) ]³ = 124,416 / 1,404,928 ≈ .0886.

Tables for the critical values of W are available in Nagarsenker and Pillai (1973), but a good approximation is obtained by transforming W into

   X²W = −(1 − f) × (S − 1) × ln{W}.   (9)

For our example, we find that

   f = [2(A − 1)² + A + 2] / [6(A − 1)(S − 1)] = (2 × 3² + 4 + 2) / (6 × 3 × 4) = 24/72 ≈ .33

and

   X²W = −(1 − f) × (S − 1) × ln{W} = −4 × (1 − .33) × ln{.0886} ≈ 6.46;

with ν = ½ × 4 × 3 = 6, we find that p = .38, and we cannot reject the null hypothesis. Despite its relative popularity, the Mauchly test is not recommended by statisticians because it lacks power. A more powerful alternative is the John, Sugiura, and Nagao test for sphericity described below.

John, Sugiura, and Nagao Test

According to John E. Cornell, Dean M. Young, Samuel L. Seaman, and Roger E. Kirk, the best test for sphericity uses V. Tables for the critical values of V are available in A. P. Grieve, but a good approximation is obtained by transforming V into

   X²V = ½ S(A − 1)² [ V − 1/(A − 1) ].   (12)

Under the null hypothesis, X²V is approximately distributed as a χ² distribution with ν = ½A(A − 1) − 1 degrees of freedom. For our example, we find that

   X²V = ½ S(A − 1)² [ V − 1/(A − 1) ] = [ (5 × 3²)/2 ] × (1.3379 − 1/3) ≈ 22.60.
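The calculations above are easy to verify numerically. Below is a minimal sketch in Python using the entry's fictitious example (eigenvalues 288, 36, and 12; S = 5, A = 4; V = 1.3379); the closed-form chi-square tail probability is valid here because the degrees of freedom are even:

```python
import math

# Fictitious example from the entry: S = 5 subjects, A = 4 treatments,
# eigenvalues of the estimated covariance matrix.
S, A = 5, 4
eigenvalues = [288.0, 36.0, 12.0]

# Mauchly's W (Equation 8): product of the eigenvalues divided by the
# (A - 1)th power of their mean.
W = math.prod(eigenvalues) / (sum(eigenvalues) / (A - 1)) ** (A - 1)  # ≈ .0886

# Chi-square approximation (Equation 9).
f = (2 * (A - 1) ** 2 + A + 2) / (6 * (A - 1) * (S - 1))  # = 24/72
X2_W = -(1 - f) * (S - 1) * math.log(W)                   # ≈ 6.46

# Tail probability of a chi-square with the entry's nu = (1/2)(4)(3) = 6
# degrees of freedom (closed form for even df).
nu = A * (A - 1) // 2
half = X2_W / 2
p = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(nu // 2))

# John, Sugiura, and Nagao statistic (Equation 12), with V from the entry.
V = 1.3379
X2_V = 0.5 * S * (A - 1) ** 2 * (V - 1 / (A - 1))  # ≈ 22.60
```

With full-precision intermediate values, the tail probability comes out near .37, close to the rounded p = .38 reported in the entry.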
relationships that may be discovered among codes and categories; they are not intended to serve as a checklist for matching with theoretical constructs.

Theoretical codes or constructs, derived by questioning the data, are used to conceptualize relationships among the codes and categories. Each new level of coding requires the researcher to reexamine the raw data to ensure that they are congruent with the emerging theory. Unanswered questions may identify gaps in the data and are used to guide subsequent interviews until the researcher is no longer able to find new information pertaining to that construct or code. Thus, the code is "saturated," and further data collection omits this category, concentrating on other issues. Throughout, as linkages are discovered and recorded in memos, the analyst posits hypotheses about how the concepts fit together into an integrated theory. Hypotheses are tested against further observations and data collection. The hypotheses are not tested statistically but, instead, through this persistent and methodical process of constant comparison.

Hypothesizing a Core Category

Eventually, a core variable that appears to explain the patterns of behavior surrounding the phenomenon of interest becomes evident. This core category links most or all of the other categories and their dimensions and properties together. In most, but not all, grounded theory studies, the core category is a BSP, an "umbrella concept" that appears to explain the essence of the problem for participants and how they attempt to solve it. BSPs may be further subdivided into two types: basic social structural processes and basic social psychological processes.

At this point, however, the core category is only tentative. Further interviews focus on developing and testing this core category by trying to discount it. The researcher presents the theory to new participants and/or previously interviewed participants and elicits their agreement with the theory, further clarification, or refutation. With these new data, the analyst can dispense with open coding and code selectively for the major categories of the BSP. Once satisfied that the theory is saturated and explains the phenomenon, a final literature review is conducted to connect the theory with previous work in the field.

Substantive and Formal Grounded Theories

Two levels of grounded theory (both of which are considered to be middle-range) can be found in the literature. Most are substantive theories, developed from an empirical study of social interaction in a defined setting (such as health care, education, or an organization) or pertaining to a discrete experience (such as having a particular illness, learning difficult subjects, or supervising co-workers). In contrast, formal theories are more abstract and focused on more conceptual aspects of social interaction, such as stigma, status passage, or negotiation. A common way to build formal theory is by the constant comparative analysis of any group of substantive grounded theories that is focused on a particular social variable but enacted under different circumstances, for different reasons, and in varied settings.

Using Software for Data Analysis

Using a software program to manage these complex data can expedite the analysis. Qualitative data analysis programs allow the analyst to go beyond the usual coding and categorizing of data that is possible when analyzing the data by hand. Analysts who use manual methods engage in a cumbersome process that may include highlighting data segments with multicolored markers, cutting up transcripts, gluing data segments onto index cards, filing the data segments that pertain to a particular code or category together, and finally sorting and re-sorting these bits of paper by taping them on the walls. Instead, with the aid of computer programs, coding and categorizing the data are accomplished easily, and categorized data segments can be retrieved readily. In addition, any changes to these procedures can be tracked as ideas about the data evolve. With purpose-built programs (such as NVivo), the researcher is also able to build and test theories and construct matrices in order to discover patterns in the data.

Controversy has arisen over the use of computer programs for analyzing qualitative data. Some grounded theorists contend that using a computer program forces the researcher in particular
directions, confining the analysis and stifling creativity. Those who support the use of computers recognize their proficiency for managing large amounts of complex data. Many grounded theorists believe that qualitative software is particularly well-suited to the constant comparative method. Nevertheless, prudent qualitative researchers who use computers as tools to facilitate the examination of their data continually examine their use of technology to enhance, rather than replace, recognized analytical methods.

Ensuring Rigor

Judging qualitative work by the positivist standards of validity and reliability is inappropriate as these tests are not applicable to the naturalistic paradigm. Instead, a grounded theory is assessed according to four standards, commonly referred to as fit, work, grab, and modifiability.

Fit

To ensure fit, the categories must be generated from the data, rather than the data being forced to comply with preconceived categories. In reality, many of the categories found in the data will be factors that occur commonly in everyday life. However, when such common social variables are found in the data, the researcher must write about these pre-existing categories in a way that reveals their origin in the data. Inserting quotations from the data into the written report is one way of documenting fit.

Work

To work, a theory should explain what happened and variation in how it happened, predict what will happen, and/or interpret what is happening for the people in the setting. Follow-up interviews with selected participants can be used as a check on how well the theory works for them.

Grab

Grab refers to the degree of relevance that the theory and its core concept have to the topic of the study. That is, the theory should be immediately recognizable to participants and others in similar circumstances, as reflective of their own experience.

Modifiability

Finally, modifiability becomes important after the study is completed and when the theory is applied. No grounded theory can be expected to account for changing circumstances. Over time, new variations and conditions that relate to the theory may be discovered, but a good BSP remains applicable because it can be extended and qualified appropriately to accommodate new data and variations.

Developments in Grounded Theory

In the 50 years that have elapsed since grounded theory was first described in 1967, various grounded theorists have developed modifications to the method. Although Barney Glaser continues to espouse classic grounded theory method, Anselm Strauss and Juliet Corbin introduced the conditional matrix as a tool for helping the analyst to explicate contextual conditions that exert influences upon the action under investigation. Using the conditional matrix model, the analyst is cued to examine the data for the effects of increasingly broad social structures, ranging from groups through organizations, communities, the country, and the international relations within which the action occurs. As new theoretical perspectives came to the fore, grounded theorists adapted the methodology accordingly. For example, Kathy Charmaz contributed a constructivist approach to grounded theory, and Adele Clarke expanded into postmodern thought with situational analysis. Others have used grounded theory within feminist and critical social theory perspectives. Whichever version a researcher chooses for conducting his or her grounded theory study, the basic tenets of grounded theory methodology continue to endure; conceptual theory is generated from the data by way of systematic and simultaneous collection and analysis.

P. Jane Milliken

See also Inference: Deductive and Inductive; Naturalistic Inquiry; NVivo; Qualitative Research
at the kth visit, and it is assumed that n1 and n2 are even.

Pocock's Test

The method consists of two steps:

1. Calculate Zk at each visit k, where k = 1, ..., K:

   Zk = [ 1 / √(nk(σ1² + σ2²)) ] × ( Σ x1j − Σ x2j ),

   where each sum runs over j = 1, ..., nk.

2. Compare Zk to a critical value Cp(K, α).

At any interim visit k prior to the final visit K, if |Zk| > Cp(K, α), then stop the trial to conclude that there is evidence that one treatment is superior to the other. Otherwise, continue to collect the assessments. The critical values Cp(K, α) are available in standard textbooks or statistical software packages.

The required sample size per treatment group at each interim visit is calculated as follows:

   n = Rp(K, α, β) × [ (Zα/2 + Zβ)² (σ1² + σ2²) / (μ1 − μ2)² ] / K,

where σ1² and σ2² are the variances of the continuous responses from Treatment Groups 1 and 2, respectively. Similarly, μ1 and μ2 are the means of the responses from the two treatment groups.

Wang and Tsiatis's Test

This method also uses the Zk from Step 1 in Pocock's test. In Step 2, Zk is compared with a critical value CWT(K, α, Δ)(k/K)^(Δ − 1/2). Refer to Sample Size Calculations in Clinical Trial Research, by Shein-Chung Chow, Jun Shao, and Hansheng Wang, for the table of critical values. Pocock's test and O'Brien and Fleming's test are considered to be special cases of Wang and Tsiatis's test.

The calculation of the required sample size per treatment group at each interim visit is formulated as follows:

   n = RWT(K, α, β) × [ (Zα/2 + Zβ)² (σ1² + σ2²) / (μ1 − μ2)² ] / K.

For calculation of critical values, see the Further Readings section.

Abdus S. Wahed and Sachiko Miyahara

See also Cross-Sectional Design; Internal Validity; Longitudinal Design; Sequential Design

Further Readings

Chow, S., Shao, J., & Wang, H. (2008). Sample size calculations in clinical trial research. Boca Raton, FL: Chapman and Hall.
Jennison, C., & Turnbull, B. (2000). Group sequential methods with applications to clinical trials. New York: Chapman and Hall.
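Pocock's two-step monitoring rule can be sketched in a few lines of Python. The function names and the illustrative data below are not from the entry; the constant boundary used, Cp(K = 5, α = .05) ≈ 2.413, is the commonly tabulated value:

```python
import math

def z_statistic(x1, x2, var1, var2):
    """Z_k from Step 1: standardized difference of the group sums for the
    n_k observations per treatment group accrued by interim visit k."""
    n_k = len(x1)
    return (sum(x1) - sum(x2)) / math.sqrt(n_k * (var1 + var2))

def pocock_monitor(arm1, arm2, var1, var2, looks, c_p):
    """Step 2 applied at each interim visit: stop as soon as |Z_k| exceeds
    the constant Pocock boundary C_p(K, alpha); otherwise continue."""
    z = float("nan")
    for k, n_k in enumerate(looks, start=1):
        z = z_statistic(arm1[:n_k], arm2[:n_k], var1, var2)
        if abs(z) > c_p:
            return k, z   # early stopping: one treatment appears superior
    return None, z        # no boundary crossing: trial runs to the end

# Illustration with a large, obvious treatment effect.
arm1 = [3.0] * 50              # responses under Treatment 1
arm2 = [0.0] * 50              # responses under Treatment 2
looks = [10, 20, 30, 40, 50]   # n_k per group at the K = 5 interim visits
stop_at, z = pocock_monitor(arm1, arm2, 1.0, 1.0, looks, c_p=2.413)
```

With this exaggerated effect the boundary is crossed at the first look; in a real trial the boundary constant would be read from tables such as those in Chow, Shao, and Wang (2008).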
GUESSING PARAMETER

Figure 1   Item Characteristic Curve (probability of a correct response plotted against θ, from −3 to +3)
In item response theory (IRT), the guessing parameter is a term informally used for the lower asymptote parameter in a three-parameter-logistic (3PL) model. Among examinees who demonstrate very low levels of the trait or ability measured by the test, the value of the guessing parameter is the expected proportion that will answer the item correctly or endorse the item in the scored direction. This can be understood more easily by examining the 3PL model:

   P(θ) = ci + (1 − ci) × e^{1.7ai(θ − bi)} / [1 + e^{1.7ai(θ − bi)}],   (1)

where θ is the value of the trait or ability; P(θ) is the probability of correct response or item endorsement, conditional on θ; ai is the slope or discrimination for item i; bi is the difficulty or threshold for item i; and ci is the lower asymptote or guessing parameter for item i. Sometimes, the symbol g is used instead of c. In Equation 1, as θ decreases relative to b, the second term approaches zero and thus the probability approaches c. If it is reasonable to assume that the proportion of examinees with very low θ who know the correct answer is virtually zero, it is reasonable to assume that those who respond correctly do so by guessing. Hence, the lower asymptote is often labeled the guessing parameter. Figure 1 shows the probabilities from Equation 1 plotted across the range of θ, for a = 1.5, b = 0.5, and c = 0.2. The range of θ is infinite, but the range chosen for the plot was −3 to +3 because most examinees or respondents would fall within this range if the metric were set such that θ had a mean of 0 and standard deviation of 1 (a common, though arbitrary, way of defining the measurement metric in IRT). This function is called an item characteristic curve (ICC) or item response function (IRF). The value of the lower asymptote or guessing parameter in Figure 1 is 0.2, so as θ becomes infinitely low, the probability of a correct response approaches 0.2.

Guessing does not necessarily mean random guessing. If the distractors function effectively, the correct answer should be less appealing than the distractors and thus would be selected less than would be expected by random chance. The lower asymptote parameter would then be less than 1/number of options. Frederic Lord suggested that this is often the case empirically for large-scale tests. Such tests tend to be well-developed, with items that perform poorly in pilot testing discarded before the final forms are assembled. In more typical classroom test forms, one or more of the distractors may be implausible or otherwise not function well. Low-ability examinees may guess randomly from a subset of plausible distractors, yielding a lower asymptote greater than 1/number of options. This could also happen when there is a clue to the right answer, such as the option length. The same effect would occur if examinees can reach the correct answer even by using faulty reasoning or knowledge. Another factor is that examinees who guess tend to choose middle response options; if the correct answer is B or C, the probability of a correct response by guessing would be higher than if the correct answer were A or D.

Because the lower asymptote may be less than or greater than random chance, it may be specified as a parameter to be freely estimated, perhaps with constraints to keep it within
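Equation 1 is straightforward to code. A small sketch (the function name is illustrative), evaluated at the parameter values used for Figure 1 (a = 1.5, b = 0.5, c = 0.2):

```python
import math

def p_3pl(theta, a, b, c):
    """Equation 1: three-parameter-logistic probability of a correct
    response, with discrimination a, difficulty b, and lower asymptote c."""
    z = 1.7 * a * (theta - b)
    return c + (1 - c) * math.exp(z) / (1 + math.exp(z))

# Parameters of the item plotted in Figure 1.
a, b, c = 1.5, 0.5, 0.2

low = p_3pl(-3.0, a, b, c)   # near the lower asymptote c = 0.2
mid = p_3pl(b, a, b, c)      # at theta = b: c + (1 - c)/2 = 0.6
```

As θ falls, the value returned approaches the lower asymptote c, exactly as described for Figure 1.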
Table 1   The Pattern of Responses of a Perfect Guttman Scale

                                     Problems
Children   Counting   Addition   Subtraction   Multiplication   Division
S1         1          0          0             0                0
S2         1          1          0             0                0
S3         1          1          1             0                0
S4         1          1          1             1                0
S5         1          1          1             1                1

Note: A value of 1 means that the child (row) has mastered the type of problem (column); a value of 0 means that the child has not mastered the type of problem.

to find, however, are children, for example, who have mastered division but who have not mastered addition or subtraction or multiplication. So, the set of patterns of responses that we expect to find is well structured and is shown in Table 1. The pattern of data displayed in this table is consistent with the existence of a single dimension of mathematical ability. In this framework, a child has reached a certain level of this mathematical ability and can solve all the problems below this level and none of the problems above this level.

When the data follow the pattern illustrated in Table 1, the rows and the columns of the table both can be represented on a single dimension. The operations will be ordered from the easiest to the hardest, and a child will be positioned on the right of the most difficult type of operation solved. So the data from Table 1 can be represented by the following order:

   Counting—S1—Addition—S2—Subtraction—S3—Multiplication—S4—Division—S5   (1)

This order can be transformed into a set of numerical values by assigning numbers with equal steps between two contiguous points. For example, this set of numbers can represent the numerical values corresponding to Table 1:

   Counting   S1   Addition   S2   Subtraction   S3   Multiplication   S4   Division   S5
   1          2    3          4    5             6    7                8    9          10

This scoring scheme implies that the score of an observation (i.e., a row in Table 1) is proportional to the number of nonzero variables (i.e., columns in Table 1) for this row.

The previous quantifying scheme assumes that the differences in difficulty are the same between all pairs of contiguous operations. In real applications, it is likely that these differences are not the same. In this case, a way of estimating the size of the difference between two contiguous operations is to consider that this difference is inversely proportional to the number of children who solved a given operation (i.e., an easy operation is solved by a large number of children, a hard one is solved by a small number of children).

How to Order the Rows of a Matrix to Find the Scale

When the Guttman model is valid, there are multiple ways of finding the correct order of the rows and the columns that will give the format of the data as presented in Table 1. The simplest approach is to reorder rows and columns according to their marginal sum. Another theoretically interesting procedure is to use correspondence analysis (which is a type of factor analysis tailored for qualitative data) on the data table; then, the coordinates on the first factor of the analysis will provide the correct ordering of the rows and the columns.

Imperfect Scale

In practice, it is rare to obtain data that fit a Guttman scaling model perfectly. When the data do not conform to the model, one approach is to relax the unidimensionality assumption and assume that the underlying model involves several dimensions. Then, these dimensions can be obtained and analyzed with multidimensional techniques such as correspondence analysis (which can be seen as a multidimensional generalization of Guttman scaling) or multidimensional scaling. Another approach is to consider that the deviations from the ideal scale are random errors. In this case, the problem is to recover the Guttman scale from noisy data. There are several possible ways to fit a Guttman scale to a set of data. The simplest method (called the Goodenough–Edwards method) is to order the rows and the columns according to their marginal sum. An example of
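The marginal-sum reordering can be sketched as follows; the response matrix is Table 1's perfect scale with its rows and columns deliberately shuffled (the variable names are illustrative):

```python
# Columns (operations) in a deliberately shuffled order; each child's
# responses are listed in that same column order.
cols = ["Division", "Counting", "Multiplication", "Addition", "Subtraction"]
rows = {
    "S4": [0, 1, 1, 1, 1],
    "S1": [0, 1, 0, 0, 0],
    "S5": [1, 1, 1, 1, 1],
    "S2": [0, 1, 0, 1, 0],
    "S3": [0, 1, 0, 1, 1],
}

# Marginal sums: how many children solved each operation.
col_sums = [sum(r[i] for r in rows.values()) for i in range(len(cols))]

# Order columns easiest-first (largest sum) and rows by how many
# operations each child solved (fewest first).
col_order = sorted(range(len(cols)), key=lambda i: -col_sums[i])
row_order = sorted(rows, key=lambda child: sum(rows[child]))

ordered_cols = [cols[i] for i in col_order]
ordered = [[rows[child][i] for i in col_order] for child in row_order]
```

Applied to the shuffled matrix, the reordering recovers exactly the triangular pattern of Table 1, with columns running from Counting to Division and rows from S1 to S5.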
Table 2   An Imperfect Guttman Scale

                                     Problems
Children   Counting   Addition   Subtraction   Multiplication   Division   Sum
C1         1          0          0             0                0          1
C2         1          0*         1*            0                0          2
C3         1          1          1             0                0          3
C4         1          1          0*            1                0          3
C5         1          1          1             1                1          5
Sum        5          3          3             2                1          –

Notes: Values with an asterisk (*) are considered errors. Compare with Table 1, showing a perfect scale.

of its CR is equal to or larger than .90. In practice, it is often possible to improve the CR of a scale by eliminating rows or columns that contain a large proportion of errors. However, this practice may also lead to capitalizing on random errors and may give an unduly optimistic view of the actual reproducibility of a scale.

Hervé Abdi

See also Canonical Correlation Analysis; Categorical Variable; Correspondence Analysis; Likert Scaling; Principal Components Analysis; Thurstone Scaling
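The coefficient of reproducibility (CR) invoked above is conventionally computed as 1 minus the proportion of errors among all responses; this formula is the standard Guttman definition, not quoted from the entry. For Table 2, with its three flagged errors among 5 × 5 = 25 responses:

```python
# Standard Guttman coefficient of reproducibility:
# CR = 1 - (number of errors) / (total number of responses).
n_errors = 3          # the asterisked cells in Table 2
n_responses = 5 * 5   # 5 children x 5 types of problems
CR = 1 - n_errors / n_responses
```

Under this convention the scale in Table 2 yields CR = .88, which falls below the .90 criterion mentioned above.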
purpose of the study. Specifically, their informed consent form also indicated that they would be taking part in a research study investigating patient acceptability of the side effects of anesthesia. The researchers then examined postoperative changes in psychological well-being and physical complaints (e.g., nausea, vomiting, and pain) in the two groups. Consistent with the Hawthorne effect, participants who received the additional information indicating that they were part of a research study reported significantly better postoperative psychological and physical well-being than participants who were not informed of the study. Similar to the conclusions drawn at the Hawthorne Works manufacturing plant, researchers in the knee surgery study noted that a positive response accompanied simply knowing that one was being observed as part of research participation.

Threat to Internal Validity

The Hawthorne effect represents one specific type of reactivity. Reactivity refers to the influence that an observer has on the behavior under observation and, in addition to the Hawthorne effect, includes experimenter effects (the tendency for participants to change their behavior to meet the expectation of researchers), the Pygmalion effect (the tendency of students to change their behavior to meet teacher expectations), and the Rosenthal effect (the tendency of individuals to internalize the expectations, whether good or bad, of an authority figure). Any type of reactivity poses a threat to interpretation about the relationships under investigation in a research study, otherwise known as internal validity. Broadly speaking, the internal validity of a study is the degree to which changes in outcome can be attributed to something the experimenter intended rather than attributed to uncontrolled factors. For example, consider a study in which a researcher is interested in the effect of having a pet on loneliness among the elderly. Specifically, the researcher hypothesizes that elderly individuals who have a pet will report less loneliness than those who do not. To test that relationship, the researcher randomly assigns the elderly participants to one of two groups: one group receives a pet and the other group does not. The researcher schedules monthly follow-up visits with the research participants to interview them about their current level of loneliness. In interpreting the findings, there are several potential reactivity effects to consider in this design. First, failure to find any differences between the two groups could be attributed to the attention that both groups received from the researcher at the monthly visits—that is, reacting to the attention and knowledge that someone is interested in improving their situation (the Hawthorne effect). Conversely, reactivity also could be applied to finding significant differences between the two groups. For example, the participants who received the pet might report less loneliness in an effort to meet the experimenter's expectations (experimenter effects). These are but two of the many possible reactivity effects to be considered in this study.

In sum, many potential factors can impede accurate interpretation of study findings. Reactivity effects represent one important area of consideration when designing a study. It is in the best interest of the researcher to safeguard against reactivity effects to the best of their ability in order to have a greater degree of confidence in the internal validity of their study.

How to Reduce Threat to Internal Validity

The Hawthorne effect is perhaps the most challenging threat to internal validity for researchers to control. Although double-blind studies (i.e., studies in which neither the research participant nor the experimenter is aware of the intervention to which they are assigned) control for many threats to internal validity, double-blind research designs do not eliminate the Hawthorne effect. Rather, they just make the effect equal across groups, given that everyone knows they are in a research study and that they are being observed. To help mitigate the Hawthorne effect, some have suggested a special study design employing what has been referred to as a Hawthorne control. This type of design includes three groups of participants: the control group, who receive no treatment; the experimental group, who receive the treatment of interest to the researchers; and the Hawthorne control group, who receive a treatment that is irrelevant to the outcome of interest to the experimenters. For instance, consider the previous example regarding the effect of having a pet on loneliness in the
elderly. In that example, the control group would not receive a pet, the experimental group would receive a pet, and the Hawthorne control group would receive something not expected to impact loneliness, such as a book about pets. Thus, if the outcome for the experimental group is significantly different from the outcome of the Hawthorne control group, one can reasonably argue that the specific experimental manipulation, and not simply the knowledge that one is observed, resulted in the group differences.

Hawthorne studies formed the foundation for the development of a branch of psychology known as industrial/organizational psychology. This particular branch focuses on maximizing the success of organizations and of groups and individuals within organizations. The outcomes of the Hawthorne Works research led to an emphasis on the impact of leadership styles, employee attitudes, and interpersonal relationships on maximizing productivity, an area known as the human relations movement.

Lisa M. James and Hoa T. Vo
and behavioral sciences, it is actually misleading. For reasons discussed in this entry, some argue it should more properly be called the observer effect. In addition, this effect is examined in relation to other concepts and effects.

Observer Effect

The observer effect can be found in almost any scientific discipline. A commonplace example is taking the temperature of a liquid. This measurement might occur by inserting a mercury-bulb thermometer into the container and then reading the outcome on the instrument. Yet unless the thermometer has exactly the same temperature as the liquid, this act will alter the liquid's post-measurement temperature. If the thermometer's temperature is warmer, then the liquid will be warmed, but if the thermometer's temperature is cooler, then the liquid will be cooled. Of course, the magnitude of the measurement contamination will depend on the temperature discrepancy between the instrument and the liquid. The contamination also depends on the relative amount of material involved (as well as on the specific heat capacities of the substances). The observer effect of measuring the temperature of saline solution in a small vial is far greater than using the same thermometer to assess the temperature of the Pacific Ocean.

As the last example implies, the observer effect can be negligible and, thus, unimportant. In some cases, it can even be said to be nonexistent. If a straight-edge ruler is used to measure the length of an iron bar, under most conditions, it is unlikely that the bar's length will have been changed. Yet even this statement is contingent on the specific conditions of measurement. For instance, suppose that the goal was to measure in situ the length of a bar found deep within a subterranean cave. Because that measurement would require the observer to import artificial light and perhaps even inadvertent heat from the observer's body, the bar's dimension could slightly increase. Perhaps the only natural science in which observer effects are completely absent is astronomy. The astronomer can measure the attributes of a remote stellar object, nebula, or galaxy without any fear of changing the phenomenon. Indeed, as in the case of supernovas, the entity under investigation might no longer exist by the time the photons have traveled the immense number of light years to reach the observer's telescopes, photometers, and spectroscopes.

Observer effects permeate many different kinds of research in the behavioral and social sciences. A famous example in industrial psychology is the Hawthorne effect, whereby the mere change in environmental conditions can induce a temporary, and often positive, alteration in performance or behavior. A comparable illustration in educational psychology is the Rosenthal or "teacher-expectancy" effect, in which student performance is enhanced in response to a teacher's expectation of improved performance. In fact, it is difficult to conceive of a research topic or method that is immune from observer effects. They might intrude on laboratory experiments, field experiments, interviews, and even "naturalistic" observations; the quotes are added because the observations cease to be natural to the extent that they are contaminated by observer effects. In "participant observation" studies, the observer most likely alters the observed phenomena to the very degree that he or she actively participates.

Needless to say, observer effects can seriously undermine the validity of the measurement in the behavioral and social sciences. If the phenomenon reacts to assessment, then the resulting score might not closely reflect the true state of the case at the time of measurement. Even so, observer effects are not all equivalent in the magnitude of their interference. On the one hand, participants in laboratory experiments might experience evaluation apprehension that interferes with their performance on some task, but this interference might be both small and constant across experimental conditions. The repercussions are thus minimal. On the other hand, participants might respond to certain cues in the laboratory setting, so-called demand characteristics, by deliberately behaving in a manner consistent with their perception of the experimenter's hypothesis. Such artificial (even if accommodating) behavior can render the findings scientifically useless.

Sometimes researchers can implement procedures in the research design that minimize observer effects. A clear-cut instance is the double-blind trials commonly used in biomedical research. Unlike single-blind trials, where only the participant is ignorant of the experimental treatment,
double-blind trials ensure that the experimenter is equally unaware. Neither the experimenter nor the participant knows the treatment condition. Such double-blind trials are especially crucial in avoiding the placebo effect, a contaminant that might include an observer effect as one component. If the experimenter is confident that a particular medicine will cure or ameliorate a patient's ailment or symptoms, that expectation alone can improve the clinical outcomes.

Another instance where investigators endeavor to reduce observer effects is the use of deception in laboratory experiments, particularly in fields like social psychology. If research participants know the study's purpose right from the outset, their behavior will probably not be representative of how they would act otherwise. So the participants are kept ignorant, usually by being deliberately misled. The well-known Milgram experiment offers a case in point. To obtain valid results, the participants had to be told that (a) the investigator was studying the role of punishment on paired-associate learning, (b) the punishment was being administered using a device that delivered real electric shocks, and (c) the learner who was receiving those shocks was experiencing real pain and was suffering from a heart condition. All three of these assertions were false but largely necessary (with the exception of the very last deception).

A final approach to avoiding observer effects is to use some variety of unobtrusive or nonreactive measures. One example is archival data analysis, such as content analysis and historiometry. When the private letters of suicides are content analyzed, the act of measurement cannot alter the phenomenon under investigation. Likewise, when historiometric techniques are applied to biographical information about eminent scientists, that application leaves no imprint on the individuals being studied.

Quantum Physics

The inspiration for the term Heisenberg effect originated in quantum physics. Early in the 20th century, quantum physicists found that the behavior of subatomic particles departed in significant, even peculiar, ways from the "billiard ball" models that prevailed in classical (Newtonian) physics. Instead of a mechanistic determinism in which future events were totally fixed by the prior distributions and properties of matter and energy, reality became much more unpredictable. Two quantum ideas were especially crucial to the concept of the observer effect.

The first is the idea of superposition. According to quantum theory, it is possible for an entity to exist in all available quantum states simultaneously. Thus, an electron is not in one particular state but in multiple states described by a probability distribution. Yet when the entity is actually observed, it can only be in one specific state. A classic thought experiment illustrating this phenomenon is known as Schrödinger's cat, a creature placed in a box with poison that would be administered contingent on the state of a subatomic particle. Prior to observation, the cat might be either alive or dead, but once it undergoes direct observation, it must occupy just one of these two states. A minority of quantum theorists have argued that it is the observation itself that causes the superposed states to collapse suddenly into just a single state. Given this interpretation, the result can be considered an observer effect. In a bizarre way, if the cat ends up dead, then the observer killed it by destroying the superposition! Nonetheless, the majority of theorists do not accept this view. The very nature of observation or measurement in the micro world of quantum physics cannot have the same meaning as in the macro world of everyday Newtonian physics.

The second concept is closest to the source of the term, namely, the 1927 Heisenberg uncertainty principle. Named after the German physicist Werner Heisenberg, this rule asserts that there is a definite limit to how precisely both the momentum and the position of a given subatomic particle, such as an electron, can simultaneously be measured. The more the precision is increased in the measurement of momentum, the less precise will be the concurrent measurement of that particle's position, and conversely. Stated differently, these two particle attributes have linked probability distributions, so that if one distribution is narrowed, the other is widened. In early discussions of the uncertainty principle, it was sometimes argued that this trade-off was the upshot of observation. For instance, to determine the location of an electron requires that it be struck with a photon, but that very collision changes the electron's momentum.
Nevertheless, as in the previous case, most quantum theorists perceive the uncertainty as being inherent in the particle and its entanglement with the environment. Position and momentum in a strict sense are concepts in classical physics that again do not mean the same thing in quantum physics. Indeed, it is not even a measurement issue: The uncertainty principle applies independent of the means by which physicists attempt to assess a particle's properties. There is no way to improve measurement so as to lower the degree of uncertainty below a set limit.

In short, the term Heisenberg effect has very little, if any, relation with the Heisenberg uncertainty principle, or for that matter any other idea that its originator contributed to quantum physics. Its usage outside of quantum physics is comparable with that of using Einstein's theory of relativity to justify cultural relativism in the behavioral and social sciences. Behavioral and social scientists are merely borrowing the prestige of physics by adopting an eponym, yet in doing so, they end up forfeiting the very conceptual precision that grants physics more status. For this reason, some argue that it would probably be best if the term Heisenberg effect were replaced with the term observer effect.

Related Concepts

The observer effect can be confused with other ideas besides the Heisenberg uncertainty principle. Some of these concepts are closely related, and others are not.

An instance of the former is the phenomenon that can be referred to as the enlightenment effect. This occurs when the result of scientific research becomes sufficiently well known that the finding renders itself obsolete. In theory, the probability of replicating the Milgram obedience experiment might decline as increasingly more potential research participants become aware of the results of the original study. Although enlightenment effects could be a positive benefit with respect to social problems, they would be a negative cost from a scientific perspective. Science presumes the accumulation of knowledge, and knowledge cannot accumulate if findings cannot be replicated. Still, it must be recognized that the observer and enlightenment effects are distinct. Where in the former the observer directly affects the participants in the original study, in the latter, it is the original study's results that affect the participants in a later partial or complete replication. Furthermore, for good or ill, there is little evidence that enlightenment effects actually occur. Even the widely publicized Milgram experiment was successfully replicated many decades later.

An example of a divergent concept is also the most superficially similar: observer bias. This occurs when the characteristics of the observer influence how data are recorded or analyzed. Unlike the observer effect, observer bias occurs in the researcher rather than in the participant. The first prominent example in the history of science appeared in astronomy. Astronomers observing the exact same event, such as the precise time a star crossed a line in a telescope, would often give consistently divergent readings. Each astronomer had a "personal equation" that added or subtracted some fraction of a second to the correct time (defined as the average of all competent observations). Naturally, if observer bias can occur in such a basic measurement, it can certainly infringe on the more complex assessments that appear in the behavioral and social sciences. Hence, in an observational study of aggressive behavior on the playground, two independent researchers might reliably disagree in what mutually observed acts can be counted as instances of aggression. Prior training of the observers might still not completely remove these personal biases. Even so, to the degree that observer bias does not affect the overt behavior of the children being observed, it cannot be labeled as an observer effect.

Dean Keith Simonton

See also Experimenter Expectancy Effect; Hawthorne Effect; Interviewing; Laboratory Experiments; Natural Experiments; Naturalistic Observation; Observational Research; Rosenthal Effect; Validity of Measurement

Further Readings

Chiesa, M., & Hobbs, S. (2008). Making sense of social research: How useful is the Hawthorne effect? European Journal of Social Psychology, 38, 67–74.
Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to
demand characteristics and their implications. American Psychologist, 17, 776–783.
Rosenthal, R. (2002). The Pygmalion effect and its mediating mechanisms. San Diego, CA: Academic Press.
Sechrest, L. (2000). Unobtrusive measures. Washington, DC: American Psychological Association.

HIERARCHICAL LINEAR MODELING

Hierarchical linear modeling (HLM, also known as multilevel modeling) is a statistical approach for analyzing hierarchically clustered observations. Observations might be clustered within experimental treatment (e.g., patients within group treatment conditions) or natural groups (e.g., students within classrooms) or within individuals (repeated measures). HLM provides proper parameter estimates and standard errors for clustered data. It also capitalizes on the hierarchical structure of the data, permitting researchers to answer new questions involving the effects of predictors at both group (e.g., class size) and individual (e.g., student ability) levels. Although the focus here is on two-level models with continuous outcome variables, HLM can be extended to other forms of data (e.g., binary variables, counts) with more than two levels of clustering (e.g., student, classroom, and school). The key concepts in HLM are illustrated in this entry using a subsample of a publicly accessible data set based on the 1982 High School and Beyond (HS&B) Survey. The partial HS&B data set, which is included in the free student version of HLM available from Scientific Software International, Inc., contains a total of 7,185 students nested within 160 high schools. Mathematics achievement (MathAch) will be used as the outcome variable in a succession of increasingly complex models. The results of Models A and B discussed here were reported by Stephen Raudenbush and Anthony Bryk in their HLM text.

Some Important Submodels

Model A: Random-Intercepts Model

The random-intercepts model is the simplest model in which only group membership (here, schools) affects the level of achievement. In HLM, separate sets of regression equations are written at the individual (level 1) and group (level 2) levels of analysis.

Table 1   Parameter Estimates for Different Models Based on the Full Information Maximum Likelihood (FIML) Estimation

                                 Model A      Model B      Model C
Fixed-effect estimate
  Intercept (γ00)                12.64        12.67        13.15
    (SE)                         (.24)        (.19)        (.21)
  SESCij (γ10)                   —            2.40         2.55
    (SE)                                      (.12)        (.14)
  Himintyj (γ01)                 —            —            1.86
    (SE)                                                   (.40)
  SESCij × Himintyj (γ11)        —            —            .57
    (SE)                                                   (.25)
Variance estimate of the random effect
  τ00                            8.55         4.79         4.17
  τ11                            —            .40          .34
  τ10 (or τ01)                   —            .15 ns       .35 ns
  σ2                             39.15        36.83        36.82
Deviance statistic (D)           47,113.97    46,634.63    46,609.06
Number of parameters             3            6            8

Notes: Standard errors of fixed effects are in parentheses. All parameter estimates are significant at p < .05 unless marked ns. Model A contains no predictor. Model B has SESCij as predictor, and Model C has both SESCij and Himintyj as predictors.

Level 1 (Student-Level) Model

MathAch_ij = β0j + e_ij,   (1)

where i represents each student and j represents each school. Note that no predictors are included in Equation 1. β0j is the mean MathAch score for school j; e_ij is the within-school residual that captures the difference between an individual MathAch score and the school mean MathAch. e_ij is assumed to be normally distributed, and the variance of e_ij is assumed to be homogeneous across schools [i.e., e_ij ~ N(0, σ2) for all 160 schools]. As presented in Table 1 (Model A), the variance of e_ij is equal to σ2 = 39.15.
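One way to read the Model A estimates in Table 1 is through the intraclass correlation, the share of total outcome variance that lies between schools, ρ = τ00/(τ00 + σ²). The entry itself does not compute this quantity here, so the sketch below (the function name is ours) is only an illustration using the Table 1 values:

```python
def intraclass_correlation(tau00, sigma2):
    """Proportion of total outcome variance attributable to differences
    between groups (schools) in a random-intercepts model."""
    return tau00 / (tau00 + sigma2)

# Model A estimates from Table 1: tau00 = 8.55 (between schools),
# sigma^2 = 39.15 (within schools).
rho = intraclass_correlation(8.55, 39.15)
print(round(rho, 3))  # 0.179: roughly 18% of MathAch variance lies between schools
```

A nonnegligible value like this is exactly the situation in which ordinary regression standard errors would be misleading and HLM is warranted.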
plotted against the observed Mahalanobis distances across all clusters.

Diagnostics

Toby Lewis and a colleague proposed a top-down procedure (i.e., from highest level to lowest level) using diagnostic measures such as leverage, internally and externally Studentized residuals, and DFFITS for different level observations, which are analogous to diagnostic measures commonly used in OLS regression.

Oi-Man Kwok, Stephen G. West, and Ehri Ryu

See also Fixed-Effects Models; Intraclass Correlation; Multilevel Modeling; Multiple Regression; Random-Effects Models

Further Readings

de Leeuw, J., & Meijer, E. (2008). Handbook of multilevel analysis. New York: Springer.
Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah, NJ: Lawrence Erlbaum.
Kreft, I. G. G., & de Leeuw, J. (1998). Introducing multilevel modeling. Thousand Oaks, CA: Sage.
Lewis, T., & Langford, I. H. (2001). Outliers, robustness and the detection of discrepant data. In A. H. Leyland & H. Goldstein (Eds.), Multilevel modelling of health statistics (pp. 75–91). New York: Wiley.
O'Connell, A. A., & McCoach, D. B. (2008). Multilevel modeling of educational data. Charlotte, NC: Information Age Publishing.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.

Websites

Scientific Software International, Inc. HLM 6, Student Version: http://www.ssicentral.com/hlm/student.html

HISTOGRAM

A histogram is a method that uses bars to display count or frequency data. The independent variable consists of interval- or ratio-level data and is usually displayed on the abscissa (x-axis), and the frequency data on the ordinate (y-axis), with the height of the bar proportional to the count. If the data for the independent variable are put into "bins" (e.g., ages 0–4, 5–9, 10–14, etc.), then the width of the bar is proportional to the width of the bin. Most often, the bins are of equal size, but this is not a requirement. A histogram differs from a bar chart in two ways. First, the independent variable in a bar chart consists of either nominal (i.e., named, unordered categories, such as religious affiliation) or ordinal (ranks or ordered categories, such as stage of cancer) data. Second, to emphasize the fact that the independent variable is not continuous, the bars in a bar chart are separated from one another, whereas they abut each other in a histogram. After a bit of history, this entry describes how to create a histogram and then discusses alternatives to histograms.

A Bit of History

The term histogram was first used by Karl Pearson in 1895, but even then, he referred to it as a "common form of graphical representation," implying that the technique itself was considerably older. Bar charts (along with pie charts and line graphs) were introduced over a century earlier by William Playfair, but he did not seem to have used histograms in his books.

Creating a Histogram

Consider the hypothetical data in Table 1, which tabulates the number of hours of television watched each week by 100 respondents. What is immediately obvious is that it is impossible to comprehend what is going on. The first step in trying to make sense of these data is to put them in rank order, from lowest to highest. This says that the lowest value is 0 and the highest is 64, but it does not yield much more in terms of understanding. Plotting the raw data would result in several problems. First, many of the bars will have heights of zero (e.g., nobody reported watching for one, two, or three hours a week), and most of the other bars will be only one or two units high (i.e., the number of people reporting that specific value). This leads to the second problem, in that it makes
it difficult to discern any pattern. Finally, the x-axis will have many values, again interfering with comprehension.

The solution is to group the data into mutually exclusive and collectively exhaustive classes, or bins. The issue is how many bins to use. Most often, the answer is somewhere between 6 and 15, with the actual number depending on two considerations. The first is that the bin size should be an easily comprehended size. Thus, bin sizes of 2, 5, 10, or 20 units are recommended, whereas those of 3, 7, or 9 are not. The second consideration is esthetics; the graph should get the point across and not look too cluttered.

Although several formulas have been proposed to determine the width of the bins, the simplest is arguably the most useful. It is the range of the values (largest minus smallest) divided by the desired number of bins. For these data, the range is 64, and if 10 bins are desired, it would lead to a bin width of 6 or 7. Because these are not widths that are easy to comprehend, the closest compromise would be 5. Table 2 shows the results of putting the data in bins of five units each. The first column lists the values included in each bin; the second column provides the midpoint of the bin; and the third column summarizes the number of people in each bin. The last column, which gives the cumulative total for each interval, is a useful check that the counting was accurate.

Table 1   Fictitious Data on How Many Hours of Television Are Watched Each Week by 100 People

Subjects    Data
1–5         41  43  14  35  31
6–10        39  22   9  32  49
11–15       12  27  53   7  23
16–20       29  22  22  26  14
21–25       33  34  12  13  16
26–30       34  25  40   5  41
31–35       43  30  40  44  12
36–40       55  14  25  32  10
41–45       30  28  25  23   0
46–50       56  24  17  15  33
51–55       30  15  29  20  14
56–60       40  26  24  34  49
61–65       50  26  13  36  47
66–70       19   9  64  35  33
71–75       35  39   9  25  41
76–80        5  18  54  11  59
81–85       36  36  37  52  29
86–90       24  22  41  36  31
91–95       32  10  50  45  23
96–100      24  15   5  20  52

Table 2   The Data in Table 1 Grouped into Bins

Interval   Midpoint   Count   Cumulative total
0–4         2          1        1
5–9         7          7        8
10–14      12         12       20
15–19      17          7       27
20–24      22         13       40
25–29      27         12       52
30–34      32         14       66
35–39      37         10       76
40–44      42         10       86
45–49      47          4       90
50–54      52          6       96
55–59      57          3       99
60–64      62          1      100

However, there is a price to pay for putting the data into bins, and it is that some information is lost. For example, Table 2 shows that seven people watched between 15 and 19 hours of television per week, but the exact amounts are now no longer known. In theory, all seven watched 17 hours each. In reality, only one person watched 17 hours, although the mean of 17 for these people is relatively accurate. The larger the bin width, the more information that is lost.

The scale on the y-axis should allow the largest number in any of the bins to be shown, but again it should result in divisions that are easy to grasp. For example, the highest value is 14, but if this were chosen as the top, then the major tick marks would be at 0, 7, and 14, which is problematic for the viewer. It would be better to extend the y-axis to 15, which will result in tick marks every five units, which is ideal. Because the data being plotted are counts or frequencies, the y-axis most often starts at zero. Putting all this together results in the histogram in Figure 1.

The exception to the rule of the y-axis starting at zero is when all of the bars are near the top of the graph. In this situation, small but important differences might be hidden. When this occurs, the
Figure 1   Histogram Based on Table 2 (y-axis: counts, 0 to 10; x-axis: Number of Hours, bin midpoints 2 through 62)

bottom value should still be zero, and there would be a discontinuity before the next value. It is important, though, to flag this for the viewer, by having a break within the graph itself, as in Figure 2.

Figure 2   (y-axis: Number of People, scaled to 100)

The histogram is an excellent way of displaying several attributes about a distribution. The first is its shape: is it more or less normal, or rectangular, or does it seem to follow a power function, with counts changing markedly at the extremes? The second attribute is its symmetry: is the distribution symmetrical, or are there very long or heavy tails at one end? This is often seen if there is a natural barrier at one end, beyond which the data cannot go, but no barrier at the other end. For example, in plotting length of hospitalization or time to react to some stimulus, the barrier at the

asymmetry could be a warning not to use certain statistical tests with these data, if the tests assume that the data are normally distributed. Finally, the graph can easily show whether the data are unimodal, bimodal, or have more than two peaks that stand out against all of the other data.

Alternatives to Histograms

One alternative to a histogram is a frequency polygon. Instead of drawing bars, a single point is placed at the top of the bin, corresponding to its midpoint, and the points are connected with lines. The only difference between a histogram and a frequency polygon is that, by convention, an extra bin is placed at the upper and lower ends with

Further Readings

Cleveland, W. S. (1985). The elements of graphing data. Pacific Grove, CA: Wadsworth.
Robbins, N. B. (2005). Creating more effective graphs. Hoboken, NJ: Wiley.
Streiner, D. L., & Norman, G. R. (2007). Biostatistics: The bare essentials (3rd ed.). Shelton, CT: People's Medical Publishing House.

HOLM'S SEQUENTIAL BONFERRONI PROCEDURE

The more statistical tests one performs, the more likely one is to reject the null hypothesis when it is
true (i.e., a false alarm, also called a Type I error). This is a consequence of the logic of hypothesis testing: The null hypothesis is rejected for rare events, and the larger the number of tests, the easier it is to find rare events that are false alarms. This problem is called the inflation of the alpha level. To be protected from it, one strategy is to correct the alpha level when performing multiple tests. Making the alpha level more stringent (i.e., smaller) will create fewer errors, but it might also make it harder to detect real effects. The most well-known correction is called the Bonferroni correction; it consists in multiplying each probability by the total number of tests performed. A more powerful (i.e., more likely to detect an effect if one exists) sequential version was proposed by Sture Holm in 1979. In Holm's sequential version, the tests first need to be performed in order to obtain their p values. The tests are then ordered from the one with the smallest p value to the one with the largest p value. The test with the lowest probability is tested first with a Bonferroni correction involving all tests. The second test is tested with a Bonferroni correction involving one less test, and so on for the remaining tests. Holm's approach is more powerful than the Bonferroni approach, but it still keeps the inflation of the Type I error under control.

The Different Meanings of Alpha

When a researcher performs more than one statistical test, he or she needs to distinguish between two interpretations of the α level, which represents the probability of a Type I error. The first interpretation evaluates the probability of a Type I error for the whole set of tests, whereas the second evaluates the probability for only one test at a time.

Probability in the Family

A family of tests is the technical term for a series of tests performed on a set of data. This section shows how to compute the probability of rejecting the null hypothesis at least once in a family of tests when the null hypothesis is true.

For convenience, suppose that the significance level is set at α = .05. For each test, the probability of making a Type I error is equal to α = .05. The events "making a Type I error" and "not making a Type I error" are complementary events (they cannot occur simultaneously). Therefore, the probability of not making a Type I error on one trial is equal to

1 − α = 1 − .05 = .95.

Recall that when two events are independent, the probability of observing these two events together is the product of their probabilities. Thus, if the tests are independent, the probability of not making a Type I error on the first and the second tests is

.95 × .95 = (1 − .05)² = (1 − α)².

With three tests, the probability of not making a Type I error on all tests is

.95 × .95 × .95 = (1 − .05)³ = (1 − α)³.

For a family of C tests, the probability of not making a Type I error for the whole family is

(1 − α)^C.

For example, with a family of C = 10 tests, the probability of not making a Type I error on the family is

(1 − α)^C = (1 − .05)^10 = .599.

Now, the probability of making one or more Type I errors on the family of tests can be determined. This event is the complement of the event of not making a Type I error on the family, and therefore, it is equal to

1 − (1 − α)^C.

For this example,

1 − (1 − .05)^10 = .401.

So, with an α level of .05 for each of the 10 tests, the probability of incorrectly rejecting the null hypothesis at least once is .401. This example makes clear the need to distinguish between two meanings of α when performing multiple tests:

1. The probability of making a Type I error when dealing only with a specific test. This probability is denoted α[PT] (pronounced "alpha per test"). It is also called the testwise alpha.
2. The probability of making at least one Type I independent tests, and you want to limit the risk
error for the whole family of tests. This probability is denoted α(PF) (pronounced "alpha per family of tests"). It is also called the familywise or the experimentwise alpha.

How to Correct for Multiple Tests

Recall that the probability of making at least one Type I error for a family of C tests is

    α(PF) = 1 − (1 − α(PT))^C .    (1)

To keep the probability of making at least one Type I error to an overall value of α(PF) = .05, you will consider a test significant if its associated probability is smaller than

    α(PT) = 1 − (1 − α(PF))^(1/C) = 1 − (1 − .05)^(1/4) = .0127 .

With the Bonferroni approximation, a test reaches significance if its associated probability is smaller than

and should be preferred to Bonferroni, which is an approximation). If the test is not significant, then the procedure stops. If the first test is significant, the test with the second smallest p value is then corrected with a Bonferroni or a Šidák approach for a family of (C − 1) tests. The procedure stops when the first nonsignificant test is obtained or when all the tests have been performed. Formally, assume that the tests are ordered (according to their p values) from 1 to C, and that the procedure stops at the first nonsignificant test. When using the Šidák correction with Holm's approach, the corrected p value for the ith test, denoted p_Šidák,i|C, is computed as

    p_Šidák,i|C = 1 − (1 − p)^(C − i + 1) .    (8)

When using the Bonferroni correction with Holm's approach, the corrected p value for the ith test, denoted p_Bonferroni,i|C, is computed as

    p_Bonferroni,i|C = (C − i + 1) × p .    (9)

Just like the standard Bonferroni procedure, corrected p values larger than 1 are set equal to 1.

Example

Suppose that a study involving analysis of variance has been designed and that there are three tests to be performed. The p values for these three tests are equal to 0.000040, 0.016100, and 0.612300 (they have been ordered from the smallest to the largest). Thus, C = 3. The first test has an original p value of p = 0.000040. Because it is the first of the series, i = 1, and its corrected p value using the Holm-Šidák approach (cf. Equation 8) is equal to

    p_Šidák,1|3 = 1 − (1 − p)^(C − i + 1)
                = 1 − (1 − 0.000040)^(3 − 1 + 1)
                = 1 − (1 − 0.000040)^3    (10)
                = 0.000119 .

Using the Bonferroni approximation (cf. Equation 9) will give a corrected p value of p_Bonferroni,1|3 = .000120. Because the corrected p value for the first test is significant, the second test can then be performed, for which i = 2 and p = .016100. Using Equations 8 and 9, the corrected p values of p_Šidák,2|3 = .031941 and p_Bonferroni,2|3 = .032200 are found. These corrected p values are significant, and, so, the last test can be performed, for which i = 3. Because this is the last of the series, the corrected p values are now equal to the uncorrected p value of p = p_Šidák,3|3 = p_Bonferroni,3|3 = .612300, which is clearly not significant. Table 1 gives the results of Holm's sequential procedure along with the values of the standard Šidák and Bonferroni corrections.

Correction for Nonindependent Tests

The Šidák equation is derived assuming independence of the tests. When they are not independent, it gives a conservative estimate. The Bonferroni, being a conservative estimation of Šidák, will also give a conservative estimate. Similarly, the sequential Holm approach is conservative when the tests are not independent. Holm's approach is more powerful than the standard Šidák correction (because the sequential p_Šidák,i|C values are always smaller than or equal to the standard Šidák-corrected values), but it still controls the overall familywise error rate. The larger the number of tests, the larger the increase in power with Holm's
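The sequential corrections of Equations 8 and 9 can be sketched in a few lines of code. This is an illustrative sketch only (the function names are not from the entry), and the step-down stopping rule described in the text still has to be applied to the resulting values; the example reuses the three ordered p values from the entry's example.

```python
# Sketch of Holm's sequential corrections (Equations 8 and 9):
# the i-th smallest of C p values is corrected for a family of
# C - i + 1 tests. Corrected values larger than 1 are set to 1.
def holm_sidak(p_sorted):
    C = len(p_sorted)
    return [min(1.0, 1 - (1 - p) ** (C - i + 1))
            for i, p in enumerate(p_sorted, start=1)]

def holm_bonferroni(p_sorted):
    C = len(p_sorted)
    return [min(1.0, (C - i + 1) * p)
            for i, p in enumerate(p_sorted, start=1)]

# The three ordered p values from the example (C = 3)
pvals = [0.000040, 0.016100, 0.612300]
print(holm_sidak(pvals))       # ~ [0.000120, 0.031941, 0.612300]
print(holm_bonferroni(pvals))  # ~ [0.000120, 0.032200, 0.612300]
```

Note that each Bonferroni-corrected value is at least as large as the corresponding Šidák-corrected value, which is why the text calls Bonferroni a conservative approximation of Šidák.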
H0: σ²1/σ²2 = 1. When this null hypothesis is not rejected, then homogeneity of variance is confirmed, and the assumption is not violated.

The standard test for determining homogeneity of variance is Levene's test, and it is the one most frequently used in newer versions of statistical software. Alternative approaches to Levene's test have been proposed by O'Brien and by Brown and Forsythe. For a more detailed presentation on calculating Levene's test by hand, refer to Howell, 2007. Generally, tests of homogeneity of variance are tests on the deviations (squared or absolute) of scores from the sample mean or median. If, for example, Group A's deviations from the mean or median are larger than Group B's deviations, then it can be said that Group A's variance is larger than Group B's. These deviations will be larger (or smaller) if the variance of one of the groups is larger (or smaller). Based on Levene's test, it can be determined whether a statistically significant difference exists between the variances of two (or more) groups.

If the result of a Levene's test is not statistically significant, then there are no statistical differences between the variances of the groups in question, and the homogeneity of variance assumption is met. In this case, one fails to reject the null hypothesis H0: σ²1 = σ²2 that the variances of the populations from which the samples were drawn are the same. That is, the variances of the groups are not statistically different from one another, and t tests and ANOVAs can be performed and interpreted as normal. If the result of a Levene's test is statistically significant, then the null hypothesis, that the groups have equal variances, is rejected. It is concluded that there are statistically significant differences between the variances of the groups and the homogeneity of variance assumption has been violated. Note: The significance level will be determined by the researcher (i.e., whether the significance value exceeds .05, .01, etc.).

When the null hypothesis H0: σ²1 = σ²2 is rejected, and the homogeneity of variance assumption is violated, it is necessary to adjust the statistical procedure used and employ more conservative methods for testing the null hypothesis H0: μ1 = μ2. In these more conservative procedures, the standard error of the difference is estimated differently, and the degrees of freedom used to test the null hypothesis are adjusted. However, these more conservative methods, which alleviate the problem of heterogeneity of variance, leave the researcher with less statistical power for hypothesis testing to determine whether differences between group means exist. That is, the researcher is less likely to obtain a statistically significant result using the more conservative method.

Robustness of t Tests and ANOVAs

Problems develop when the variances of the groups are extremely different from one another (if the value of the largest variance estimate is more than four or five times that of the smallest variance estimate), or when there are large numbers of groups being compared in an ANOVA. Serious violations can lead to inaccurate p values and estimates of effect size. However, t tests and ANOVAs are generally considered robust to moderate departures from the underlying homogeneity of variance assumption, particularly when group sizes are equal (n1 = n2) and large. If the group with the larger sample also has the larger variance estimate, then the results of the hypothesis tests will be too conservative. If the larger group has the smaller variance, then the results of the hypothesis test will be too liberal. Methodologically speaking, if a researcher has violated the homogeneity of variance assumption, he or she might consider equating the sample sizes.

Follow-Up Tests

If the results of an ANOVA are statistically significant, then post hoc analyses are run to determine where specific group differences lie, and the results of Levene's test will determine which post hoc tests should be run and examined. Newer versions of statistical software provide the option of running post hoc analyses that take into consideration whether the homogeneity of variance assumption has been violated.

Although this entry has focused on homogeneity of variance testing for t tests and ANOVAs, tests of homogeneity of variance for more complex statistical models are the subject of current research.

Cynthia R. Davis
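The deviation-based logic described in the entry (a test on the absolute deviations of scores from a group center) can be sketched in plain Python. Centering on the group medians, as below, corresponds to the Brown and Forsythe variant mentioned above; this is an illustrative sketch, not the output of any particular statistical package.

```python
from statistics import mean, median

def levene_statistic(groups, center=median):
    # Levene-type statistic: a one-way ANOVA F computed on the
    # absolute deviations z_ij = |x_ij - center(group j)|.
    z = [[abs(x - center(g)) for x in g] for g in groups]
    k = len(z)                              # number of groups
    N = sum(len(g) for g in z)              # total sample size
    zbars = [mean(g) for g in z]            # per-group mean deviation
    zbar = sum(len(g) * zb for g, zb in zip(z, zbars)) / N
    ss_between = sum(len(g) * (zb - zbar) ** 2 for g, zb in zip(z, zbars))
    ss_within = sum((x - zb) ** 2 for g, zb in zip(z, zbars) for x in g)
    return ((N - k) / (k - 1)) * (ss_between / ss_within)

# The more spread-out second group yields a larger statistic
print(levene_statistic([[1, 2, 3, 4], [2, 4, 6, 8]]))  # 2.4
```

Comparing the resulting statistic against an F distribution with k − 1 and N − k degrees of freedom gives the p value used to decide about H0: σ²1 = σ²2.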
580 Homoscedasticity
distribution (e.g., normal or bell-shaped, single- or multipeaked, or skewed to the left or right), is best characterized visually using histograms, box plots, and stem-and-leaf plots. Although it is important to examine, individually, the distribution of each relevant variable, it is often necessary in multivariate analyses to evaluate the pattern that exists between two or more variables. Scatterplots are a useful technique to display the shape, direction, and strength of relationships between variables.

Examining Atypical Data Points

In addition to normality, data should always be preemptively examined for influential data points. Labeling observations as outside the normal range of the data can be complicated because decisions exist in the context of relationships among variables and the intended purpose of the data. For example, outlying X values are never problematic in ANOVA designs with equal cell sizes, but they introduce significant problems in regression analyses and unbalanced ANOVA designs. However, discrepant Y values are nearly always problematic. Visual detection of unusual observations is facilitated by box plots, partial regression leverage plots, partial residual plots, and influence-enhanced scatterplots. Examination of scatterplots and histograms of residual values often indicates the influence of discrepant values on the overall model fit, and whether the data point is extreme on Y (an outlier) or on X (a high-leverage data point). In normally distributed data, statistical tests (e.g., the z-score method, leverage statistics, or Cook's D) can also be used to detect discrepant observations. Some ways of handling atypical data points in normally distributed data include the use of trimmed means, scale estimators, or confidence intervals. Removal of influential observations should be guided by the research question and the impact on analysis and conclusions. Sensitivity analyses can guide decisions about whether these values influence results.

Evaluating Homoscedasticity

In addition to examining data for normality and the presence of influential data points, graphical and statistical methods are also used to evaluate homoscedasticity. These methods are often conducted as part of the analysis.

Regression Analyses

In regression analysis, examination of the residual values is particularly helpful in evaluating homoscedasticity violations. The goal of regression analysis is that the model being tested will (ideally) account for all of the variation in Y. Variation in the residual values suggests that the regression model has somehow been misspecified, and graphical displays of residuals are informative in detecting these problems. In fact, the techniques for examining residuals are similar to those used with the original data to assess normality and the presence of atypical data points.

Scatterplots are a useful and basic graphical method to determine homoscedasticity violations. A specific type of scatterplot, known as a residual plot, plots residual Y values along the vertical axis and observed or predicted Y values along the horizontal (X) axis. If a constant spread in the residuals is observed across all values of X, homoscedasticity exists. Plots depicting heteroscedasticity commonly show the following two patterns: (1) residual values increase as values of X increase (i.e., a right-opening megaphone pattern), or (2) residuals are highest for middle values of X and decrease as X becomes smaller or larger (i.e., a curvilinear relationship). Researchers often superimpose lowess lines (i.e., lines that trace the overall trend of the data) at the mean, as well as 1 standard deviation above and below the mean of the residuals, so that patterns of homoscedasticity can be more easily recognized.

Univariate and Multivariate Analyses of Variance

In contexts where one or more of the independent variables is categorical (e.g., ANOVA, t tests, and MANOVA), several statistical tests are often used to evaluate homoscedasticity.

In the ANOVA context, homogeneity of variance violations can be evaluated using the FMax and Levene's tests. The FMax test is computed by dividing the largest group variance by the smallest group variance. If the FMax exceeds the critical value found in the F-value table, heteroscedasticity might exist. Some researchers suggest an
FMax of 3.0 or more indicates a violation of the assumption. However, conservative estimates (p < .025) are suggested when evaluating F ratios because the FMax test is highly sensitive to issues of non-normality. Therefore, it is often difficult to determine whether significant values are caused by heterogeneity of variance or by normality violations of the underlying population. Levene's test is another statistical test that assumes equal variance across levels of the independent variable. If the p value obtained from Levene's test is less than .05, it can be assumed that differences between variances in the population exist. Compared with the FMax test, Levene's test has no required normality assumption.

In the case of MANOVA, when more than one continuous dependent variable is being assessed, the same homogeneity of variance assumption applies. Because there are multiple dependent variables, a second assumption exists that the intercorrelations among these dependent measures (i.e., covariances) are the same across different cells or groups of the design. Box's M test for equality of variance-covariance matrices is used to test this assumption. A statistically significant (p < .05) Box's M test indicates heteroscedasticity. However, results should be interpreted cautiously because the Box's M test is highly sensitive to departures from normality.

Remediation for Violations of Homoscedasticity

Data Transformations

Because homoscedasticity violations typically result from normality violations or the presence of influential data points, it is most beneficial to address these violations first. Data transformations are mathematical procedures that are used to modify variables that violate the statistical assumption of homoscedasticity. Two types of data transformations exist: linear and nonlinear transformations. Linear transformations, produced by adding, subtracting, multiplying, or dividing the variable under consideration by a constant value (or by a combination of these operations), preserve the relative distances of data points and the shape of the distribution. Conversely, nonlinear transformations use logs, roots, powers, and exponentials that change relative distances between data points and, therefore, influence the shape of distributions. Nonlinear transformations might be useful in multivariate analyses to normalize distributions and address homoscedasticity violations. Typically, transformations performed on X values more accurately address normality violations than transformations on Y.

Before transforming the data, it is important to determine both the extent to which the variable(s) under consideration violate the assumptions of homoscedasticity and normality and whether atypical data points influence distributions and the analysis. Examination of residual diagnostics and plots, stem-and-leaf plots, and boxplots is helpful to discern patterns of skewness, non-normality, and heteroscedasticity. In cases where a small number of influential observations is producing heteroscedasticity, removal of these few cases might be more appropriate than a variable transformation.

Tukey's Ladder of Power Transformations

Tukey's ladder of power transformations (the "Bulging Rule") is one of the most commonly used and simplest data transformation tools. Moving up on the ladder (i.e., applying exponential functions) reduces negative skew and pulls in low outliers. Roots and logs characterize descending functions on Tukey's ladder and address problems with positive skew and high atypical data points. Choice of transformation strategy should be based on the severity of assumption violations. For instance, square root functions are often suggested to correct a moderate violation, and inverse square root functions are examples of transformations that address more severe violations.

Advantages and Disadvantages

There are advantages and disadvantages to conducting data transformations. First, they can remediate homoscedasticity problems and improve accuracy in a multivariate analysis. However, interpretability of results is often challenged because transformed variables are quite different from the original data values. In addition, transformations to variables where the scale range is less than 10 are often minimally effective. After performing
Honestly Significant Difference (HSD) Test 583
Table 2  ANOVA Results for the Replication of Loftus & Palmer (1974)

Source        df    SS        MS       F      Pr(F)
Between: A     4    1,460.00  365.00   4.56   .0036
Error: S(A)   45    3,600.00   80.00
Total         49    5,060.00

Source: Adapted from Loftus & Palmer (1974).

Table 3  HSD

                         Experimental Group
                  Contact    Hit       Bump      Collide    Smash
                  M1+ = 30   M2+ = 35  M3+ = 38  M4+ = 41   M5+ = 46
M1+ = 30 Contact  0.00       5.00 ns   8.00 ns   11.00 ns   16.00**
M2+ = 35 Hit                 0.00      3.00 ns    6.00 ns   11.00 ns
M3+ = 38 Bump                          0.00       3.00 ns    8.00 ns
M4+ = 41 Collide                                  0.00       5.00 ns
M5+ = 46 Smash                                               0.00

Source: Difference between means and significance of pairwise comparisons from the (fictitious) replication of Loftus & Palmer (1974).
Notes: Differences larger than 11.37 are significant at the α = .05 level and are indicated with *; differences larger than 13.86 are significant at the α = .01 level and are indicated with **.

    |M_a+ − M_a′+| > HSD = q_A,α √[ MS_S(A) × ½ (1/S_a + 1/S_a′) ] .    (2)

When there is an equal number of observations per group, Equation 2 can be simplified as

    HSD = q_A,α √( MS_S(A) / S ) .    (3)

To evaluate the difference between the means of Groups a and a′, the absolute value of the difference between the means is taken and compared with the value of HSD. If

    |M_a+ − M_a′+| ≥ HSD,    (4)

then the comparison is declared significant at the chosen α level (usually .05 or .01). This procedure is then repeated for all A(A − 1)/2 comparisons. Note that HSD has less power than almost all other post hoc comparison methods (e.g., Fisher's
Hypothesis 585
LSD or Newman-Keuls) except the Scheffé approach and the Bonferroni method, because the α level for each difference between means is set at the same level as the largest difference. The differences and significance of all pairwise comparisons are shown in Table 3.

Hervé Abdi and Lynne J. Williams
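For the equal-n case, Equation 3 is a one-line computation. The sketch below plugs in the values from the fictitious replication (MS_S(A) = 80 from Table 2; the 49 total df imply N = 50, so S = 10 observations in each of the 5 groups). The studentized range value q ≈ 4.02 for 5 groups and 45 error df at α = .05 is taken from a standard table and is an assumption here, not a value quoted in the entry.

```python
import math

# Equation 3: HSD = q * sqrt(MS_error / S), with S observations per group.
def hsd(q, ms_error, s):
    return q * math.sqrt(ms_error / s)

# q(5 groups, 45 df, alpha = .05) ~ 4.02 from a studentized-range table
threshold = hsd(4.02, 80.0, 10)
print(round(threshold, 2))   # ~ 11.37, the .05 cutoff quoted in Table 3
```

Any pairwise mean difference exceeding this threshold (e.g., Smash vs. Contact, 16.00) is declared significant at the .05 level.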
of existing theory and past research, and they motivate the design of the study. The variables represent the embodiment of the hypotheses in terms of what the researcher can manipulate and observe.

A hypothesis is sometimes described as an educated guess. However, whether this is a good description of a hypothesis has been questioned. For example, many people might agree with the hypothesis that an ice cube will melt in less than 30 minutes if put on a plate and placed on a table. However, after doing quite a bit of research, one might learn about how temperature and air pressure can change the state of water and restate the hypothesis as: an ice cube will melt in less than 30 minutes in a room at sea level with a temperature of 20 °C or 68 °F. If one does further research and gains more information, the hypothesis might become: an ice cube made with tap water will melt in less than 30 minutes in a room at sea level with a temperature of 20 °C or 68 °F. This example shows that a hypothesis is not really just an educated guess. It is a tentative explanation for an observation, phenomenon, or scientific problem that can be tested by further investigation. In other words, a hypothesis is a tentative statement about the expected relationship between two or more variables. The hypothesis is tentative because its accuracy will be tested empirically.

Types of Hypotheses

Null Hypothesis

In statistics, there are two types of hypotheses: the null hypothesis (H0) and the alternative/research/maintained hypothesis (Ha). A null hypothesis (H0) is a falsifiable proposition, which is assumed to be true until it is shown to be false. In other words, the null hypothesis is presumed true until statistical evidence, in the form of a hypothesis test, indicates that it is highly unlikely. When the researcher has a certain degree of confidence, usually 95% to 99%, that the data do not support the null hypothesis, the null hypothesis will be rejected. Otherwise, the researcher will fail to reject the null hypothesis.

In scientific and medical applications, the null hypothesis plays a major role in testing the significance of differences in treatment and control groups. Setting up the null hypothesis is an essential step in testing statistical significance. After formulating a null hypothesis, one can establish the probability of observing the obtained data.

Alternative Hypothesis

The alternative hypothesis and the null hypothesis are the two rival hypotheses whose likelihoods are compared by a statistical hypothesis test. For example, an alternative hypothesis can be a statement that the means, variances, and so on, of the samples being tested are not equal. It describes the possibility that the observed difference or effect is true. The classic approach to deciding whether the alternative hypothesis will be favored is to calculate the probability that the observed effect will occur if the null hypothesis is true. If the value of this probability (p value) is sufficiently small, then the null hypothesis will be rejected in favor of the alternative hypothesis. If not, then the null hypothesis will not be rejected.

Examples of Null Hypothesis and Alternative Hypothesis

If a two-tailed alternative hypothesis is that application of Educational Program A will influence students' mathematics achievements (Ha: μProgram A ≠ μcontrol), the null hypothesis is that application of Program A will have no effect on students' mathematics achievements (H0: μProgram A = μcontrol). If a one-tailed alternative hypothesis is that application of Program A will increase students' mathematics achievements (Ha: μProgram A > μcontrol), the null hypothesis remains that use of Program A will have no effect on students' mathematics achievements (H0: μProgram A = μcontrol). It is not merely the opposite of the alternative hypothesis; that is, it is not that the application of Program A will not lead to increased mathematics achievements in the students. However, this does remain the true null hypothesis.

Hypothesis Writing

What makes a good hypothesis? Answers to the following three questions can help guide
hypothesis writing: (1) Is the hypothesis based on a review of the existing literature? (2) Does the hypothesis include the independent and dependent variables? (3) Can this hypothesis be tested in the experiment? For a good hypothesis, the answer to every question should be "Yes."

Some statisticians argue that the null hypothesis cannot be as general as indicated earlier. They believe the null hypothesis must be exact and free of vagueness and ambiguity. According to this view, the null hypothesis must be numerically exact: it must state that a particular quantity or difference is equal to a particular number. Some other statisticians believe that it is desirable to state direction as part of the null hypothesis, or as part of a null hypothesis/alternative hypothesis pair. If the direction is omitted, then it will be quite confusing to interpret the conclusion if the null hypothesis is not rejected. Therefore, they think it is better to include the direction of the effect if the test is one-sided, for the sake of overcoming this ambiguity.

Jie Chen, Neal Kingston, Gail Tiemann, and Fei Gu

See also Directional Hypothesis; Nondirectional Hypotheses; Null Hypothesis; Research Hypothesis; "Sequential Tests of Statistical Hypotheses"

Further Readings

Agresti, A., & Finlay, B. (2008). Statistical methods for the social sciences (4th ed.). San Francisco, CA: Dellen.
Nolan, S. A., & Heinzen, T. E. (2008). Statistics for the behavioral sciences (3rd ed.). New York: Worth.
Shavelson, R. J. (1998). Statistical reasoning for the behavioral sciences (3rd ed.). Needham Heights, MA: Allyn & Bacon.
Slavin, R. (2007). Educational research in an age of accountability. Upper Saddle River, NJ: Pearson.
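The classic decision rule described in this entry, rejecting the null hypothesis when the p value falls below the chosen significance level, can be sketched in a couple of lines; the function name and the .05 default are illustrative.

```python
# Classic decision rule: reject H0 when the p value is below alpha;
# otherwise, fail to reject H0 (H0 is never "accepted").
def decide(p_value, alpha=0.05):
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.003))   # reject H0
print(decide(0.21))    # fail to reject H0
```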
I
INCLUSION CRITERIA

Inclusion criteria are a set of predefined characteristics used to identify subjects who will be included in a research study. Inclusion criteria, along with exclusion criteria, make up the selection or eligibility criteria used to rule in or out the target population for a research study. Inclusion criteria should respond to the scientific objective of the study and are critical to accomplishing it. Proper selection of inclusion criteria will optimize the external and internal validity of the study, improve its feasibility, lower its costs, and minimize ethical concerns; specifically, good selection criteria will ensure the homogeneity of the sample population, reduce confounding, and increase the likelihood of finding a true association between exposure/intervention and outcomes. In prospective studies (cohorts and clinical trials), they also will determine the feasibility of follow-up and the attrition of participants. Stringent inclusion criteria might reduce the generalizability of the study findings to the target population, hinder recruitment and sampling of study subjects, and eliminate a characteristic that might be of critical theoretical and methodological importance.

Each additional inclusion criterion implies a different sample population and will add restrictions to the design, creating increasingly controlled conditions, as opposed to everyday conditions closer to real life, thus influencing the utility and applicability of study findings. Inclusion criteria must be selected carefully based on a review of the literature, in-depth knowledge of the theoretical framework, and the feasibility and logistic applicability of the criteria. Often, research protocol amendments that change the inclusion criteria will result in two different sample populations that might require separate data analyses with a justification for drawing composite inferences.

The selection and application of inclusion criteria also will have important consequences for the assurance of ethical principles; for example, including subjects based on race, gender, age, or clinical characteristics also might imply an uneven distribution of benefits and harms, threats to the autonomy of subjects, and lack of respect. Not including women, children, or the elderly in the study might have important ethical implications and diminish the compliance of the study with research guidelines such as those of the National Institutes of Health in the United States for inclusion of women, children, and ethnic minorities in research studies.

Use of standardized inclusion criteria is necessary to accomplish consistency of findings across similar studies on a research topic. Common inclusion criteria refer to demographic, socioeconomic, health and clinical characteristics, and outcomes of study subjects. Meeting these criteria requires screening eligible subjects using valid and reliable measurements in the form of standardized exposure and outcome measurements to ensure that
subjects who are said to meet the inclusion criteria really have them (sensitivity) and those who are said not to have them really do not have them (specificity). Such measurements also should be consistent and repeatable every time they are obtained (reliability). Good validity and reliability of inclusion criteria will help minimize random error, selection bias, misclassification of exposures and outcomes, and confounding. Inclusion criteria might be difficult to ascertain; for example, an inclusion criterion stating that "subjects with type II diabetes mellitus and no other conditions will be included" will require, in addition to clinical ascertainment of type II diabetes mellitus, evidence that subjects do not have cardiovascular disease, hypertension, cancer, and so on, which will be costly, unfeasible, and unlikely to rule out completely. A similar problem develops when using as an inclusion criterion "subjects who are in good health," because a completely clean bill of health is difficult to ascertain. Choosing inclusion criteria with high validity and reliability will likely improve the likelihood of finding an association, if there is one, between the exposures or interventions and the outcomes; it also will decrease the required sample size. For example, inclusion criteria such as tumor markers that are known to be prognostic factors of a given type of cancer will be correlated more strongly with cancer than unspecific biomarkers or clinical criteria. Inclusion criteria that identify demographic, temporal, or geographic characteristics will have scientific and practical advantages and disadvantages; restricting subjects to male gender or to adults might increase the homogeneity of the sample, thus helping to control confounding. Inclusion criteria that include selection of subjects during a certain period of time might overlook important secular trends in the phenomenon under study, but not establishing a feasible period of time might make conducting the study unfeasible. Geographic inclusion criteria that establish selecting a population from a hospital also might select a biased sample that will preclude the generalizability of the findings, although it might be the only alternative for conducting the study. In studies of rheumatoid arthritis, including patients with at least 12 tender or swollen joints will make the recruitment of a sufficient number of patients difficult and will likely decrease the generalizability of study results to the target population.

The selection of inclusion criteria should be guided by ethical and methodological issues; for example, in a clinical trial to treat iron deficiency anemia among reproductive-age women, including women to assess an iron supplement therapy would not be ethical if women with life-threatening or very low levels of anemia are included in a nontreatment arm of a clinical trial for follow-up with an intervention that is less than the standard of care. Medication washout might be established as an inclusion criterion to prevent interference of a therapeutic drug with the treatment under study. In observational prospective studies, including subjects with a disease to assess more terminal clinical endpoints without providing therapy also would be unethical, even if the population had no access to medical care before the study.

In observational studies, inclusion criteria are used to control for confounding, in the form of specification or restriction, and matching. Specification or restriction is a way of controlling confounding; potential confounder variables are eliminated from the study sample, thus removing any imbalances between the comparison groups. Matching is another strategy to control confounding; matching variables are defined by inclusion criteria that will homogenize imbalances between comparison groups, thus removing confounding. The disadvantage is that variables eliminated by restriction or balanced by matching will not be amenable to assessment as potential risk factors for the outcome at hand. This also will limit generalizability, hinder recruitment, and require more time and resources for sampling. In studies of screening tests, inclusion criteria should ensure the selection of the whole spectrum of disease severity and clinical forms. Including limited degrees of disease severity or clinical forms will likely result in a biased favorable or unfavorable assessment of screening tests.

Sets of recommended inclusion criteria have been established to enhance methodological rigor and comparability between studies; for example, the American College of Chest Physicians and the Society of Critical Care Medicine developed inclusion criteria for clinical trials of sepsis; the new criteria rely on markers of organ dysfunction rather than on blood culture positivity or clinical signs and symptoms. Recently, the Scoliosis Research Society in the United States has proposed new
standardized inclusion criteria for brace studies in the treatment of adolescent idiopathic scoliosis. Also, the International Campaign for Cures of Spinal Cord Injury Paralysis has introduced inclusion and exclusion criteria for the conduct of clinical trials for spinal cord injury. Standardized inclusion criteria must be assessed continuously because it might be possible that characteristics used as inclusion criteria change over time; for example, it has been shown that the median number of swollen joints in patients with rheumatoid arthritis has decreased over time. Drawing inferences using old criteria that are no longer valid would no longer be relevant for the current population.

In case control studies, inclusion criteria will define the subjects with the disease and those without it; cases and controls should be representative of the diseased and nondiseased subjects in the target population. In occupational health research, it is known that selecting subjects in the work setting might result in a biased sample of subjects who are healthier and at lower risk than the population at large. Selection of controls must be independent of exposure status, and nonparticipation rates might introduce bias. Matching by selected variables will remove the confounding effects of those variables on the association under study; obviously, those variables will not be assessed as predictors of the outcome. Matching also might introduce selection bias, complicate recruitment of subjects, and limit inference to the target population. Additional specific inclusion criteria will be needed for nested case control and case cohort designs and two-stage or multistage sampling. Common controls are population controls, neighborhood controls, hospital or registry controls, friends, relatives, deceased

Inclusion criteria for experimental studies involve different considerations than those for observational studies. In clinical trials, inclusion criteria should maximize the generalizability of findings to the target population by allowing the recruitment of a sufficient number of individuals with expected outcomes, minimizing attrition rates, and providing a reasonable follow-up time for effects to occur.

Automatized selection and standardization of inclusion criteria for clinical trials using electronic health records has been proposed to enhance the consistency of inclusion criteria across studies.

Eduardo Velasco

See also Bias; Confounding; Exclusion Criteria; Reliability; Sampling; Selection; Sensitivity; Specificity; Validity of Research Conclusions

Further Readings

Gordis, L. (2008). Epidemiology (4th ed.). Philadelphia: W. Saunders.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., Hearst, N., & Newman, T. B. (2007). Designing clinical research: An epidemiologic approach (3rd ed.). Philadelphia: Wolters Kluwer/Lippincott Williams & Wilkins.
LoBiondo-Wood, G., & Haber, J. (2006). Nursing research: Methods and critical appraisal for evidence-based practice (6th ed.). St. Louis, MO: Mosby.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Szklo, M., & Nieto, F. J. (2007). Epidemiology: Beyond the basics. Sudbury, MA: Jones & Bartlett Publishers.
controls, and proxy respondents. Control selection
usually requires that controls remain disease free
for a given time interval, the exclusion of controls
who become incident cases, and the exclusion of INDEPENDENT VARIABLE
controls who develop diseases other than the one
studied, but that might be related to the exposure Independent variable is complementary to depen-
of interest. dent variable. These two concepts are used primar-
In cohort studies, the most important inclusion ily in their mathematical sense, meaning that the
criterion is that subjects do not have the disease value of a dependent variable changes in response
outcome under study. This will require ascertain- to that of an independent variable. In research
ment of disease-free subjects. Inclusion criteria design, independent variables are those that
should allow efficient accrual of study subjects, a researcher can manipulate, whereas dependent
good follow-up participation rates, and minimal variables are the responses to the effects of inde-
attrition. pendent variables. By purposefully manipulating
Purposeful selection and control of independent variables before and during a study is fundamental to both the internal and the external validity of that study.

To illustrate what constitutes a dependent variable and what is an independent variable, let us assume an agricultural experiment on the productivity of two wheat varieties that are grown under identical or similar field conditions. Productivity is measured by tons of wheat grain produced per season per hectare. In this experiment, variety would be the independent variable and productivity the dependent variable. The qualifier "identical or similar field conditions" implies other extraneous (or nuisance) factors (i.e., covariates) that must be controlled, or taken account of, in order for the results to be valid. These other factors might be the soil fertility, the fertilizer type and amount, the irrigation regime, and so on. Failure to control or account for these factors could invalidate the experiment. This is an example of a controlled experiment. Similar examples of controlled experiments might be the temperature effect on the hardness of a type of steel and the speed effect on the crash results of automobiles in safety tests.

Consider also an epidemiological study on the relationship between physical inactivity and obesity in young children: The parameter(s) that measure physical inactivity, such as the hours spent watching television and playing video games and the means of transportation to and from daycares/schools, is (are) the independent variable(s). These are chosen by the researcher based on his or her preliminary research or on other reports in the literature on the same subject prior to the study. The parameter(s) that measure obesity, such as the body mass index, is (are) the dependent variable. To control for confounding, the researcher needs to consider, other than the main independent variables, any covariate that might influence the dependent variable. Examples might be the socioeconomic status of the parents and the diet of the families.

Independent variables are predetermined factors that one controls and/or manipulates in a designed experiment or an observational study. They are design variables that are chosen to incite a response of a dependent variable. Independent variables are not the primary interest of the experiment; the dependent variable is.

Shihe Fan

See also Bivariate Regression; Covariate; Dependent Variable

Further Readings

Hocking, R. R. (2003). Methods and applications of linear models: Regression and the analysis of variance. Hoboken, NJ: Wiley.
Kuehl, R. O. (1994). Statistical principles of research design and analysis. Belmont, CA: Wadsworth.
Montgomery, D. C. (2001). Design and analysis of experiments (5th ed.). Toronto, Ontario, Canada: Wiley.
Ramsey, F. L., & Schafer, D. W. (2002). The statistical sleuth: A course in methods of data analysis (2nd ed.). Pacific Grove, CA: Duxbury.
Wackerly, D. D., Mendenhall, W., III, & Scheaffer, R. L. (2002). Mathematical statistics with applications (6th ed.). Pacific Grove, CA: Duxbury.
Zolman, J. F. (1993). Biostatistics: Experimental design and statistical inference. New York: Oxford University Press.

INFERENCE: DEDUCTIVE AND INDUCTIVE

Reasoning is the process of making inferences—of drawing conclusions. Students of reasoning make a variety of distinctions regarding how inferences are made and conclusions are drawn. Among the oldest and most durable of them is the distinction between deductive and inductive reasoning, which contrasts conclusions that are logically implicit in the claims from which they are drawn with those that go beyond what is given.

Deduction involves reasoning from the general to the particular:

All mammals nurse their young.
Whales are mammals.
Therefore whales nurse their young.

Induction involves reasoning from the particular to the general:

All the crows I have seen are black.
Being black must be a distinguishing feature of crows.
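The contrast between the two patterns can be made concrete in a small program (an illustrative sketch, not part of the original entry; the animal and crow facts are simply the toy premises above):

```python
# Deduction: from the general rule "all mammals nurse their young"
# and the fact "whales are mammals," the conclusion follows
# necessarily -- it is implicit in the premises.
mammals = {"whale", "dog", "bat"}

def nurses_young(animal):
    """Apply the general rule to a particular case."""
    return animal in mammals

# Induction: generalize from observed particular cases; the
# conclusion goes beyond the data and can be overturned by a
# single counterexample (e.g., one white crow).
def induce_color(observations):
    """Return a tentative generalization, or None if the sample
    already contains a counterexample."""
    return observations[0] if len(set(observations)) == 1 else None

conclusion = induce_color(["black", "black", "black"])  # tentative
```

The deductive conclusion is valid whenever the premises are true; the inductive one is only more or less convincing, which mirrors the distinction drawn in the entry.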
arguments. These tools include syllogistic forms, calculi of classes and propositions, Boolean algebra, and a variety of diagrammatic aids to analysis such as truth tables, Euler diagrams, and Venn diagrams. Induction does not lend itself so readily to formalization; indeed (except in the case of mathematical induction, which is really a misnamed form of deduction) inductive reasoning is almost synonymous with informal reasoning. It has to do with weighing evidence, judging plausibility, and arriving at uncertain conclusions or beliefs that one can hold with varying degrees of confidence. Deductive arguments can be determined to be valid or invalid. The most one can say about an inductive argument is that it is more or less convincing.

Logic is often used to connote deductive reasoning only; however, it can be sufficiently broadly defined to encompass both deductive and inductive reasoning. Sometimes a distinction is made between formal and informal logic, to connote deductive and inductive reasoning, respectively.

Philosophers and logicians have found it much easier to deal with deductive than with inductive reasoning, and as a consequence, much more has been written about the former than about the latter, but the importance of induction is clearly recognized. Induction has been called the despair of the philosopher, but no one questions the necessity of using it.

Many distinctions similar to that between deductive and inductive reasoning have been made. Mention of two of them will suffice to illustrate the point. American philosopher/mathematician/logician Charles Sanders Peirce drew a contrast between a demonstrative argument, in which the conclusion is true whenever the premises are true, and a probabilistic argument, in which the conclusion is usually true whenever the premises are true. Hungarian/American mathematician George Pólya distinguished between demonstrative reasoning and plausible reasoning, demonstrative reasoning being the kind of reasoning by which mathematical knowledge is secured, and plausible reasoning that which we use to support conjectures. Ironically, although Pólya equated demonstrative reasoning with mathematics and described all reasoning outside of mathematics as plausible reasoning, he wrote extensively, especially in his 1954 two-volume Mathematics and Plausible Reasoning, about the role of guessing and conjecturing in mathematics.

The Interplay of Deduction and Induction

Any nontrivial cognitive problem is almost certain to require the use of both deductive and inductive inferencing, and one might find it difficult to decide, in many instances, where the dividing line is between the two. In science, for example, the interplay between deductive and inductive reasoning is continual. Observations of natural phenomena prompt generalizations that constitute the stuff of hypotheses, models, and theories. Theories provide the basis for the deduction of predictions regarding what should be observed under specified conditions. Observations are made under the conditions specified, and the predictions are either corroborated or falsified. If falsification is the result, the theories from which the predictions were deduced must be modified, and this requires inductive reasoning—guesswork and more hypothesizing. The modified theories provide the basis for deducing new predictions. And the cycle goes on.

In mathematics, a similar process occurs. A suggestive pattern is observed and the mathematician induces a conjecture, which, in some cases, becomes a theorem—which is to say it is proved by rigorous deduction from a specified set of axioms. Mathematics textbooks spend a lot of time on the proofs of theorems, emphasizing the deductive side of mathematics. What might be less apparent, but no less crucial to the doing of mathematics, is the considerable guesswork and induction that goes into the identification of conjectures that are worth exploring and the construction of proofs that will be accepted as such by other mathematicians.

Deduction and induction are essential also to meet the challenges of everyday life, and we all make extensive use of both, which is not to claim that we always use them wisely and well. The psychological research literature documents numerous ways in which human reasoning often leads to conclusions that cannot be justified either logically or empirically. Nevertheless, that the type of reasoning that is required to solve structured problems for the purposes of experimentation in the psychological laboratory does not always adequately represent the reasoning that is required to
deal with the problems that present themselves in real life has been noted by many investigators, and it is reflected in contrasts that are drawn between pure (or theoretical) and practical thinking, between academic and practical intelligence, between formal and everyday reasoning, and between other distinctions of a similar nature.

The Study of Inferencing

The study of deductive reasoning is easier than the study of inductive reasoning because there are widely recognized rules for determining whether a deductive argument is valid, whereas there are not correspondingly widely recognized rules for determining whether an inductive argument is sound. Perhaps, as a consequence, deductive reasoning has received more attention from researchers than has inductive reasoning.

Several paradigms for investigating deduction have been used extensively by students of cognition. None is more prominent than the "selection task" invented by British psychologist Peter Wason in the 1960s. In its simplest form, a person is shown four cards, laid out so that only one side of each card is visible, and is told that each card has either a vowel or a consonant on one side and either an even number or an odd number on the other side. The visible sides of the cards show a vowel, a consonant, an even number, and an odd number. The task is to specify which card or cards must be turned over to determine the truth or falsity of the claim If there is a vowel on one side, there is an even number on the other. The correct answer, according to conditional logic, is the card showing a vowel and the one showing an odd number. The original finding was that only a small minority of people given this task perform it correctly; the most common selections are either the card showing a vowel and the one showing an even number, or only the one showing a vowel. The finding has been replicated many times and with many variations of the original task. Several interpretations of the result have been proposed. That the task remains a focus of research more than 40 years after its invention is a testament to the ingenuity of its inventor and to the difficulty of determining the nature of human reasoning.

Raymond S. Nickerson

See also Experimental Design; Falsifiability; Hypothesis; Margin of Error; Nonexperimental Designs; Pre-Experimental Designs; Quasi-Experimental Designs

Further Readings

Cohen, L. J. (1970). The implications of induction. London: Methuen.
Galotti, K. M. (1989). Approaches to studying formal and everyday reasoning. Psychological Bulletin, 105, 331–351.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge: MIT Press.
Johnson-Laird, P. N., & Byrne, R. M. J. (1991). Deduction. Hillsdale, NJ: Lawrence Erlbaum.
Pólya, G. (1954). Mathematics and plausible reasoning, Vol. 1: Induction and analogy in mathematics, Vol. 2: Patterns of plausible inference. Princeton, NJ: Princeton University Press.
Rips, L. J. (1994). The psychology of proof: Deductive reasoning in human thinking. Cambridge: MIT Press.

INFLUENCE STATISTICS

Influence statistics measure the effects of individual data points or groups of data points on a statistical analysis. The effect of individual data points on an analysis can be profound, and so the detection of unusual or aberrant data points is an important part of nearly every analysis. Influence statistics typically focus on a particular aspect of a model fit or data analysis and attempt to quantify how the model changes with respect to that aspect when a particular data point or group of data points is included in the analysis. In the context of linear regression, where the ideas were first popularized in the 1970s, a variety of influence measures have been proposed to assess the impact of particular data points.

The popularity of influence statistics soared in the 1970s because of the proliferation of fast and relatively cheap computing, a phenomenon that allowed the easy examination of the effects of individual data points on an analysis for even relatively large data sets. Seminal works by R. Dennis Cook; David A. Belsley, Edwin Kuh, and Roy E. Welsch; and R. Dennis Cook and Sanford Weisberg led the way for an avalanche of new
techniques for assessing influence. Along with these new techniques came an array of names for them: DFFITS, DFBETAS, COVRATIO, Cook's D, and leverage, to name but a few of the more prominent examples. Each measure was designed to assess the influence of a data point on a particular aspect of the model fit: DFFITS on the fitted values from the model, DFBETAS on each individual regression coefficient, COVRATIO on the estimated residual standard error, and so on. Each measure can be readily computed using widely available statistical packages, and their use as part of an exploratory analysis of data is very common.

This entry first discusses types of influence statistics. Then we describe the calculation and limitations of influence statistics. Finally, we conclude with an example.

Types

Influence measures are typically categorized by the aspect of the model to which they are targeted. Some commonly used influence statistics in the context of linear regression models are discussed and summarized next. Analogs are also available for generalized linear models and for other more complex models, although these are not described in this entry.

Influence with respect to the fitted values of a model can be assessed using a measure called DFFITS, a scaled difference between the fitted values for the models fit with and without each individual respective data point:

$$\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{\mathrm{MSE}_{(i)}\, h_{ii}}},$$

where the notation in the numerator denotes fitted values for the response for models fit with and without the ith data point, respectively; MSE_(i) is the mean square for error in the model fit without data point i; and h_ii is the ith leverage, that is, the ith diagonal element of the hat matrix, H = X(X^T X)^{-1} X^T. Although DFFITS_i resembles a t statistic, it does not have a t distribution, and the size of DFFITS_i is judged relative to a cutoff proposed by Belsley, Kuh, and Welsch. A point is regarded as potentially influential with respect to fitted values if |DFFITS_i| > 2√(p/n), where n is the sample size and p is the number of estimated regression coefficients.

Influence with respect to estimated model coefficients can be measured either for individual coefficients, using a measure called DFBETAS, or through an overall measure of how individual data points affect the estimated coefficients as a whole. DFBETAS is a scaled difference between estimated coefficients for models fit with and without each individual datum, respectively:

$$\mathrm{DFBETAS}_{k,i} = \frac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{\sqrt{\mathrm{MSE}_{(i)}\, c_{kk}}}, \quad k = 1, \ldots, p,$$

where p is the number of coefficients and c_kk is the kth diagonal element of the matrix (X^T X)^{-1}. Again, although DFBETAS_{k,i} resembles a t statistic, it fails to have a t distribution, and its size is judged relative to a cutoff proposed by Belsley, Kuh, and Welsch whereby the ith point is regarded as influential with respect to the kth estimated coefficient if |DFBETAS_{k,i}| > 2/√n.

Cook's distance calculates an overall measure of the distance between coefficients estimated using models with and without each respective data point:

$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^T X^T X (\hat{\beta} - \hat{\beta}_{(i)})}{p\,\mathrm{MSE}}.$$

There are several rules of thumb commonly used to judge the size of Cook's distance in assessing influence, with some practitioners using relative standing among the values of the D_i's, whereas others prefer to use the 50% critical point of the F_{p, n-p} distribution.

Influence with respect to the estimate of residual standard error in a model fit can be assessed using a quantity COVRATIO that measures the change in the estimate of error spread between models fit with and without the ith data point:

$$\mathrm{COVRATIO}_i = \left(\frac{s_{(i)}}{s}\right)^{2p} \frac{1}{1 - h_{ii}},$$

where s_(i) is the estimate of residual standard error from a model fit without the ith data point. Influence with respect to residual scale is indicated if a point has a value of COVRATIO_i for which |COVRATIO_i − 1| ≥ 3p/n.
Many influence measures depend on the values of the leverages, h_ii, which are the diagonal elements of the hat matrix. The leverages are a function of the explanatory variables alone and, therefore, do not depend on the response variable at all. As such, they are not a direct measure of influence, but it is observed in a large number of situations that cases having high leverage tend to be influential. The leverages are closely related to the Mahalanobis distances of each data point's covariate values from the centroid of the covariate space, and so points with high leverage are in that sense "far" from the center of the covariate space. Because the average of the leverages is equal to p/n, where p is the number of covariates plus 1, it is common to consider points with twice the average leverage as having the potential to be influential; that is, points with h_ii > 2p/n would be investigated further for influence. Commonly, the use of leverage in assessing influence would occur in concert with investigation of other influence measures.

Calculation

Although the formulas given in the preceding discussion for the various influence statistics are framed in the context of models fit with and without each individual data point in turn, the calculation of these statistics can be carried out without the requirement for multiple model fits. This computational saving is particularly important in the context of large data sets with many covariates, as each influence statistic would otherwise require n + 1 separate model fits in its calculation. Efficient calculation is possible through the use of updating formulas. For example, the values of s_(i), the residual standard error from a model fit without the ith data point, can be computed via the formula

$$(n - p - 1)\, s_{(i)}^2 = (n - p)\, s^2 - \frac{e_i^2}{1 - h_{ii}},$$

where s is the residual standard error from the fit to the entire data set and e_i is the ith residual from the model fit to the full data set. Similarly,

$$\mathrm{DFFITS}_i = \frac{e_i \sqrt{h_{ii}}}{s_{(i)} (1 - h_{ii})},$$

and similar expressions not requiring multiple model fits can be developed for the other influence measures considered earlier.

Limitations

Each influence statistic discussed so far is an example of a single-case deletion statistic, based on comparing models fit on data sets differing by only one data point. In many cases, however, more than one data point in a data set exerts influence, either individually or jointly. Two problems that can develop in the assessment of multiple influence are masking and swamping. Masking occurs when an influential point is not detected because of the presence of another, usually adjacent, influential point. In such a case, single-case deletion influence statistics fail because only one of the two potentially influential points is deleted at a time when computing the influence statistic, still leaving the other data point to influence the model fit. Swamping occurs when "good" data points are identified as influential because of the presence of other, usually remote, influential data points that influence the model away from the "good" data point. It is difficult to overcome the potential problems of masking and swamping for several reasons: First, in high-dimensional data, visualization is often difficult, making it very hard to "see" which observations are "good" and which are not; second, it is almost never the case that the exact number of influential points is known a priori, and points might exert influence either individually or jointly in groups of unknown size; and third, multiple-case deletion methods, although simple in conception, remain difficult to implement in practice because of the computational burden associated with assessing model fits for very large numbers of subsets of the original data.

Examples

A simple example concludes this entry. A "good" data set with 20 data points was constructed, to which was added, first, a single obvious influential point, and then a second, adjacent influential point. The first panel of Figure 1 depicts the original data, and the second and third panels show the augmented data. An initial analysis of the "good" data reveals no points suspected of being
influential. When the first influential point is inserted in the data, its impact on the model is extreme (see the middle plot), and the influence statistics clearly point to this point as being influential. When the second extreme point is added (see the rightmost plot), its presence obscures the influence of the initially added point (point 22 masks point 21, and vice versa), and the pair of added points causes a known "good" point, labeled 4, to be considered influential (the pair (21, 22) swamps point 4). In the plots, the fitted model using all data points is marked using a solid line, whereas the fitted model using only the "good" data points is marked using a dashed line. In the rightmost plot, the dotted line reflects the fitted model using all data points except point 22, whereas the dot-dash line reflects the fitted model using all data points except point 21. Of course, in this simple example, the effects of the data points marked in the plot are clearly visible—the simple two-dimensional case usually affords such an easy visualization. In higher dimensions, such visualization is typically not possible, and so the values of influence statistics become more useful as tools for identifying unusual or influential data points.

Figure 1   Three panels plotting Y against X: Original Data; One Influential Point Added; Two Influential Points Added (figure not reproduced here)

Table 1 shows the values of the various influence statistics for the example depicted in the figure. Values of DFFITS, DFBETAS, Cook's D, COVRATIO, and leverage are given for the situations depicted in the middle and right panels of the figure. The values of the influence statistics for the case of the single added influential point show
how effectively the influence statistics betray the added influential point—their values are extremely high across all statistics. The situation is very different, however, when a second, adjacent influential point is added. In that case, the two added points mask each other, and at the same time, they swamp a known "good" point. The dotted line and the dot-dash line in the rightmost panel of Figure 1 clearly show how the masking occurs—the fitted line is barely changed when either of the points 21 or 22 is individually removed from the data set. These points exert little individual influence, but their joint influence is extreme.

Michael A. Martin and Steven Roberts

See also Data Cleaning; Data Mining; Outlier; SPSS

Further Readings

Atkinson, A. C., & Riani, M. (2000). Robust diagnostic regression analysis. New York: Springer-Verlag.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Chatterjee, S., & Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1, 379–393.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74, 169–174.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman & Hall.

INFLUENTIAL DATA POINTS

Influential data points are observations that exert an unusually large effect on the results of regression analysis. Influential data might be classified as outliers, as leverage points, or as both. An outlier is an anomalous response value, whereas a leverage point has atypical values of one or more of the predictors. It is important to note that not all outliers are influential.

Identification and appropriate treatment of influential observations are crucial in obtaining a valid descriptive or predictive linear model. A single, highly influential data point might dominate the outcome of an analysis with hundreds of observations: It might spell the difference between rejection and failure to reject a null hypothesis or might drastically change estimates of regression coefficients. Assessing influence can reveal data that are improperly measured or recorded, and it might be the first clue that certain observations were taken under unusual circumstances. This entry discusses the identification and treatment of influential data points.

Identifying Influential Data Points

A variety of straightforward approaches is available to identify influential data points on the basis of their leverage, outlying response values, or individual effect on regression coefficients.

Graphical Methods

In the case of simple linear regression (p = 2), a plot of the response versus predictor values might disclose influential observations, which will fall well outside the general two-dimensional trend of the data. Observations with high leverage as a result of the joint effects of multiple explanatory variables, however, are difficult to reveal by graphical means. Although simple graphing is effective in identifying extreme outliers and nonsensical values, and is valuable as an initial screen, the eyeball might not correctly discern less obvious influential points, especially when the data are sparse (i.e., small n).

Leverage

Observations whose influence is derived from their explanatory values are known as leverage points. The leverage of the ith observation is defined as

$$h_i = x_i (X'X)^{-1} x_i',$$

where x_i is the ith row of the n × p design matrix X for p predictors and sample size n. Larger values of h_i, where 0 ≤ h_i ≤ 1, are indicative of greater leverage. For reasonably large data sets (n − p > 50), a value of h_i greater than 2p/n is a standard criterion for classification as a leverage point; because Σ h_i = p (summing over i = 1, …, n), the mean of the h_i is p/n.
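As a sketch of the computation (hypothetical Python/NumPy code, not from the entry), the leverages and the 2p/n screening rule can be written as:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'.
    X is the n x p design matrix (first column of ones for the
    intercept, so p = number of predictors + 1)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

def leverage_points(X):
    """Indices of observations with h_i > 2p/n, the standard
    criterion of twice the average leverage."""
    n, p = X.shape
    h = leverages(X)
    return np.flatnonzero(h > 2 * p / n)
```

Because the h_i sum to p, the criterion 2p/n is simply twice the average leverage, consistent with the hat-matrix rule given in the Influence Statistics entry.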
Influential Data Points 601
… mentally handicapped, etc.). Justice ensures reasonable, nonexploitative, and carefully considered procedures through fair administration.

As a result of the Belmont Report, six norms were determined for conducting research: valid research designs, competence of the researcher, identification of consequences, selection of subjects, voluntary informed consent, and compensation for injury. Each norm coexists with the others to ensure participants' safety and should be followed by researchers when formulating and implementing a research project.

Several revisions were made to the ethics code as time progressed. More recently, the Common Rule was created and applied. The Common Rule established the following three main protective factors: review of research by an IRB, institutional assurances of compliance, and informed consent of participants. The Common Rule is still used today.

Required Components

Generally, three conditions must be met for informed consent to be considered valid: the participants must understand the information presented, the consent must be given voluntarily, and the participant must be competent to give consent. More specifically, federal law requires eight components be included in the consent statement.

First, an explanation of the purpose of the research, the expected duration of the subject's participation, and a description of the procedure must be included. Details of the methods are not required and are actually discouraged, to allow clearer comprehension on the part of the participant. Jargon, legal terminology, and irrelevant information should not be included. If deception is necessary, then participants must be informed that the details of the study cannot be explained prior to the study and that they will be given a full explanation of the study upon completion. Additional information on the use of deception is discussed later.

Second, any foreseeable risk or discomfort should be described. Risk implies that harm, loss, or damages might occur. This can include mere inconvenience or physical, psychological, social, economic, and legal risks.

Third, a description of any benefits to the subjects or others that are expected should be explained. Benefits include scientific knowledge; personally relevant benefits for the participants (i.e., food, money, and medical/mental health services); insight, training, learning, role modeling, empowerment, and future opportunities; psychosocial benefits (i.e., altruism, favorable attention, and increased self-esteem); kinship benefits (i.e., closeness to people or reduction of alienation); and community benefits (i.e., policies and public documentation).

Fourth, descriptions of alternatives to participation must be provided to potential participants. This provides additional resources to people who are being recruited.

The fifth requirement is a description of how confidentiality or anonymity will be ensured, and its limits. Anonymity can be ensured in several ways; examples include using numbers or code names instead of the names of the participants. The specifics of the study will likely determine how confidentiality will be ensured.

Sixth, if the research will have more than minimal risk, law requires a statement of whether compensation for injury will be provided. If compensation will be provided, a description of how should be included.

The seventh requirement is to provide the contact information of the individual(s) the participants can contact if they have any questions or in the event of harm.

Eighth, a statement must be made that participation is voluntary and that if one chooses not to participate, there will be no penalty or loss. In addition, it must be explained that if one chooses to participate, leaving the study at any time is acceptable and carries no penalty.

Last, the participants should receive a copy of the informed consent to keep.

Other Considerations

There are several elements that can be added to an informed consent form to make it more effective, although these items are not required. Examples include the following: any circumstances that might warrant termination of a participant regardless of consent, additional costs the participants might experience, the procedure if a participant
decided to leave the study and its consequences, and developments of the study.

Overall, an effective consent statement should be jargon free, easy to understand, and written in a friendly, simple manner. A lengthy description of the methods is not necessary, and any irrelevant information should not be included. As discussed previously, each legal requirement should be included as well.

In addition to the content and the style in which the informed consent is written, the manner in which the material is presented can increase or decrease participation. Establishing good rapport is very important and might require specific attention, as presenting the informed consent might become mundane if repeated a great deal. Using a friendly greeting and tone throughout the process of reading the informed consent is important. Using body language that displays openness will be helpful as well; a lack of congruence between what is verbalized and what is displayed through body language might lead potential participants to feel uncomfortable. Furthermore, using an appropriate amount of eye contact will help create a friendly atmosphere; too little or too much eye contact could potentially be offensive to certain individuals. Presenting a willingness to answer all concerns and questions is important as well. Overall, potential participants will better trust researchers who present themselves in a friendly, caring manner and who create a warm atmosphere.

Methods of Obtaining Informed Consent

There are several methods by which consent can be obtained. Largely, consent is acquired through written (signed) consent. Oral and behavioral consent are other options that are used less commonly.

In most cases, the IRB will require a signed consent form. A signed consent form provides proof that consent was indeed obtained. In riskier studies, having a witness sign as well can provide extra assurance.

The actual written consent form can take two forms: one that contains each required element outlined previously, or a short written consent document. If the full version is presented, the form can be read to or by the potential participants. The short form entails documenting that the required criteria of an informed consent were read orally to the participant or the participant's legally authorized representative. The IRB must also approve a written summary of what will be said orally to the potential participants. Only the short form will be signed by the participant. In addition, a witness must sign the short form and the summary of what was presented. A copy of the short form and the written summary should be provided to the participant. Whichever way the material is presented, the individual should be given adequate time to consider the material before signing.

Behavioral consent occurs when the consent form is waived or exempt. These situations are discussed later.

Special Cases

Several cases require special considerations in addition to the required components of informed consent. These special cases include minors, individuals with disabilities, language barriers, third parties, studies using the Internet for data collection, and the use of deception in research.

To protect child and adolescent research participants, safeguards are put into place. Children might be socially, cognitively, or psychologically immature and, therefore, cannot provide informed consent. In 1983, the Department of Health and Human Services adopted a federal regulation governing behavioral research on persons under the age of 18. The regulations that were put into place include several components. First, IRB approval must be obtained. Next, the documented permission of one parent or guardian and the assent of the child must be obtained. Assent is the child's affirmative agreement to participate in the study; a lack of objection is not enough to assent. The standard for assent is the child's ability to understand the purpose of the study and what will occur if one chooses to participate. In the case of riskier research, both parents' permission must be obtained. Furthermore, the research must involve no greater risk than the child normally encounters, unless the risk is justified by anticipated benefits to the child.

Adapting the assent process for young children can lead to better comprehension. Minimizing the level of difficulty by using simple language to describe the study is effective. After the
presentation of the information, the comprehension of the child should be assessed. Repeating the material or presenting it in a story or video format can be effective as well.

Another group that requires special consideration is individuals with disabilities. Assessing mental stability and illness as well as cognitive ability is important to determine the participants' ability to make an informed decision. Moreover, considerations of the degree of impairment and level of risk are critical to ensure the requirements of informed consent are met. Often, a guardian or surrogate will be asked to provide consent for participation in a study with disabled individuals.

Cultural issues also need consideration when obtaining consent from individuals of different nationalities and ethnicities. Individuals who speak a language other than English might have difficulty understanding the material presented in the informed consent. Special considerations should be made to address this issue. For instance, an interpreter can be used, or the form can be translated into the native language of the potential participant. Translators can reduce language barriers significantly and provide an objective presentation of the information about the study.

Protecting third parties in research, or information obtained about other people from a participant, is another special case. Although there are no guidelines currently in place, some recommendations exist. Contextual information that is obtained from participants is generally not considered private. However, when information about a third party becomes identifiable and is private, an informed consent must be obtained.

With advances in technology, many studies are being carried out through the Internet because of the efficiency and low cost to researchers. Unfortunately, ethical issues, including informed consent, are hard to manage online. Researchers agree ethical guidelines are necessary, and some recommendations have been made, but currently there is no standardized method of collecting and validating informed consent online. Concerns about obtaining consent online include being certain that the participant is of legal age to consent and that the material presented was understood. Maintaining confidentiality with the use of e-mail and managing deception are also difficult. Suggestions to address these issues include recruiting participants via the Internet and then sending informed consent forms through the mail to obtain signatures. When the signed copy is obtained, a code can be sent back to participate in the online study. Another suggestion is using a button: the participant has to click "I agree" after reading the informed consent. After giving consent, access to the next page would be granted.

The use of deception in research is another special concern that is extremely controversial. By definition, using deception does not meet the criteria of informed consent to provide full disclosure of information about the study to be conducted. Deception in research includes providing inaccurate information about the study, concealing information, using confederates, making false guarantees in regard to confidentiality, misrepresenting the identity of the investigator, providing false feedback to the participants, using placebos, using concealed recording devices, and failing to inform people they are part of a study. Proponents of deceptive research practice argue that deception provides useful information that could not otherwise be obtained if participants were fully informed. The American Psychological Association (APA) guidelines allow the use of deception with specific regulations: the use of deception must be justified clearly by the prospective scientific value, and other alternatives must be considered before using deception. Furthermore, debriefing must occur no later than the end of data collection, both to explain the use of deception in the study and to explain fully all the information that was originally withheld.

Exceptions

There are cases in which an IRB will approve consent procedures with elements missing or with revisions from the standard list of requirements, or in which it will waive the written consent entirely. In general, the consent form can be altered or waived if it is documented that the research involves no more harm than minimal risk to the participants, the waiver or alteration will not adversely affect the rights and welfare of the subjects, the research could not practically be carried out without the waiver or alteration, or the subjects will be provided with additional pertinent information after participating.
Alterations and waivers can also be made if it is demonstrated that the research will be conducted by or is subject to the approval of state or local government officials and is designed to study, evaluate, or examine (a) public benefit service programs; (b) procedures for obtaining benefits or services under these programs; (c) possible changes in or alternatives to those programs or procedures; or (d) possible changes in methods or level of payment for benefits or services under those programs. Furthermore, under the same conditions listed previously, required elements can be left out if the research could not practically be carried out without the waiver or alteration.

The IRB can also waive the requirement for the researcher to obtain a signed consent form in two cases. First, if the only record linking the participant and the research would be the consent form and the primary risk would be potential harm from a breach of confidentiality, then a signed consent form can be waived. Each participant in this case should be given the choice of whether he or she would like documentation linking him or her to the research, and the wishes of the participant should then be followed. Second, a signed form can be waived if the research to be conducted presents no more than minimal risk to the participants and involves no procedures that normally require a written consent outside of a research context. In each case, after approving a waiver, the IRB could require the researcher to provide participants with a written statement explaining the research that will be conducted.

Observational studies, ethnographic studies, survey research, and secondary analysis can all waive informed consent. In observational studies, a researcher observes the interaction of a group of people as a bystander. If the participants remain anonymous, then informed consent can be waived. An ethnographic study involves the direct observation of a group through an immersed researcher; waiving consent in ethnographic studies depends on the case and the vulnerability of the participants. In survey research, if the participant can hang up the phone or throw away the mail, then consent is likely not needed. If a survey is conducted in person and the risk to the participant is minimal, then consent can be waived as well. Furthermore, informed consent does not have to be obtained for secondary analysis of data.

Parental permission can also be waived under two circumstances. Consent can be waived for research involving only minimal risk, given that the research will not adversely affect the welfare of the participants and the research could not practically be carried out without a waiver. For instance, children who live on the streets might not have parents who could provide consent. In addition, parental permission can be waived if the parents do not properly protect the child.

The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research identified four other cases in which a waiver of parental consent could potentially occur. Research designed to study factors related to the incidence or treatment of conditions in adolescents who could legally receive treatment without parental permission is one case. The second is participants who are mature minors where the procedures involve no more risk than usual. Third, research designed to study neglected or abused children, and fourth, research involving children whose parents are not legally or functionally competent, do not require parental consent.

Child assent can also be waived if the child is deemed incapable of assenting (too young, immature, or psychologically unstable) or if obtaining assent would hinder the research possibilities. The IRB must approve the terms before dismissing assent.

There are several cases in which research is exempted from the IRB and, therefore, does not require the consent of the parent(s). One of the most common cases is research conducted in commonly accepted educational settings that involves normal educational practices. Examples of normal practices include research on instructional strategies and research on the effectiveness of, or the comparison among, instructional techniques, curricula, or classroom management. Another common example is research involving educational tests (i.e., cognitive, diagnostic, aptitude, or achievement), where the information from these tests cannot be linked to the participants.

Rhea L. Owens

See also Assent; Interviewing; Observations; Participants; Recruitment
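The click-to-agree flow described for Internet studies can be sketched in a few lines of code. The class and method names below are hypothetical, and a real deployment would sit inside an IRB-approved web application; this is only a minimal sketch of the gating logic (affirmative agreement required, withdrawal allowed at any time without penalty):

```python
from datetime import datetime, timezone

class ConsentGate:
    """Minimal sketch: record an explicit 'I agree' click before granting study access."""

    def __init__(self):
        self._consents = {}  # participant_id -> UTC timestamp of agreement

    def record_agreement(self, participant_id, clicked_agree):
        # Only an affirmative click counts; silence or a closed window is not consent.
        if clicked_agree:
            self._consents[participant_id] = datetime.now(timezone.utc)
        return clicked_agree

    def may_enter_study(self, participant_id):
        return participant_id in self._consents

    def withdraw(self, participant_id):
        # Participants may leave at any time without penalty.
        self._consents.pop(participant_id, None)

gate = ConsentGate()
gate.record_agreement("p001", clicked_agree=True)
print(gate.may_enter_study("p001"))  # True: access to the next page is granted
gate.withdraw("p001")
print(gate.may_enter_study("p001"))  # False: withdrawal revokes access
```

Note that this sketch addresses only access control; the separate online-consent concerns discussed above (verifying legal age and comprehension) would need their own checks.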
Instrumentation 607
… consistency enables investigators to gain confidence in the measuring ability or dependability of the particular instrument. Approaches to reliability consist of repeated measurements on an individual (i.e., test–retest and equivalent forms), internal consistency measures (i.e., split-half, Kuder–Richardson 20, Kuder–Richardson 21, and Cronbach's alpha), and interrater and intrarater reliability. Usually, reliability is shown in numerical form, as a coefficient. The reliability coefficient ranges from 0 (error throughout the measurement) to 1 (no error in the measurement); the higher the coefficient, the better the reliability.

Measurement Errors

Investigators need to attempt to minimize measurement errors whenever practical and possible for the purpose of accurately indicating the reported values collected by the instrument. Measurement errors can occur for various reasons and might result from the conditions of testing (e.g., test procedure not properly followed, testing site too warm or too cold for subjects to calmly respond to the instrument, noise distractions, or poor seating arrangements), from characteristics of the instrument itself (e.g., statements/questions not clearly stated, instruments invalid for measuring the concept in question, unreliable instruments, or statements/questions too long), from test subjects themselves (e.g., socially desirable responses provided by subjects, bogus answers provided by subjects, or updated or correct information not possessed by subjects), or combinations of these listed errors. Pamela L. Alreck and Robert B. Settle refer to the measurement errors described previously as instrumentation bias and error.

Concerning the validity and reliability of instrumentation, measurement errors can be both systematic and random. Systematic errors have an impact on instrument validity, whereas random errors affect instrument reliability. For example, if a group of students were given a math achievement test and the test was difficult for all examinees, then all test scores would be systematically lowered. These lowered scores indicate that the validity of the math achievement test is low for that particular student group or, in other words, that the instrument does not measure what it purports to measure, because the performance is low for all subjects. Measurement errors can also take place in a random fashion. In this case, for example, if a math achievement test is reliable and if a student has been projected to score 70 based on her previous performance, then investigators would expect test scores of this student to be close to the projected score of 70. If the same examination were administered on several different occasions, the scores obtained (e.g., 68, 71, and 72) might not be the exact projected score, but they are pretty close. In this case, the differences in test scores would be caused by random variation. Conversely, of course, if the test is not reliable, then considerable fluctuations in test scores would not be unusual. In fact, any values or scores obtained from such an instrument would be, more or less, affected by random errors, and researchers can assume that no instrument is totally free from random errors. It is imperative to note that a valid instrument must have reliability. An instrument can, however, be reliable but invalid—consistently measuring the wrong thing.

Collectively, instrumentation involves the whole process of instrument development and data collection. A good and responsible research effort requires investigators to specify where, when, and under what conditions the data are obtained, both to provide scientific results and to facilitate similar research replications. In addition to simply indicating where, when, and under what conditions the data are obtained, the following elements are part of the instrumentation concept and should be clearly described and disclosed by investigators: how often the data are to be collected, who will collect the data, and what kinds of data-collection methods are employed. In summary, instrumentation is a term referring to the process of identifying and handling the variables that are intended to be measured, in addition to describing how investigators establish the quality of the instrumentation concerning validity and reliability of the proposed measures, how to minimize measurement errors, and how to proceed in the process of data collection.

Instrumentation as a Threat to Internal Validity

As discussed by Donald T. Campbell and Julian Stanley in Experimental and Quasi-Experimental
Designs for Research, instrumentation, which is also named instrument decay, is one of the threats to internal validity. It refers to changes in the calibration of a measuring instrument or changes in the persons collecting the data that can adversely generate differences in the data gathered, thereby affecting the internal validity of a study. This threat can result from data-collector characteristics, data-collector bias, and the decaying effect. Accordingly, certain research designs are susceptible to this threat.

Data Collector Characteristics

The results of a study can be affected by the characteristics of data collectors. When two or more data collectors are employed as observers, scorers, raters, or recorders in a research project, a variety of individual characteristics (e.g., gender, age, working experience, language usage, and ethnicity) can interject themselves into the process and thereby lead to biased results. For example, this situation might occur when the performance of a given group is rated by one data collector while the performance of another group is collected by a different person. Suppose that both groups perform the task equally well per the performance criteria; however, the score of one group is significantly higher than that of the other group. The difference in raters would be highly suspect in causing the variations in measured performance. The principal controls to this threat are to use identical data collector(s) throughout the data-collection process, to analyze data separately for every data collector, to precalibrate or make certain that every data collector is equally skilled in the data-collection task, or to ensure that each rater has the opportunity to collect data from each group.

Data Collector Bias

It is possible that data collectors might unconsciously treat certain subjects or groups differently than others. The data or outcomes generated under such conditions would inevitably produce biased results. Data-collector bias can occur regardless of how many investigators are involved in the collection effort; a single data-collection agent is subject to bias. Examples of the bias include presenting "leading" questions to the persons being interviewed, allowing some subjects to use more time than others to complete a test, or screening or editing sensitive issues or comments by those collecting the data. The primary controls to this threat are to standardize the measuring procedure and to keep data collectors "blind." Principal investigators need to provide training and standardized guidelines to make sure that data collectors are aware of the importance of measurement consistency within the process of data collection. With regard to keeping data collectors "blind," principal investigators need to keep data collectors unaware of which condition individual subjects or groups (e.g., control group vs. experimental group) are being tested or observed under in a research effort.

Decaying Effect

When the data generated from an instrument allow various interpretations, and the process of handling those interpretations is tedious and/or difficult, requiring rigorous discernment, an investigator who scores or comments on these instruments one after another can eventually become fatigued, thereby leading to scoring differences. A change in the outcome or conclusion supported by the data has now been introduced by the investigator, who is an extraneous source not related to the actual collected data. A common example would be that of an instructor who attempts to grade a large number of term papers. Initially, the instructor is thorough and painstaking in his or her assessment of performance. However, after grading many papers, tiredness, fatigue, and loss of clarity of focus gradually factor in and influence his or her judgments. The instructor then becomes more generous in scoring the second half of the term papers. The principal control to this threat is to arrange several data-collection or grading sessions to keep the scorer calm, fresh, and mentally acute while administering examinations or grading papers. By doing so, the decaying effect that leads to scoring differences can be minimized.

Instrumentation as a Threat to Research Designs

Two quasi-experimental designs, the time-series and the separate-sample pretest–posttest designs,
and one preexperimental design, the one-group pretest–posttest design, are vulnerable to the threat of instrumentation. The time-series design is an elaboration of a series of pretests and posttests. For reasons too numerous to include here, data collectors sometimes change their measuring instruments during the process of data collection. If this is the case, then instrumentation is introduced, and any main effect on the dependent variable can be misread by investigators as the treatment effect. Instrumentation can also be a potential threat to the separate-sample pretest–posttest design. Donald T. Campbell and Julian Stanley note that differences in attitudes and experiences of a single data collector could be confounded with the variable being measured. That is, when a data collector has administered a pretest, he or she would be more experienced at the posttest, and this difference might lead to variations in measurement. Finally, the instrumentation threat can be one of the obvious threats often realized in the one-group pretest–posttest design (one of the preexperimental designs). This is a result of the six uncontrolled threats to internal validity inherent in this design (i.e., history, maturation, testing, instrumentation, regression, and the interaction of selection and maturation). Therefore, with only one intervention and the pre- and posttest design, there is a greater chance of being negatively affected by data-collector characteristics, data-collector bias, and the decaying effect, which can produce confounded results. The effect of these biases is difficult to predict, control, or identify in any effort to separate actual treatment effects from the influence of these extraneous factors.

Final Note

In the fields of engineering and medical research, the term instrumentation is frequently used and refers to the development and employment of accurate measurement, analysis, and control. Of course, in the fields mentioned previously, instrumentation is also associated with the design, construction, and maintenance of actual instruments or measuring devices that are not proxy measures but the actual device or tool that can be manipulated per its designed function and purpose. Comparing the devices for measurement in engineering with those in the social sciences, the latter are much less precise. In other words, the fields of engineering use instrumentation to collect hard data or measurements of the real world, whereas research in the social sciences produces "soft" data that only measure perceptions of the real world. One reason that instrumentation is complicated is that many variables act independently as well as interact with each other.

Chia-Chien Hsu and Brian A. Sandford

See also Internal Validity; Reliability; Validity of Measurement

Further Readings

Alreck, P. L., & Settle, R. B. (1995). The survey research handbook: Guidelines and strategies for conducting a survey. New York: McGraw-Hill.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Fraenkel, J. R., & Wallen, N. E. (2000). How to design and evaluate research in education. Boston: McGraw-Hill.
Gay, L. R. (1992). Educational research: Competencies for analysis and application. Upper Saddle River, NJ: Prentice Hall.

INTERACTION

In most research contexts in the biopsychosocial sciences, researchers are interested in examining the influence of two or more predictor variables on an outcome. For example, researchers might be interested in examining the influence of stress levels and social support on anxiety among first-semester graduate students. In the current example, there are two predictor variables—stress levels and social support—and one outcome variable—anxiety. In its simplest form, a statistical interaction is present when the association between a predictor and an outcome varies significantly as a function of a second predictor. Given the current example, one might hypothesize that the association between stress and anxiety varies significantly as a function of social support. More specifically, one might hypothesize that there is no association between stress and anxiety among individuals reporting higher levels of social support while
… stress and social support in predicting anxiety. Hypothetical data consistent with this interaction are presented in Figure 1. The horizontal axis is labeled Stress, with higher values representing higher levels of stress. The vertical axis is labeled Anxiety, with higher values representing higher levels of anxiety. In figures such as these, one predictor (in this case, stress) is plotted along the horizontal axis, while the outcome (in this case, anxiety) is plotted along the vertical axis. The second predictor (in this case, social support) forms the lines in the plot. In Figure 1, the flat line is labeled High Social Support and represents the association between stress and anxiety for individuals reporting higher levels of social support. The other line is labeled Low Social Support and represents the association between stress and anxiety for individuals reporting lower levels of social support. In plots like Figure 1, as the lines depart from parallelism, a statistical interaction is suggested.

[Figure 1. The Interaction of Stress and Support in Predicting Anxiety. Stress (0 to 10) is plotted on the horizontal axis and Anxiety (0 to 5) on the vertical axis; the High Social Support line is flat, while the Low Social Support line rises with stress.]

Terminological Clarity

Researchers use many different terms to discuss statistical interactions. The crux issue in describing statistical interactions has to do with dependence. The original definition presented previously stated that a statistical interaction is present when the association between a predictor and an outcome varies significantly as a function of a second predictor. Another way of stating this is that the effects of one predictor on an outcome depend on the value of a second predictor. Figure 1 is a great illustration of such dependence. For individuals who report higher levels of social support, there is no association between stress and anxiety. For individuals who report lower levels of social support, there is a strong positive association between stress and anxiety. Consequently, the association between stress and anxiety depends on level of social support. Some other terms that researchers use to describe statistical interactions are (a) conditional on, (b) contingent on, (c) modified by, and/or (d) moderated by. Researchers might state that the effects of stress on anxiety are contingent on social support or that support moderates the stress–anxiety association. The term moderator is commonly used in various fields in the social sciences. Researchers interested in testing hypotheses involving moderation are interested in testing statistical interactions that involve the putative moderator and at least one other predictor. In plots like Figure 1, the moderator variable will often be used to form the lines in the plot, while the remaining predictor is typically plotted along the horizontal axis.

Statistical Models

Statistical interactions can be tested using many different analytical frameworks. For example, interactions can be tested using analysis of variance (ANOVA) models, multiple regression models, and/or logistic regression models—just to name a few. Next, we highlight two such modeling frameworks—ANOVA and multiple regression—although for simplicity, the focus is primarily on ANOVA.

Analysis of Variance

The ANOVA model is often used when predictor variables can be coded as finite categorical variables (e.g., with two or three categories) and the outcome is continuous. Social science researchers who conduct laboratory-based experiments often use the ANOVA framework to test research
612 Interaction
In the ANOVA framework, predictor variables are referred to as independent variables or factors, and the outcome is referred to as the dependent variable. The simplest ANOVA model that can be used to test a statistical interaction includes two factors—each of which has two categories (referred to as levels in the ANOVA vernacular). Next, an example is presented with hypothetical data that conform to this simple structure. This example assumes that 80 participants were randomly assigned to one of four conditions: (1) low stress and low support, (2) low stress and high support, (3) high stress and low support, and (4) high stress and high support. In this hypothetical study, one can assume that stress was manipulated by exposing participants to either a simple (low stress) or complex (high stress) cognitive task. One can assume also that a research confederate was used to provide either low or high levels of social support to the relevant study participant while she or he completed the cognitive task.

Table 1   Data From a Hypothetical Laboratory-Based Study Examining the Effects of Stress and Social Support on Anxiety

                        Support
                     Low    High    Marginal
Stress    Low         4       2        3
          High       10       6        8
Marginal              7       4

Main Effects and Simple Effects

A discussion of interactions in statistical texts usually involves the juxtaposition of two kinds of effects: main effects and simple effects (also referred to as simple main effects). Researchers examining main effects (in the absence of interactions) are interested in the unique independent effect of each of the predictors on the outcome. In these kinds of models—which are often referred to as additive effects models—the effect of each predictor on the outcome is constant across all levels of the remaining predictors. In sharp contrast, however, in examining models that include interactions, researchers are interested in exploring the possibility that the effects of one predictor on the outcome depend on another predictor. Next, this entry examines both main and interactive effects in the context of the hypothetical laboratory-based experiment.

In Table 1, hypothetical data from the laboratory-based study are presented. The numbers inside the body of the table are means (i.e., arithmetic averages) on the dependent variable (i.e., anxiety). Each mean is based on a distinct group of 20 participants who were exposed to a combination of the stress (e.g., low) and support (e.g., low) factors. For example, the number 10 in the body of the table is the mean anxiety score among individuals in the high-stress/low-support condition. Means are presented also in the margins of the table (i.e., the entries in the Marginal row and column). The marginal means are the means of the two relevant row or column entries. The main effect of a factor is examined by comparing the marginal means across the various levels of the factor. In the current example, we assume that any (nonzero) difference between the marginal means is equivalent to a main effect for the factor in question. Based on this assumption, the main effect of stress in Table 1 is significant—because there is a difference between the two stress marginal means of 3 (for the low-stress condition) and 8 (for the high-stress condition). As might be expected, on average individuals exposed to the high-stress condition reported higher levels of anxiety than did individuals exposed to the low-stress condition. Similarly, the main effect of support is also significant because there is a difference between the two support marginal means of 7 (for the low-support condition) and 4 (for the high-support condition).

In the presence of a statistical interaction, however, the researcher's attention turns away from the main effects of the factors and instead focuses on the simple effects of the factors. As described previously, in the data presented in Table 1 the main effect of support is quantified by comparing the means of individuals who received either low or high levels of social support. A close examination of Table 1 makes clear that scores contributing to the low-support mean derive from the following two different sources: (1) individuals who were exposed to a low-stress cognitive task and (2) individuals who were exposed to a high-stress cognitive task. The same is true of scores contributing to the high-support mean. In the current example, however, combining
data from individuals exposed to either lower or higher levels of stress does not seem prudent. In part, this is true because the mean anxiety score for individuals exposed to high support varies as a function of stress. In other words, when holding support constant at high levels, on average individuals exposed to the low-stress task report lower levels of anxiety (2) than do individuals exposed to the high-stress task (6). Even more important, the stress effect is more pronounced at lower levels of support because the average anxiety difference between the low-stress (4) and high-stress (10) conditions is larger. In other words, the data presented in Table 1 suggest that the effects of stress on anxiety depend on the level of support. Another way of saying this is that there is a statistical interaction between stress and support in predicting anxiety.

As noted previously, in exploring interactions, researchers focus on simple rather than main effects. Although the term was not used, two of the four simple effects yielded by the hypothetical study have already been discussed. When examining simple effects, the researcher contrasts the table means in one row or column of the table. In doing so, the researcher is examining the simple effects of one of the factors at a specific level (i.e., value) of the other factor.

Previously, when we observed that there was a difference in the mean anxiety scores for individuals who received a low level of social support under either the low-stress or high-stress conditions, we were discussing the simple effect of stress at low levels of support. In other words, comparing the mean of the low-support/low-stress group (4) to the mean of the low-support/high-stress group (10) is testing the simple effect of stress at low levels of support. If there is a (nonzero) difference between these two means, we will assume that the simple effect of stress at lower levels of support is statistically significant. Consequently, in the current case, this simple effect is significant (because 10 − 4 = 6). In examining the simple effects of stress at high levels of support, we would compare the high-support/low-stress mean (2) with the high-support/high-stress mean (6). We would conclude that the simple effect of stress at high levels of support is also significant (because 6 − 2 = 4).

The test of the interaction effect is quantified by examining whether the relevant simple effects are different from one another. If the two simple effects of stress—one at lower levels of support and the second at higher levels of support—are compared, it will be found that the interaction is significant (because 4 − 6 = −2). The fact that these two simple effects differ quantifies numerically the original conceptual definition of a statistical interaction, which stated that in its simplest form a statistical interaction is present when the association between a predictor and an outcome varies significantly as a function of a second predictor. In the current case, the association between stress and anxiety varies significantly as a function of support. At higher levels of support, the (simple) effect of stress is more muted, resulting in a mean anxiety difference of 4 between the low-stress and high-stress conditions.
At lower levels of support, however, the (simple) effect of stress is more magnified, resulting in a mean anxiety difference of 6 between the low-stress and high-stress conditions. Consequently, the answer to the question "What is the effect of stress on anxiety?" is "It depends on the level of support." (Table 2 provides a generic 2 × 2 table in which all of the various main and simple effects are explicitly quantified. The description of the Table 1 entries as well as the presentation of Table 2 should help the reader gain a better understanding of these various effects.)

Multiple Regression

In 1968, Jacob Cohen shared some of his insights with the scientific community in psychology regarding the generality and flexibility of the multiple regression approach to data analysis. Within this general analytical framework, the ANOVA model exists as a special case. In the more general multiple regression model, predictor variables can take on any form. Predictors might be unordered categorical variables (e.g., gender: male or female), ordered categorical variables (e.g., symptom severity: low, moderate, or high), and/or truly continuous variables (e.g., chronological age). Similarly, interactions between and/or among predictor variables can include these various mixtures (e.g., a categorical predictor by continuous predictor interaction or a continuous predictor by continuous predictor interaction).

Many of the same concepts described in the context of ANOVA have parallels in the regression framework. For example, in examining an interaction between a categorical variable (e.g., graduate student cohort: first year or second year) and a continuous variable (e.g., graduate school–related stress) in predicting anxiety, a researcher might examine the simple slopes that quantify the association between stress and anxiety for each of the two graduate school cohorts. In such a model, the test of the (cohort by stress) interaction is equivalent to testing whether these simple slopes are significantly different from one another.

In 1991, Leona S. Aiken and Stephen G. West published their seminal work on testing, interpreting, and graphically displaying interaction effects in the context of multiple regression. In part, the impetus for their work derived from researchers' lack of understanding of how to specify and interpret multiple regression models properly, including tests of interactions. In some of the more commonly used statistical software programs, ANOVA models are typically easier to estimate because the actual coding of the effects included in the analysis occurs "behind the scenes." In other words, if a software user requested a full factorial ANOVA model (i.e., one including all main effects and interactions) to analyze the data from the hypothetical laboratory-based study described previously, the software would create effect codes to specify the stress and support predictors and would also form the product of these codes to specify the interaction predictor. The typical user, however, is probably unaware of the coding that is used to create the displayed output. In specifying a multiple regression model to analyze these data, the user would not be spared the work of coding the various effects in the analysis. More important, the user would need to understand the implications of the various coding methods for the proper interpretation of the model estimates. This is one reason why the text by Aiken and West has received so much positive attention. The text outlines—through the use of illustrative examples—various methods used to test and interpret multiple regression models including tests of interaction effects. It also includes a detailed discussion of the proper interpretation of the various conditional (i.e., simple) effects that are components of larger interactions.

Additional Considerations

This discussion endeavored to provide a brief and nontechnical introduction to the concept of statistical interactions. To keep the discussion more accessible, equations for the various models described were not provided. Moreover, this discussion focused mostly on interactions that were relatively simple in structure (e.g., the laboratory-based example, which involved an interaction between two 2-level categorical predictors). Before concluding, however, this entry broaches some other important issues relevant to the discussion of statistical interactions.
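The arithmetic behind the worked example above (marginal means, simple effects, and the interaction contrast) can be sketched in a few lines of code. This is an illustrative sketch only; the entry itself deliberately omits formal notation, and the variable names below are ours. The values are the hypothetical cell means from Table 1.

```python
# Hypothetical cell means on anxiety from Table 1, keyed by (stress, support).
cell = {("low", "low"): 4.0, ("low", "high"): 2.0,
        ("high", "low"): 10.0, ("high", "high"): 6.0}

# Marginal means: average each factor level across the other factor
# (the design is balanced, with 20 participants per cell).
stress_low = (cell[("low", "low")] + cell[("low", "high")]) / 2      # 3
stress_high = (cell[("high", "low")] + cell[("high", "high")]) / 2   # 8
support_low = (cell[("low", "low")] + cell[("high", "low")]) / 2     # 7
support_high = (cell[("low", "high")] + cell[("high", "high")]) / 2  # 4

# Simple effects of stress at each level of support.
stress_at_low_support = cell[("high", "low")] - cell[("low", "low")]     # 10 - 4 = 6
stress_at_high_support = cell[("high", "high")] - cell[("low", "high")]  # 6 - 2 = 4

# The interaction contrast is the difference between the two simple effects;
# a nonzero value signals that the effect of stress depends on support.
interaction = stress_at_high_support - stress_at_low_support  # 4 - 6 = -2
```

The same contrast falls out of an effect-coded regression: with stress and support coded -1/+1 and their product entered as a third predictor, the product term's coefficient in this balanced design is interaction / 4 = -0.5.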
Interactions Can Include Many Variables

All the interactions described previously involve the interaction of two predictor variables. It is possible to test interactions involving three or more predictors as well (as long as the model can be properly identified and estimated). In the social sciences, however, researchers rarely test interactions involving more than three predictors. In testing more complex interactions the same core concepts apply—although they are generalized to include additional layers of complexity. For example, in a model involving a three-way interaction, seven effects comprise the full factorial model (i.e., three main effects; three 2-way interactions; and one 3-way interaction). If the three-way interaction is significant, it suggests that the simple two-way interactions vary significantly as a function of the third predictor.

When Is Testing an Interaction Appropriate?

This discussion thus far has focused nearly exclusively on understanding simple interactions from both conceptual and statistical perspectives. When it is appropriate to test statistical interactions has not been discussed. As one might imagine, the answer to this question depends on many factors. A couple of points are worthy of mention, however. First, some models require the testing of statistical interactions because the models assume that the predictors in question do not interact. For example, in the classic analysis of covariance (ANCOVA) model in which one predictor is treated as the predictor variable of primary theoretical interest and the other predictor is treated as a covariate (or statistical control variable), the model assumes that the primary predictor and covariate do not interact. As such, researchers employing such models should test the relevant (predictor by covariate) interaction as a means of assessing empirically this model assumption. Second, as is true in most empirical work in the sciences, theory should drive both the design of empirical investigations and the statistical analyses of the primary research hypotheses. Consequently, researchers can rely on the theory in a given area to help them make decisions about whether to hypothesize and test statistical interactions.

Christian DeLucia and Brandon Bergman

See also Analysis of Variance (ANOVA); Effect Coding; Factorial Design; Main Effects; Multiple Regression; Simple Main Effects

Further Readings

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426–443.
Cohen, J. (1978). Partialed products are interactions; Partialed powers are curve components. Psychological Bulletin, 85, 858–866.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

INTERNAL CONSISTENCY RELIABILITY

Internal consistency reliability estimates how much total test scores would vary if slightly different items were used. Researchers usually want to measure constructs rather than particular items. Therefore, they need to know whether the items have a large influence on test scores and research conclusions.

This entry begins with a discussion of classical reliability theory. Next, formulas for estimating internal consistency are presented, along with a discussion of the importance of internal consistency. Last, common misinterpretations and the interaction of all types of reliability are examined.

Classical Reliability Theory

To examine reliability, classical test score theory divides observed scores on a test into two components, true score and error:
X = T + E,

where X = observed score, T = true score, and E = error score.

If Steve's true score on a math test is 73 but he gets 71 on Tuesday because he is tired, then his observed score is 71, his true score is 73, and his error score is −2. On another day, his error score might be positive, so that he scores better than he usually would.

Each type of reliability defines true score and error differently. In test–retest reliability, true score is defined as whatever is consistent from one testing time to the next, and error is whatever varies from one testing time to the next. In interrater reliability, true score is defined as whatever is consistent from one rater to the next, and error is defined as whatever varies from one rater to the next. Similarly, in internal consistency reliability, true score is defined as whatever is consistent from one item to the next (or one set of items to the next set of items), and error is defined as whatever varies from one item to the next (or from one set of items to the next set of items that were designed to measure the same construct). To state this another way, true score is defined as the expected value (or long-term average) of the observed scores—the expected value over many times (for test–retest reliability), many raters (for interrater reliability), or many items (for internal consistency). The true score is the average, not the truth. The error score is defined as the amount by which a particular observed score differs from the average score for that person.

Researchers assess all types of reliability using the reliability coefficient. The reliability coefficient is defined as the ratio of true score variance to observed score variance:

ρXX′ = σ²T / σ²X,

where ρXX′ = the reliability coefficient, σ²T = the variance of true scores across participants, and σ²X = the variance of observed scores across participants.

Classical test score theory assumes that true scores and errors are uncorrelated. Therefore, observed variance on the test can be decomposed into true score variance and error variance:

σ²X = σ²T + σ²E,

where σ²E = the variance of error scores across participants.

The reliability coefficient can now be rewritten as follows:

ρXX′ = σ²T / σ²X = σ²T / (σ²T + σ²E).

Reliability coefficients vary from 0 to 1, with higher coefficients indicating higher reliability. This formula can be applied to each type of reliability. Thus, internal consistency reliability is the proportion of observed score variance that is caused by true differences between participants, where true differences are defined as differences that are consistent across the set of items. If the reliability coefficient is close to 1, then researchers would have obtained similar total scores if they had used different items to measure the same construct.

Estimates of Internal Consistency

Several different formulas have been proposed to estimate internal consistency reliability. Lee Cronbach, Cyril Hoyt, and Louis Guttman independently developed the most commonly used formula, which is labeled coefficient alpha after the terminology used by Cronbach. The split-half approach is also common. In this approach, the test is divided into two halves, which are then correlated. G. F. Kuder and M. W. Richardson developed KR-20 for use with dichotomous items (i.e., true/false items or items that are marked as correct or incorrect). KR-20 is easy to calculate by hand and has traditionally been used in classroom settings. Finally, Tenko Raykov and Patrick Shrout have recently proposed measuring internal consistency reliability using structural equation modeling approaches.

Importance of Internal Consistency

Internal consistency reliability is the easiest type of reliability to calculate. With test–retest reliability, the test must be administered twice. With interrater reliability, the test must be scored twice. But with internal consistency reliability, the test only
needs to be administered once. Because of this, internal consistency is the most commonly used type of reliability.

Internal consistency reliability is important when researchers want to ensure that they have included a sufficient number of items to capture the concept adequately. If the concept is narrow, then just a few items might be sufficient. For example, the International Personality Item Pool (IPIP) includes a 10-item measure of self-discipline that has a coefficient alpha of .85. If the concept is broader, then more items are needed. For example, the IPIP measure of conscientiousness includes 20 items and has a coefficient alpha of .88. Because conscientiousness is a broader concept than self-discipline, if the IPIP team measured conscientiousness with just 10 items, then the particular items that were included would have a substantial effect on the scores obtained, and this would be reflected in a lower internal consistency.

Second, internal consistency is important if a researcher administers different items to each participant. For example, an instructor might use a computer-administered test to assign different items randomly to each student who takes an examination. Under these circumstances, the instructor must ensure that students' course grades are mostly a result of real differences between the students, rather than which items they were assigned.

However, it is unusual to administer different items to each participant. Typically, researchers compare scores from participants who completed identical items. This is in sharp contrast with other forms of reliability. For example, participants are often tested at different times, both within a single study and across different studies. Similarly, participants across different studies are usually scored by different raters, and sometimes participants within a single study are scored by different raters. When differences between testing times or raters are confounded with differences between participants, researchers must consider the effect of this design limitation on their research conclusions. Because researchers typically only compare participants who have completed the same items, this limitation is usually not relevant to internal consistency reliability.

Thus, the internal consistency coefficient tells researchers how much total test scores would vary if different items were used. This question is theoretically important because it tells researchers whether they have covered the full breadth of the construct. But this question is usually not of practical interest, because researchers usually administer the same items to all participants.

Common Misinterpretations

Four misinterpretations of internal consistency are common. First, researchers often assume that if internal consistency is high, then other types of reliability are high. In fact, there is no necessary mathematical relationship between the variance caused by items, the variance caused by time, and the variance caused by raters. It might be that there is little variance caused by items but considerable variance caused by time and/or raters. Because each type of reliability defines true score and error score differently, there is no way to predict one type of reliability based on another.

Second, researchers sometimes assume that high internal consistency implies unidimensionality. This misinterpretation is reinforced by numerous textbooks that state that the internal consistency coefficient indicates whether all items measure the same construct. However, Neal Schmitt showed that a test can have high internal consistency even if it measures two or more unrelated constructs. This is possible because internal consistency reliability is influenced by both the relationships between the items and the number of items. If all items are related strongly to each other, then just a few items are sufficient to obtain high internal consistency. If items have weaker relationships or if some items have strong relationships and other items are unrelated, then high internal consistency can be obtained by having more items.

Researchers often want to know whether a set of items is unidimensional, because it is easier to interpret test scores if all items measure the same construct. Imagine that a test contains 10 vocabulary items and 10 math items. Jane scores 10 by answering the 10 vocabulary items correctly; John scores 10 by answering the 10 math items correctly; and Chris scores 10 by answering half of the vocabulary and half of the math items correctly. All three individuals obtain the same score, but these identical scores do not reflect similar abilities. To avoid this problem, researchers often
want to know whether test items are unidimensional. However, as stated previously, internal consistency does not imply unidimensionality.

To determine whether items measure a unitary construct, researchers can take one of two approaches. First, they can calculate the average interitem correlation. This correlation measures how closely the items are related to each other and is the most common measure of item homogeneity. However, the average interitem correlation might disguise differences between items. Perhaps some items have strong relationships with each other and other items are unrelated. Second, researchers can determine how many constructs underlie a set of items by conducting an exploratory factor analysis. If one construct underlies the items, the researcher can determine whether some items measure that construct better than others. If two or more constructs underlie the items, then the researcher can determine which items measure each construct and create homogeneous subscales to measure each. In summary, high internal consistency does not indicate that a test is unidimensional; instead, researchers should use exploratory factor analysis to determine dimensionality.

The third misinterpretation of internal consistency is that internal consistency is important for all tests. There are two exceptions. Internal consistency is irrelevant if test items are identical and trivially easy. For example, consider a speed test of manual dexterity. For each item, participants draw three dots within a circle. Participants who complete more items within the time limit receive higher scores. When items are identical and easy, as they are in this example, J. C. Nunnally and I. H. Bernstein showed that internal consistency will be very high and hence is not particularly informative. This conclusion makes sense conceptually: When items are identical, very little variance in total test scores is caused by differences in items. Researchers should instead focus on other types of reliability, such as test–retest.

Internal consistency is also irrelevant when the test is designed deliberately to contain heterogeneous content. For example, if a researcher wants to predict success in an area that relies on several different skills, then a test that assesses each of these skills might be useful. If these skills are independent of each other, the test might have low internal consistency. However, applicants who score high on the test possess all the necessary skills and might do well in the job. If few applicants score high on all items, the company might need a more detailed picture of the strengths and weaknesses of each applicant. In that case, the researcher could develop internally consistent subscales to measure each skill area, as described previously. In that case, internal consistency would be relevant to the subscales but would remain irrelevant to the total test scores.

Fourth, researchers often assume mistakenly that the formulas that are used to assess internal consistency—such as coefficient alpha—are only relevant to internal consistency. Usually these formulas are used to estimate the reliability of total (or average) scores on a set of k items, but these formulas can also be used to estimate the reliability of total scores from a set of k times or k raters—or any composite score. For example, if a researcher is interested in examining stable differences in emotion, participants could record their mood each day for a month. The researcher could average the mood scores across the 30 days for each participant. Coefficient alpha can be used to estimate how much of the observed differences between participants are caused by differences between days and how much is caused by stable differences between the participants. Alternatively, researchers could use coefficient alpha to examine raters. Job applicants could be rated by each manager in a company, and the average ratings could be calculated for each applicant. The researcher could use coefficient alpha to estimate the proportion of variance caused by true differences between the applicants—as opposed to the particular set of managers who provided ratings. Thus, coefficient alpha (and the other formulas discussed previously) can be used to estimate the reliability of any score that is calculated as the total or average of parallel measurements—whether those parallel measurements are obtained from different items, times, or raters.

Going Beyond Classical Test Score Theory

In classical test score theory, each source of variance is considered separately. Internal consistency reliability estimates the effect of test items. Test–retest reliability estimates the effect of time. Interrater reliability estimates the effect of rater.
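Coefficient alpha, described previously as the most commonly used estimate of internal consistency, can be computed directly from a person-by-measurement score matrix, and nothing in the formula cares whether the columns are items, days, or raters. The sketch below is illustrative only (the data and the function name are ours, not from this entry); it uses the common computational form alpha = k/(k − 1) × (1 − sum of measurement variances / variance of total scores).

```python
def cronbach_alpha(scores):
    """Coefficient alpha for a list of per-person score lists.

    Each inner list holds one person's scores on k parallel
    measurements (items, days, or raters). Population variances
    are used throughout for consistency.
    """
    k = len(scores[0])  # number of parallel measurements

    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(person) for person in scores]
    measurement_vars = [pvar([person[j] for person in scores])
                        for j in range(k)]
    return (k / (k - 1)) * (1 - sum(measurement_vars) / pvar(totals))

# Four persons answering three items (hypothetical data):
data = [[2, 3, 3],
        [4, 4, 5],
        [1, 2, 2],
        [3, 3, 4]]
print(round(cronbach_alpha(data), 2))  # prints 0.97
```

Because the function sees only a rectangular matrix, passing a persons-by-raters matrix instead estimates the consistency of the averaged ratings, which is the point made under the fourth misinterpretation above.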
provide a complete picture of the reliability of test Cronbach, L. J. (1951). Coefficient alpha and the internal
scores, the researcher must examine all types of structure of tests. Psychometrika, 16, 297–334.
reliability. Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963).
Even if a researcher examines every type of reli- Theory of generalizability: A liberalization of
reliability theory. British Journal of Statistical
ability, the results are incomplete and hard to inter-
Psychology, 16, 137–163.
pret. First, the reliability results are incomplete Kuder, G. F., & Richardson, M. W. (1937). The theory
because they do not consider the interaction of of the estimation of test reliability. Psychometrika, 2,
these factors. To what extent do ratings change 151–160.
over time? Do some raters score some items more Lord, F. M., & Novick, M. R. (1968). Statistical theories
harshly? Is the change in ratings over time consis- of mental test scores. Reading, MA: Addison-Wesley.
tent across items? Thus, classical test score theory Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric
does not take into account two-way and three-way theory (3rd ed.). New York: McGraw-Hill.
interactions between items, time, and raters. Sec- Raykov, R., & Shrout, P. E. (2002). Reliability of scales
with general structure: Point and interval estimation
ond, the reliability results are hard to interpret
with structural equation modeling approach.
because each coefficient is given separately. If inter-
Structural Equation Modeling: A Multidisciplinary
nal consistency is .91, test–retest reliability is .85, Journal, 9, 195–212.
and interrater reliability is .82, then what propor- Schmitt, N. (1996). Uses and abuses of coefficient alpha.
tion of observed score variance is a result of true Psychological Assessment, 8, 350–353.
differences between participants and what propor-
tion is a result of these three sources of random
Websites
error?
To address these issues, researchers can use International Personality Item Pool: http://ipip.ori.org
more sophisticated mathematical models, which
are based on a multifactor repeated measures anal-
ysis of variance (ANOVA). First, researchers can
conduct a study to examine the influence of all INTERNAL VALIDITY
these factors on test scores. Second, researchers
can calculate generalizability coefficients to take Internal validity refers to the accuracy of state-
into account the number of items, times, and raters ments made about the causal relationship between
that will be used when collecting data to make two variables, namely, the manipulated (treatment
decisions in an applied context. or independent) variable and the measured vari-
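As an illustration, the simplest one-facet case of this approach (persons crossed with raters) can be sketched with invented ratings: variance components are estimated from a two-way ANOVA, and a generalizability coefficient is computed for the average of k' raters. The data and the choice of k' = 3 are hypothetical, not taken from this entry.

```python
# One-facet G study sketch (persons x raters); ratings are invented.
scores = [  # rows = persons, columns = raters
    [7, 8, 7],
    [5, 5, 6],
    [9, 9, 8],
    [4, 5, 4],
    [6, 7, 7],
]
n, k = len(scores), len(scores[0])
grand = sum(sum(row) for row in scores) / (n * k)
person_means = [sum(row) / k for row in scores]
rater_means = [sum(scores[p][r] for p in range(n)) / n for r in range(k)]

# Two-way ANOVA sums of squares (no replication)
ss_p = k * sum((m - grand) ** 2 for m in person_means)
ss_r = n * sum((m - grand) ** 2 for m in rater_means)
ss_tot = sum((scores[p][r] - grand) ** 2 for p in range(n) for r in range(k))
ss_res = ss_tot - ss_p - ss_r

ms_p = ss_p / (n - 1)
ms_r = ss_r / (k - 1)
ms_res = ss_res / ((n - 1) * (k - 1))

# Expected-mean-square estimates of the variance components
var_p = (ms_p - ms_res) / k    # persons (true-score variance)
var_r = (ms_r - ms_res) / n    # raters (severity)
var_res = ms_res               # interaction + error

# Generalizability coefficient for the average of k' raters
k_prime = 3
g_coef = var_p / (var_p + var_res / k_prime)
print(round(g_coef, 3))
```

Increasing k_prime shows how adding raters raises the coefficient, which is the sense in which generalizability coefficients "take into account the number of items, times, and raters" for a planned use of the scores.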
Kimberly A. Barchard

See also Classical Test Theory; Coefficient Alpha; Exploratory Factor Analysis; Generalizability Theory; Interrater Reliability; Intraclass Correlation; KR-20; Reliability; Structural Equation Modeling; Test–Retest Reliability

Further Readings

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation with a structural equation modeling approach. Structural Equation Modeling: A Multidisciplinary Journal, 9, 195–212.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.

Websites

International Personality Item Pool: http://ipip.ori.org

INTERNAL VALIDITY

Internal validity refers to the accuracy of statements made about the causal relationship between two variables, namely, the manipulated (treatment or independent) variable and the measured (dependent) variable. Internal validity claims are based not on the labels a researcher attaches to variables or how they are described but, rather, on the procedures and operations used to conduct a research study, including the choice of design and measurement of variables. Consequently, internal validity is relevant to the topic of research methods. In the next three sections, the procedures that support causal inferences are introduced, the threats to internal validity are outlined, and methods to follow to increase the internal validity of a research investigation are described.

Causal Relationships Between Variables

When two variables are correlated or found to covary, it is reasonable to ask whether there is a direction in the relationship. Determining whether there is a causal relationship
between the variables is often done by knowing the time sequence of the variables; that is, whether one variable occurred first followed by the other variable. In randomized experiments, where participants are randomly assigned to treatment conditions or groups, knowledge of the time sequence is often straightforward because the treatment (independent) variable is manipulated before the measurement of the outcome (dependent) variable. Even in quasi-experiments, where participants are not randomly assigned to treatment groups, the investigator can usually relate some of the change in pre–post test measures to group membership. However, in observational studies, where variables are not being manipulated, the time sequence is difficult, if not impossible, to disentangle.

One might think that knowing the time sequence of variables is often sufficient for ascertaining internal validity. Unfortunately, time sequence is not the only important aspect to consider. Internal validity is also largely about ensuring that the causal relationship between two variables is direct and not mitigated by a third variable. A third, uncontrolled, variable can function to make the relationship between the two other variables appear stronger or weaker than it is in real life. For example, imagine that an investigator decides to investigate the relationship between class size (treatment variable) and academic achievement (outcome variable). The investigator recruits school classes that are considered large (with more than 20 students) and classes that are considered small (with fewer than 20 students). The investigator then collects information on students' academic achievement at the end of the year to determine whether a student's achievement depends on whether he or she is in a large or small class. Unbeknownst to the investigator, however, students who are selected to small classes are those who have had behavioral problems in the previous year. In contrast, students assigned to large classes are those who have not had behavioral problems in the previous year. In other words, class size is related negatively to behavioral problems. Consequently, students assigned to smaller classes will be more disruptive during classroom instruction and will plausibly learn less than those assigned to larger classes. In the course of data analysis, if the investigator were to discover a significant relationship between class size and academic achievement, one could argue that this relationship is not direct. The investigator's discovery is a false positive finding. The relationship between class size and academic achievement is not direct because the students associated with classes of different sizes are not equivalent on a key variable—attentive behavior. Thus, it might not be that larger class sizes have a positive influence on academic achievement but, rather, that larger classes have a selection of students that, without behavioral problems, can attend to classroom instruction.

The third variable can threaten the internal validity of studies by leading to false positive findings or false negative findings (i.e., not finding a relationship between variables A and B because of the presence of a third variable, C, that is diminishing the relationship between variables A and B). There are many situations that can give rise to the presence of uncontrolled third variables in research studies. In the next section, threats to internal validity are outlined. Although each threat is discussed in isolation, it is important to note that many of these threats can simultaneously undermine the internal validity of a research study and the accuracy of inferences about the causality of the variables involved.

Threats to Internal Validity

1. History

An event (e.g., a new video game), which is not the treatment variable of interest, becomes accessible to the treatment group but not the comparison group during the pre- and posttest time interval. This event influences the observed effect (i.e., the outcome, dependent variable). Consequently, the observed effect cannot be attributed exclusively to the treatment variable (thus threatening internal validity claims).

2. Maturation

Participants develop or grow in meaningful ways during the course of the treatment (between the pretest and posttest). The developmental change in participants influences the observed effect, and so now the observed effect cannot be solely attributed to the treatment variable.
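The class-size example discussed earlier can be illustrated with a small simulation. All numbers below are invented; the point is only that a third variable (behavioral problems) can manufacture an apparent class-size effect that disappears once that variable is controlled.

```python
# Hypothetical simulation of a third-variable confound.
import random
random.seed(1)

students = []
for _ in range(2000):
    problem = random.random() < 0.5  # behavioral problems last year?
    # Problem students are placed in small classes; others mostly in large ones.
    size = "small" if problem or random.random() < 0.2 else "large"
    # Achievement depends only on behavior, not on class size.
    achievement = random.gauss(70 if problem else 80, 5)
    students.append((size, problem, achievement))

def mean(values):
    return sum(values) / len(values)

large = [a for s, p, a in students if s == "large"]
small = [a for s, p, a in students if s == "small"]
naive_gap = mean(large) - mean(small)  # spurious "class-size effect"

# Conditioning on the third variable removes the apparent effect.
within_gaps = []
for flag in (True, False):
    lg = [a for s, p, a in students if s == "large" and p == flag]
    sm = [a for s, p, a in students if s == "small" and p == flag]
    if lg and sm:
        within_gaps.append(mean(lg) - mean(sm))

print(round(naive_gap, 1))             # sizeable gap favoring large classes
print([round(g, 1) for g in within_gaps])  # near zero within behavior strata
```

The naive comparison shows large classes outperforming small ones by several points, yet within each behavior stratum the gap is essentially zero, which is exactly the false positive finding the entry describes.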
develop a new curriculum. In other words, when the treatment is considered desirable, there might be administrative pressure to compensate the control group, thereby undermining the observed effect of the treatment.

11. Rivalry Between Treatment Conditions

Similar to the threat described in number 10, threat number 11 functions to nullify differences between treatment groups and, thus, an observed effect. In this case, when participation in a treatment versus control group is made public, control participants might work extra hard to outperform the treatment group. Had participants not been made aware of their group membership, an observed effect might have been found.

12. Demoralization of Participants Receiving Less Desirable Treatments

This last threat is similar to the one described in number 11. In this case, however, when treatment participation is made public and the treatment is highly desirable, control participants might feel resentful and disengage from the study's objective. This could lead to large differences in the outcome variable between the treatment and control groups. However, the observed outcome might have little to do with the treatment and more to do with participant demoralization in the control group.

Establishing Internal Validity

Determining whether there is a causal relationship between variables A and B requires that the variables covary, that one variable precede the other (e.g., A → B), and that the presence of a third variable, C, which might mitigate the influence of A on B, be ruled out. One powerful way to enhance internal validity is to randomly assign sample participants to treatment groups or conditions. By randomly assigning, the investigator can guarantee the probabilistic equivalence of the treatment groups before the treatment variable is administered. That is, any participant biases are equally distributed in the two groups. If the sample participants cannot be randomly assigned, and the investigator must work with intact groups, which is often the case in field research, steps must be taken to ensure that the groups are equivalent on key variables. For example, if the groups are equivalent, one would expect both groups to score similarly on the pretest measure. Furthermore, one would inquire about the background characteristics of the students—Are there equal distributions of boys and girls in the groups? Do they come from comparable socioeconomic backgrounds? Even if the treatment groups are comparable, efforts should be taken not to publicize the nature of the treatment one group is receiving relative to the control group so as to avoid threats to internal validity involving diffusion of treatment information, compensatory equalization of treatments, rivalry between groups, and demoralization of participants who perceive themselves to be receiving the less desirable treatment. Internal validity checks are ultimately designed to bolster confidence in the claims made about the causal relationship between variables; as such, internal validity is concerned with the integrity of the design of a study for supporting such claims.

Jacqueline P. Leighton

See also Cause-and-Effect; Control Variables; Quasi-Experimental Design; Random Assignment; True Experimental Design

Further Readings

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Keppel, G., & Wickens, T. D. (2002). Design and analysis: A researcher's handbook (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). Boston: McGraw-Hill.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

INTERNET-BASED RESEARCH METHOD

Internet-based research method refers to any research method that uses the Internet to collect
by John Krantz, Jody Ballard, and Jody Scher, as well as by Ulf Reips. One main experiment performed by Krantz, Ballard, and Scher was a within-subject experiment examining preferences for different weighted female drawings. The experiment was a 3 × 9 design with independent variables of weight and shoulder-to-hip proportion. The experiment was conducted both under traditional laboratory conditions and on the Web. The participants gave ratings of preference for each figure using a magnitude estimation procedure. The use of the within-subject methodology and a magnitude estimation procedure allowed for a detailed comparison of the results found in the laboratory with the results found on the Web. First, the results were highly correlated between laboratory and Web. In addition, a regression analysis was performed on the two data sets to determine whether the data do more than move in the same direction. This regression found that the Web values are nearly identical to the values for the same condition in the laboratory; that is, the Web data essentially can be replaced by the laboratory data and vice versa. This similarity was found despite the vast difference in the ways the experiment was delivered (e.g., in the laboratory, the participants were tested in groups, and on the Web, presumably most participants ran the study singly) and differences in age range (in the laboratory, all participants were traditional college students, whereas a much greater age range was observed in the Web version).

Krantz and Reshad Dalal performed a literature review of the Web studies conducted up to the time this entry was written. The point of the review was to determine whether the Web results could be considered, at least in a general sense, valid. They examined both e-mail and Web-based research methods and compared their results with laboratory method results. In general, they found that most Web studies tended to find weaker results than in the control of the laboratory. However, the results seemed valid, and even in cases where the data differed from the laboratory or field, the differences were intriguing. One study performed an e-mail version of Stanley Milgram's lost letter technique. In Milgram's study, the letters that were mailed were sent to the originally intended destination. However, the e-mails that were sent were returned to the original sender. In both cases, the easiest path to being helpful was predominantly taken. So the two studies complement each other.

Practical Issues

Next, several practical issues that can influence the data quality obtained from Web-based studies are discussed. When a Web-based study is read, it is important to understand how the experimenter handled the following factors.

Recruitment

Just placing a study on the Web is not usually sufficient to get an adequately large sample to analyze. It is typical to advertise the study. There are several methods of study advertising that are used. One way is to advertise on sites that list psychological studies. The two largest are listed at the end of this entry. These sites are also well known to people interested in participating in psychological research, which makes them a useful means for participants to find research studies. These sites also come up at the top of searches for psychological experiments and related terms in search engines. Another common method is to solicit participants from discussion groups or e-mail listservs. Because these groups tend to be formed to discuss common issues, this method allows access to subpopulations that might be of interest. With the advent of social networking on the Web, social networking sites such as Facebook have also been used to recruit participants. Finally, traditional media such as radio and television can be used to recruit participants to Internet studies. It has occurred that some network news programs have found an experiment related to a show they were running and posted a link to that study on the website associated with the show. It should be noted that the Web is not a monolithic entity. Different methods of recruitment will lead to different samples. Depending on the sample needs of the study, it is often advisable to use multiple types of recruitment methods.

Sample Characteristics

One enticing feature of the Web as a research environment is the ability to obtain more diverse
samples. Samples are more diverse on the Web than the comparable samples in the laboratory. However, that is not to say that the samples are truly representative. Web use is not evenly distributed across all population segments. It is probably wise to consider the Web population in a manner similar to the early days of the telephone, which, when used for sampling without attention to the population that had telephones, led to some classic mistaken conclusions in political polls.

Dropout

One of the big concerns in Web-based research is the ease with which participants can leave the study. In the laboratory, it is rare for a participant to up and leave the experiment. Participants in Web-based research regularly do not complete a study, leaving the researcher with several incomplete data sets. Incomplete data can make up to 40% of a data set in some studies. There are two main concerns regarding dropout. First, if the dropout is not random but is selective in some sense, it can limit the generalizability of the results. Second, if the conditions in an experiment differ in a way that causes differential dropout across conditions, this fact can introduce a confound into the experiment. This factor must be examined in evaluating the conditions. Information about the length of the study and the careful use of incentives can reduce dropout. In addition, it is possible in experiments that use many pages to measure dropout and use it as a variable in the data analysis.
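A check for differential dropout across conditions can be sketched as follows. The completion counts are hypothetical, and the 2 × 2 chi-square statistic is one simple way to screen for the confound; the entry itself does not prescribe a specific test.

```python
# Hypothetical completion counts per condition, used to screen for
# differential dropout (a potential confound in Web experiments).
started = {"treatment": 200, "control": 200}
finished = {"treatment": 175, "control": 130}

# 2 x 2 chi-square test of completion status versus condition
a = finished["treatment"]
b = started["treatment"] - finished["treatment"]   # treatment dropouts
c = finished["control"]
d = started["control"] - finished["control"]       # control dropouts
n = a + b + c + d

chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi_sq, 2))
# Compare with the .05 critical value for 1 df (3.84):
print("differential dropout" if chi_sq > 3.84 else "comparable dropout")
```

With these invented counts the statistic far exceeds the critical value, so the two conditions lose participants at different rates and the comparison between them is suspect.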
Technical Variance

One of the big differences between the laboratory and the Web is the loss of control over the equipment used by the participant. In the laboratory, it is typical to have the participants all use the same computer or same type of computer, to control environmental conditions, and to control any other factor that might influence the outcome of the study. On the Web, such control is not possible. Variations include the type of computer being used, the way the person is connected to the network, the type of browser, the size of browser window the participant prefers, the version of the browser, and even what other programs might be running in the background. All of these factors, and others, add potential sources of error to the data collected over the Web. The term technical variance has been applied to this source of variation in experimental conditions. Many of these variations, such as browser type and version, can be collected during the experiment, allowing some assessment of the influence of these technical variations on the data. However, although rare, it is possible for users to hide or alter these values, such as altering what browser is reported as being used.

Ethical Issues

There have been some major discussions of the ethics of Web-based research. In particular, the lack of contact between the experimenter and participant means that it is more difficult to ensure that the nature of the study is understood. In addition, there is no way to be sure that any participant is debriefed, rendering the use of deception particularly problematic on the Web. However, on the positive side, participants do feel very free to leave the study at any time, meaning that it can be more clearly assumed that the sample is truly voluntary, free of the social constraints that keep many participants in laboratory experiments when they wish to leave.

Future Directions

As Web research becomes more widely accepted, the main future direction will be Web-based research examining new topic areas that are not possible to study in the laboratory. To date, most Web-based studies have been replications and extensions of existing research studies. Another development will be the greater use of media in experiments. Most studies to date have been principally surveys with maybe a few images used as stimuli. The variations in monitors and lighting have rendered the use of any images, beyond the simplest, problematic. The development of better and more controlled methods of delivering images and video will allow a wider range of studies to be explored over the Web.

John H. Krantz
See also Bias; Confounding; Ethics in the Research Process; Experimental Design; Sampling

Further Readings

Birnbaum, M. H. (2000). Psychological experiments on the Internet. San Diego, CA: Academic Press.
Krantz, J. H., Ballard, J., & Scher, J. (1997). Comparing the results of laboratory and world-wide Web samples on the determinants of female attractiveness. Behavior Research Methods, Instruments, & Computers, 29, 264–269.
Krantz, J. H., & Dalal, R. (2000). Validity of Web-based psychological research. In M. H. Birnbaum (Ed.), Psychological experiments on the Internet (pp. 35–60). San Diego, CA: Academic Press.
Musch, J., & Reips, U.-D. (2000). A brief history of Web experimenting. In M. H. Birnbaum (Ed.), Psychological experiments on the Internet (pp. 61–88). San Diego, CA: Academic Press.
Reips, U.-D. (2002). Standards for Internet-based experimenting. Experimental Psychology, 49, 243–256.

Websites

Psychological Research on the Net: http://psych.hanover.edu/research/exponnet.html
Web Experiment List: http://genpsylab-wexlist.unizh.ch

INTERRATER RELIABILITY

The use of raters or observers as a method of measurement is prevalent in various disciplines and professions (e.g., psychology, education, anthropology, and marketing). For example, in psychotherapy research, raters might categorize verbal (e.g., a paraphrase) and/or nonverbal (e.g., a head nod) behavior in a counseling session. In education, three different raters might need to score an essay response for advanced placement tests. This type of reliability is also present in other facets of modern society. For example, medical diagnoses often require a second or even third opinion from physicians. Competitions, such as Olympic figure skating, award medals based on quantitative ratings provided by a panel of judges.

Data recorded on a rating scale are based on the subjective judgment of the rater. Thus, the generality of a set of ratings is always of concern. Generality is important in showing that the obtained ratings are not the idiosyncratic results of one person's subjective judgment. Procedural questions include the following: How many raters are needed to be confident in the results? What is the minimum level of agreement that the raters need to achieve? Is it necessary for the raters to agree exactly, or is it acceptable for them to differ from one another as long as the differences are systematic? Are the data nominal, ordinal, or interval? What resources are available to conduct the interrater reliability study (e.g., time, money, and technical expertise)?

Interrater or interobserver (these terms can be used interchangeably) reliability is used to assess the degree to which different raters or observers make consistent estimates of the same phenomenon. Another term for an interrater or interobserver reliability estimate is a consistency estimate. That is, it is not necessary for raters to share a common interpretation of the rating scale, as long as each judge is consistent in classifying the phenomenon according to his or her own viewpoint of the scale. Interrater reliability estimates are typically reported as correlational or analysis of variance indices. Thus, the interrater reliability index represents the degree to which ratings of different judges are proportional when expressed as deviations from their means. This is not the same as interrater agreement (also known as a consensus estimate of reliability), which represents the extent to which judges make exactly the same decisions about the rated subject. When judgments are made on a numerical scale, interrater agreement generally means that the raters assigned exactly the same score when rating the same person, behavior, or object. However, the researcher might decide to define agreement as identical ratings, as ratings that differ by no more than one point, or as ratings that differ by no more than two points (if the interest is in judgment similarity). Thus, agreement does not have to be defined as an all-or-none phenomenon. If the researcher does decide to include a discrepancy of one or two points in the definition of agreement, the chi-square value for identical agreement should also be reported. It is possible to have high interrater reliability but low interrater agreement and vice versa. The researcher must determine which form of rater reliability is most important for the particular study.
Whenever rating scales are being employed, it is important to pay special attention to the interrater or interobserver reliability and interrater agreement of the ratings. It is essential that both the reliability and agreement of the ratings are provided before the ratings are accepted. In reporting the interrater reliability and agreement of the ratings, the researcher must describe the way in which the index was calculated.

The remainder of this entry focuses on calculating interrater reliability and choosing an appropriate approach for determining interrater reliability.

Calculations of Interrater Reliability

For nominal data (i.e., simple classification), at least two raters are used to generate the categorical score for many participants. For example, a contingency table is drawn up to tabulate the degree of agreement between the raters. Suppose 100 observations are rated by two raters and each rater checks one of three categories. If the two raters checked the same category in 87 of the 100 observations, the percentage of agreement would be 87%. The percentage of agreement gives a rough estimate of reliability, and it is the most popular method of computing a consensus estimate of interrater reliability. The calculation is also easily done by hand. Although it is a crude measure, it does work no matter how many categories are used in each observation. An adequate level of agreement is generally considered to be 70%. However, a better estimate of reliability can be obtained by using Cohen's kappa, which ranges from 0 to 1 and represents the proportion of agreement corrected for chance:

κ = (ρa − ρc) / (1 − ρc),

where ρa is the proportion of times the raters agree and ρc is the proportion of agreement we would expect by chance. This formula is recommended when the same two judges perform the ratings. For Cohen's kappa, .50 is considered acceptable. If subjects are rated by different judges but the number of judges rating each observation is held constant, then Fleiss' kappa is preferred.
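The two consensus estimates just described, percent agreement and Cohen's kappa, can be sketched with invented ratings from two raters using three categories:

```python
# Percent agreement and Cohen's kappa for two raters, three categories.
# The ratings are hypothetical (not the entry's 87% example).
from collections import Counter

rater1 = ["A", "A", "B", "B", "C", "A", "B", "C", "C", "A"]
rater2 = ["A", "B", "B", "B", "C", "A", "A", "C", "C", "A"]
n = len(rater1)

# Observed agreement: proportion of identical classifications
p_a = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / n

# Chance agreement from the two raters' marginal proportions
m1, m2 = Counter(rater1), Counter(rater2)
p_c = sum((m1[cat] / n) * (m2[cat] / n) for cat in set(rater1) | set(rater2))

kappa = (p_a - p_c) / (1 - p_c)
print(round(p_a, 2), round(kappa, 2))
```

Here the raw agreement is .80, but correcting for the agreement expected by chance under these marginals lowers the chance-corrected estimate, which is the purpose of kappa.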
Consistency estimates of interrater reliability are based on the assumption that it is not necessary for the judges to share a common interpretation of the rating scale, as long as each rater is consistent in assigning a score to the phenomenon. Consistency estimation is most used with continuous data. Values of .70 or better are generally considered to be adequate. The three most common types of consistency estimates are (1) correlation coefficients (e.g., Pearson and Spearman), (2) Cronbach's alpha, and (3) the intraclass correlation.

The Pearson product-moment correlation coefficient is the most widely used statistic for calculating the degree of consistency between independent raters. Values approaching +1 or −1 indicate that the raters are following a consistent pattern, whereas values close to zero indicate that it would be almost impossible to predict the rating of one judge given the rating of the other judge. An acceptable level of reliability using a Pearson correlation is .70. Pearson correlations can only be calculated for one pair of judges at a time and for one item at a time. The Pearson correlation assumes the underlying data are normally distributed. If the data are not normally distributed, the Spearman rank coefficient should be used. For example, if two judges rate responses to an essay item from best to worst, then a ranking and the Spearman rank coefficient should be used.

If more than two raters are used, Cronbach's alpha coefficient could be used to compute interrater reliability. An acceptable level for Cronbach's alpha is .70. If the coefficient is lower than .70, this means that most of the variance in the total composite score is a result of error variance and not true score variance.

The best measure of interrater reliability available for ordinal and interval data is the intraclass correlation (R). It is the most conservative measure of interrater reliability. R can be interpreted as the proportion of the total variance in the ratings caused by variance in the persons or phenomena being rated. Values approaching the upper limit of R (1.00) indicate a high degree of reliability, whereas an R of 0 indicates a complete lack of reliability. Although negative values of R are possible, they are rarely observed; when they are observed, they imply judge × item interactions. The more R departs from 1.00, the less reliable are the judge's ratings. The minimal acceptable level of R is considered to be .60. There is more than one formula available for the intraclass correlation. To select the appropriate formula, the investigator must decide
(a) whether the mean differences in the ratings of the judges should be considered rater error and (b) whether he or she is more concerned with the reliability of the average rating of all the judges or the average reliability of the individual judge.
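As a sketch of the second decision, a consistency-type intraclass correlation can be computed from a two-way ANOVA on a persons × judges table, once for a single judge and once for the average rating of all judges. The ratings below are invented, and this is one common formula among the several the entry alludes to.

```python
# Consistency-type intraclass correlations from a persons x judges table.
# Ratings are hypothetical; mean differences among judges are ignored here.
ratings = [  # rows = persons, columns = judges
    [6, 7, 6],
    [3, 4, 4],
    [8, 8, 7],
    [5, 6, 5],
    [2, 3, 2],
]
n, k = len(ratings), len(ratings[0])
grand = sum(map(sum, ratings)) / (n * k)

ss_p = k * sum((sum(row) / k - grand) ** 2 for row in ratings)
ss_j = n * sum((sum(r[j] for r in ratings) / n - grand) ** 2 for j in range(k))
ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)

ms_p = ss_p / (n - 1)
ms_res = (ss_tot - ss_p - ss_j) / ((n - 1) * (k - 1))

icc_single = (ms_p - ms_res) / (ms_p + (k - 1) * ms_res)  # one judge
icc_average = (ms_p - ms_res) / ms_p                      # mean of k judges
print(round(icc_single, 2), round(icc_average, 2))
```

The average-rating coefficient is always at least as large as the single-judge coefficient, which is why the investigator must decide which quantity the study actually needs before choosing a formula.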
The two most popular types of measurement estimates of interrater reliability are (a) factor analysis and (b) the many-facets Rasch model. The primary assumption of the measurement estimates is that all the information available from all the judges (including discrepant ratings) should be used when calculating a summary score for each respondent. Factor analysis is used to determine the amount of shared variance in the ratings. The minimal acceptable level is generally 70% of the variance explained. Once interrater reliability is established, each subject receives a summary score based on his or her loading on the first principal component underlying the ratings. Using the many-facets Rasch model, the ratings between judges can be empirically determined. Also, the difficulty of each item, as well as the severity of all judges who rated each item, can be directly compared. In addition, the facets approach can determine to what degree each judge is internally consistent in his or her ratings (i.e., an estimate of intrarater reliability). For the many-facets Rasch model, the acceptable rater values are greater than .70 and less than 1.3.

Choosing an Approach

There is no ''best'' approach for calculating interrater or interobserver reliability. Each approach has its own assumptions and implications as well as its own strengths and weaknesses. The percentage of agreement approach is affected by chance. Low prevalence of the condition of interest will affect kappa, and correlations will be affected by low variability (i.e., attenuation) and distribution shape (normal or skewed). Agreement estimates of interrater reliability (percent agreement, Cohen's kappa, Fleiss' kappa) are generally easy to compute and will indicate rater disparities. However, training raters to come to an exact consensus will require considerable time and might or might not be necessary for the particular study.

Consistency estimates of interrater reliability (e.g., Pearson correlation, Cronbach's alpha, and intraclass correlation coefficients) are also fairly simple to compute. The greatest disadvantage to using these statistical techniques is that they are sensitive to the distribution of the data. The more the data depart from a normal distribution, the more attenuated the results.

The measurement estimates of interrater reliability (e.g., factor analysis and many-facets Rasch measurement) can work with multiple judges, can adjust summary scores for rater severity, and can allow for efficient designs (e.g., not all raters have to judge each item or object). However, the measurement estimates of interrater reliability require expertise and considerable calculation time.

Therefore, as noted previously, the best technique will depend on the goals of the study, the nature of the data (e.g., degree of normality), and the resources available. The investigator might also improve reliability estimates with additional training of raters.

Karen D. Multon

See also Cohen's Kappa; Correlation; Instrumentation; Intraclass Correlation; Pearson Product-Moment Correlation Coefficient; Reliability; Spearman Rank Order Correlation

Further Readings

Bock, R., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26, 364–375.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: MESA Press.
Snow, A. L., Cook, K. F., Lin, P. S., Morgan, R. O., & Magaziner, J. (2005). Proxies and other external raters: Methodological considerations. Health Services Research, 40, 1676–1693.
Stemler, S. E., & Tsai, J. (2008). Best practices in interrater reliability: Three common approaches. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 29–49). Thousand Oaks, CA: Sage.
Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater
(e.g., Pearson product-moment and Spearman rank reliability and agreement of subjective judgements.
correlations, Cronbach’s alpha coefficient, and Journal of Counseling Psychology, 22, 358–376.
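The agreement estimates discussed in this entry, percent agreement and Cohen's kappa, are straightforward to compute. The following sketch uses invented nominal ratings from two hypothetical judges on 10 objects (plain Python; the data are illustrative only, not from any study cited here):

```python
from collections import Counter

# Hypothetical nominal ratings of 10 objects by two judges
# (invented data for illustration only).
judge_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
judge_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

n = len(judge_a)

# Percentage of agreement: the proportion of identical ratings.
p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Cohen's kappa corrects observed agreement for chance agreement,
# which is estimated from each judge's marginal rating proportions.
marg_a, marg_b = Counter(judge_a), Counter(judge_b)
p_e = sum(marg_a[c] * marg_b[c] for c in set(judge_a) | set(judge_b)) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)

print(p_o, round(kappa, 3))  # 0.8 0.583
```

Here the judges agree on 8 of 10 objects (p_o = .80), but because both judges rate "yes" frequently, chance agreement is .52, and kappa drops to about .58, which illustrates how kappa discounts agreement expected by chance alone.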
Interval Scale

Temperature scales, including the Fahrenheit and Celsius temperature scales, are examples of an interval scale. For example, in the Fahrenheit temperature scale, the difference between 25° and 30° is the same as the difference between 80° and 85°. In the Celsius temperature scale, the distance between 16° and 18° is the same as that between 78° and 80°.

However, 60°F is not twice as hot as 30°F. Similarly, –40°C is not twice as cold as –20°C. This is because both the Fahrenheit and Celsius temperature scales do not have a "true zero" point. The zero points in the Fahrenheit and Celsius temperature scales are arbitrary—in both scales, 0° does not mean the lack of heat or cold.

In contrast, the Kelvin temperature scale is based on a "true zero" point. The zero point of the Kelvin temperature scale, which is equivalent to –459.67°F or –273.15°C, is considered the lowest possible temperature of anything in the universe. In the Kelvin temperature scale, 400 K is twice as hot as 200 K, and 100 K is twice as cold as 200 K. The Kelvin temperature scale is thus not an example of an interval scale but of a ratio scale.

One example of interval scale measurement that is widely used in social science is the Likert scale. In experimental research, particularly in the social sciences, there are measurements to capture attitudes, perceptions, positions, feelings, thoughts, or points of view of research participants. Research participants are given questions, and they are expected to express their responses by choosing the one of five or seven rank-ordered response choices that is closest to their attitudes, perceptions, positions, feelings, thoughts, or points of view.

An example of a Likert scale that uses a 5-point scale is as follows:

How satisfied are you with the neighborhood where you live?

• Very satisfied
• Somewhat satisfied
• Neither satisfied nor dissatisfied
• Somewhat dissatisfied
• Very dissatisfied

Some researchers argue that such responses are not interval scales because the distance between
attributes are not equal. For example, the difference between very satisfied and somewhat satisfied might not be the same as that between neither satisfied nor dissatisfied and somewhat dissatisfied.

Each attribute in the Likert scale is given a number. For the previous example, very satisfied is 5, somewhat satisfied is 4, neither satisfied nor dissatisfied is 3, somewhat dissatisfied is 2, and very dissatisfied is 1. The greater number represents a higher degree of satisfaction of respondents with their neighborhood. Because of such numbering, there is now equal distance between attributes. For example, the difference between very satisfied (5) and somewhat satisfied (4) is the same as the difference between neither satisfied nor dissatisfied (3) and somewhat dissatisfied (2). However, the Likert scale does not have a "true zero" point, as shown in the previous example, so statements about the ratio of attributes in the Likert scale cannot be made.

Semantic Differential Scale

Another interval scale measurement is the semantic differential scale. Research respondents are given questions and also semantic differential scales, usually 7-point or 5-point response scales, as their response choices. Research respondents are expected to choose the one response out of the 7 or 5 on the semantic differential scale that is closest to their condition or perception.

An example of a semantic differential scale that uses a 7-point scale is as follows:

How would you rate the quality of the neighborhood where you live?

Excellent 7 6 5 4 3 2 1 Poor

Research respondents who rate the quality of their neighborhood as excellent should choose "7," and those who rate the quality of their neighborhood as poor should choose "1." Similar to the Likert scale, there is equal distance between attributes in the semantic differential scale, but there is no "true zero" point.

Deden Rukmana

See also Ordinal Scale; Ratio Scale

Further Readings

Babbie, E. (2007). The practice of social research (11th ed.). Belmont, CA: Thomson Wadsworth.
Dillman, D. A. (2007). Mail and internet surveys: The tailored design method (2nd ed.). New York: Wiley.
Keyton, J. (2006). Communication research: Asking questions, finding answers (2nd ed.). Boston: McGraw-Hill.
University Corporation for Atmospheric Research. (2001). Windows to the universe: Kelvin scale. Retrieved December 14, 2008, from http://www.windows.ucar.edu/cgi-bin/tour_def/earth/Atmosphere/temperature/kelvin.html

INTERVENTION

Intervention research examines the effects of an intervention on an outcome of interest. The primary purpose of intervention research is to engender a desirable outcome for individuals in need (e.g., reduce depressive symptoms or strengthen reading skills). As such, intervention research might be thought of as differing from prevention research, where the goal is to prevent a negative outcome from occurring, or even from classic laboratory experimentation, where the goal is often to support specific tenets of theoretical paradigms. Assessment of an intervention's effects, the sine qua non of intervention research, varies according to study design, but typically involves both statistical and logical inferences.

The hypothetical intervention study presented next is used to illustrate important features of intervention research. Assume a researcher wants to examine the effects of parent training (i.e., intervention) on disruptive behaviors (i.e., outcome) among preschool-aged children. Of 40 families seeking treatment at a university-based clinic, 20 families were randomly assigned to an intervention condition (i.e., parent training) and the remaining families were assigned to a (wait-list) control condition. Assume the intervention was composed of six 2-hour weekly therapy sessions with the parent(s) to strengthen theoretically identified parenting practices (e.g., effective discipline strategies) believed to reduce child disruptive behaviors. Whereas parents assigned to the intervention condition attended sessions, parents assigned to the
control condition received no formal intervention. In the most basic form of this intervention design, data from individuals in both groups are collected at a single baseline (i.e., preintervention) assessment and at one follow-up (i.e., postintervention) assessment.

Assessing the Intervention's Effect

In the parenting practices example, the first step in assessing the intervention's effect involves testing for a statistical association between intervention group membership (intervention vs. control) and the identified outcome (e.g., reduction in temper tantrum frequency). This is accomplished by using an appropriate inferential statistical procedure (e.g., an independent-samples t test) coupled with an effect size estimate (e.g., Cohen's d) to provide pertinent information regarding both the statistical significance and strength (i.e., the amount of benefit) of the intervention–outcome association.

Having established an intervention–outcome association, researchers typically wish to ascertain whether this association is causal in nature (i.e., that the intervention, not some other factor, caused the observed group difference). This more formidable endeavor of establishing an "intervention–outcome" causal connection is known to social science researchers as establishing a study's internal validity—the most venerable domain of the renowned Campbellian validity typology. Intervention studies considered to have high internal validity have no (identified) plausible alternative explanations (i.e., internal validity threats) for the intervention–outcome association. As such, the most parsimonious explanation for the results is that the intervention caused the outcome.

Random Assignment in Intervention Research

The reason random assignment is a much-heralded design feature is its role in reducing the number of alternative explanations for the intervention–outcome association. In randomized experiments involving a no-treatment control, the control condition provides incredibly important information regarding what would have happened to the intervention participants had they not been exposed to the intervention. Because random assignment precludes systematic pretest group differences (as the groups are probabilistically equated on all measured and unmeasured characteristics), it is unlikely that some other factor resulted in postintervention group differences. It is worth noting that this protection conveyed by random assignment can be undone once the study commences (e.g., by differential attrition or participant loss). It is also worth noting that quasi-experiments, or intervention studies that lack random assignment to condition, are more vulnerable to internal validity threats. Thoughtful design and analysis of quasi-experiments typically involve identifying several plausible internal validity threats a priori and incorporating a mixture of design and statistical controls that attempt to rule out (or render implausible) the influence of these threats.

Other Things to Consider

Thus far, this discussion has focused nearly exclusively on determining whether the intervention worked. In addition, intervention researchers often examine whether certain subgroups of participants benefited more from exposure to the intervention than did other subgroups. In the parenting example, one might find that parents with a single child respond more favorably to the intervention than do parents with multiple children. Identifying this subgroup difference might aid researchers in modifying the intervention to make it more effective for parents with multiple children. This additional variable (in this case, the subgroup variable) is referred to as an intervention moderator. The effects of intervention moderators can be examined by testing statistical interactions between intervention group membership and the identified moderator.

Intervention researchers should also examine the processes through which the intervention produced changes in the outcome. Examining these process issues typically requires the researcher to construct a conceptual roadmap of the intervention's effects. In other words, the researcher must specify the paths followed by the intervention in affecting the outcomes. These putative paths are referred to as intervention mediators. In the parenting example, these paths might be (a) better understanding of child behavior, (b) using more effective discipline practices, or (c) increased levels of parenting self-efficacy. Through statistical mediation analysis, researchers can test empirically […]
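The first analytic step described under Assessing the Intervention's Effect, an independent-samples t test coupled with Cohen's d, can be sketched as follows. The scores are invented for illustration (lower = fewer disruptive behaviors) and are not data from the hypothetical study itself:

```python
import math
from statistics import mean, stdev

# Hypothetical postintervention disruptive-behavior scores
# (invented for illustration; lower = fewer disruptive behaviors).
intervention = [4, 5, 3, 6, 4, 5, 2, 4, 3, 5]
control = [7, 6, 8, 5, 7, 9, 6, 7, 8, 6]

n1, n2 = len(intervention), len(control)
m1, m2 = mean(intervention), mean(control)
s1, s2 = stdev(intervention), stdev(control)  # sample SDs (n - 1)

# Pooled standard deviation, then the independent-samples t statistic
# (equal variances assumed) and Cohen's d (standardized mean difference).
sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
d = (m1 - m2) / sp

print(round(t, 2), round(d, 2))  # -5.23 -2.34
```

With n1 + n2 − 2 = 18 degrees of freedom, a t of this magnitude is statistically significant at conventional levels; a d of about 2.3 standard deviations would be an unusually large effect, an artifact of the invented numbers rather than a typical intervention result.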
See also External Validity; Internal Validity; Quasi-Experimental Designs; Threats to Validity; Treatment(s)

Further Readings

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Lawrence Erlbaum.
Reynolds, K., & West, S. G. (1988). A multiplist strategy for strengthening nonequivalent control group designs. Evaluation Review, 11, 691–714.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

INTERVIEWING

Interviewing is an important aspect of many types of research. It involves conducting an interview—a purposeful conversation—between two people (the interviewer and the interviewee) to collect data on some particular issue. The person asking the questions is the interviewer, whereas the person providing the answers is the interviewee (i.e., respondent). Interviewing is used in both quantitative and qualitative research and spans a wide continuum of forms, moving from totally structured to totally unstructured. It can use a range of techniques, including face-to-face (in-person), telephone, videophone, and e-mail. Interviewing involves several steps, namely, determining the interviewees, preparing for the interview, and conducting the interview.

Physical attributes such as age, race, gender, and voice, as well as attitudinal attributes such as friendliness, professionalism, optimism, persuasiveness, and confidence, are important attributes that should be borne in mind when selecting interviewers. Even when questions are well written, the success of face-to-face and telephone surveys is still very much dependent on the interviewer. Interviews are conducted to obtain information. However, information can only be obtained if respondents feel sufficiently comfortable in an interviewer's presence. Good interviewers have excellent social skills, show a genuine interest in getting to know their respondents, and recognize that they need to be flexible in accommodating respondents' schedules.

Research shows that interviewer characteristics can definitely affect both item response and response quality and might even affect a respondent's decision to participate in an interview. It might, therefore, be desirable in many cases to match interviewers and interviewees in an effort to solicit respondents' cooperation, especially for interviews that deal with sensitive topics (e.g., racial discrimination, inequality, health behavior, or domestic abuse) or threatening topics (e.g., illegal activities). For example, women interviewers should be used to interview domestically abused women. Matching might also be desirable in some cultures (e.g., older males to interview older males) or for certain types of groups (e.g., minority interviewers for minority groups). Additionally, matching might help to combat normative responses (i.e., responding in a socially desirable way) and might encourage respondents to speak in a more candid manner. Matching can be done on several characteristics, namely, race, age, ethnicity, and sex.

Interviewer Training

When a research study is large and involves the use of many interviewers, it will require proper training, administration, coordination, and control. The purpose of interviewer training is to ensure that interviewers have the requisite skills
that are essential for the collection of high-quality, reliable, and valid data. The length of training will be highly dependent on the mode of survey execution, as well as the interviewers' experience. The International Standards Association, for example, recommends a minimum of 6 hours of training for new telephone interviewers involved in market, opinion, and social research.

Prior to entering the field, an interviewer training session should be conducted with all involved interviewers. At this session, interviewers should be given a crash course on basic research issues (e.g., importance of random sampling, reliability and validity, and interviewer-related effects). They should also be briefed on the study objectives and the general guidelines/procedures/protocol that should be followed for data collection. If a structured questionnaire is being used to collect data, it is important that the group go through the entire questionnaire, question by question, to ensure that every interviewer clearly understands the questionnaire. This should be followed by one or more demonstrations to illustrate the complete interview process. Complications and difficulties encountered during the demonstrations, along with recommendations for coping with the problems, should be discussed subsequent to the demonstration. Detailed discussion should take place on how to use probes effectively and how to quickly change "tone" if required. A pilot study should be conducted after training to identify any additional problems or issues.

Preparing for the Interview

Prior to a face-to-face interview, the interviewer should either telephone or send an official letter to the interviewee to confirm the scheduled time, date, and place for the interview. One of the most popular venues for face-to-face interviews is respondents' homes; however, other venues can also be used (e.g., coffee shops, parking lots, or grocery stores). When sensitive topics are being discussed, a more private venue is desirable so that the respondent can talk candidly. Interviewers should also ensure that they are thoroughly acquainted with the questionnaire and guidelines for the interview. This will help to ensure that the interview progresses smoothly and does not deviate substantially from the estimated time required for the interview.

The Interview

At the beginning of the interview, the interviewer should greet the respondent in a friendly manner, identify and introduce himself or herself, and thank the respondent for taking time to facilitate the interview. If a face-to-face interview is being conducted, the interviewer should also present the interviewee with an official letter from the institution sponsoring the research, which outlines the legitimacy of the research and other salient issues such as the interviewer's credentials. In telephone and face-to-face interviews where contact was not established in advance (e.g., a national survey), the interviewer has to try to elicit the cooperation of the potential respondent and request permission to conduct the interview. In unscheduled face-to-face interviews, many potential respondents might refuse to permit an interview for one or more of the following reasons: busy, simply not interested, language barrier, and safety concerns. In telephone interviews, the respondent might simply hang up the telephone with or without giving an excuse.

After introductions, the interviewer should then brief the respondent on the purpose of the study, explain how the study sample was selected, explain what will be done with the data, explain how the data will be reported (i.e., aggregated statistics—no personal information), and, finally, assure the respondent of anonymity and confidentiality. The interviewer should also give the respondent an idea of the estimated time required for the interview and should apprise the interviewee of his or her rights during the interview process (e.g., the right to refuse to answer a question if the respondent is uncomfortable with the question). If payment of any kind is to be offered, this should also be explained to the respondent.

An interviewer should try to establish good rapport with the interviewee to gain the interviewee's confidence and trust. This is particularly important in a qualitative interview. Establishing good rapport is, however, highly dependent on the interviewer's demeanor and social skills. Throughout the interview, the interviewer should try to make the conversational exchange a comfortable and
pleasant experience for the interviewee. Pleasantries and icebreakers can set the tone for the interview. At the same time, interviewers should be detached and neutral, and should refrain from offering any personal opinions. During the interview, interviewers should use a level of vocabulary that is easily understood by the respondent and should be careful about using certain gestures (this concern is applicable only to face-to-face interviews) and words because they might be considered offensive in some cultures and ethnic groups. In addition, interviewers should maintain a relaxed stance (body language communicates information) and a pleasant and friendly disposition; however, these are applicable only to face-to-face interviews. At all times, interviewers should listen attentively to the respondent and should communicate this to the respondent via paraphrases, probes, nods, and well-placed "uh-huhs" or "umms." An interviewer should not interrupt a respondent's silence that might occur because of thoughtful reflection or during an embarrassing conversation. Rather, he or she should give the respondent sufficient time to resume the conversation on his or her own, or important data might be lost. Additionally, when interviewers are dealing with sensitive issues, they should show some empathy with the respondent. When face-to-face interviews are being conducted, they should be done without an audience, if possible, to avoid distractions.

To conduct the interview, the interviewer will have to adopt a certain interview style (e.g., unstructured, semistructured, or structured). This is determined by the research goal and is explained during the training session. The style adopted has implications for the amount of control that the interviewer can exercise over people's responses. In qualitative research, interviews rely on what is referred to as an interview guide. An interview guide is a relatively unstructured list of general topics to be covered. Such guides permit great flexibility. In contrast, in quantitative research, an interview schedule is used. An interview schedule is a structured list of questions with explicit instructions. Interview schedules are standardized.

It is critically important that the interviewer follow the question wording for each question exactly to ensure consistency across interviews and to minimize the possibility of interviewer bias. Additionally, it is important that open-ended questions be recorded verbatim to minimize errors that could result from inaccurate summation. Verbatim responses will also permit more accurate coding. Throughout the interview, the interviewer should try to ensure that note taking is as unobtrusive as possible. Audio recordings should be used to back up handwritten notes if the respondent has no objection. However, it might be necessary at times to switch off the machine if the respondent seems reluctant to discuss a sensitive topic. Audio recordings offer several advantages, namely, they can verify the accuracy of handwritten notes and can be used to help interviewers to improve their interviewing techniques.

If respondents give incomplete or ambiguous answers, the interviewer should use tactful probes to elicit a more complete answer (e.g., "Anything else?" "In what ways?" "How?" "Can you elaborate a little more?"). Probes must never be used to coerce or lead a respondent; rather, they should be neutral, unbiased, and nondirective. Probes are more common with open-ended questions. However, they can also be used with closed-ended questions. For example, in a closed-ended question with a Likert scale, a respondent might give a response that cannot be classified on the scale. The interviewer could then ask: "Do you strongly agree or strongly disagree?" There are several types of probes that can be used, namely, the silent probe (remaining silent until the respondent continues), the echo probe (repeating the last sentence and asking the respondent to continue), the "uh-huh" probe (encouraging the respondent to continue), the tell-me-more probe (asking a question to get better insight), and the long question probe (making your question longer to get more detailed information).

At the conclusion of the interview, the interviewer should summarize the important points to the respondent, allow the respondent sufficient time to refine or clarify any points, reassure the respondent that the information will remain confidential, and thank the respondent for his or her time. Closure should be conducted in a courteous manner that does not convey abruptness to the interviewee. The respondent should be given the interviewer's contact information. An official follow-up thank-you letter should also be sent within 2 weeks. Immediately after the interview or as soon as possible thereafter, the interviewer
should update his or her recorded notes. This is particularly important when some form of shorthand notation is used to record notes.

Interview Debriefing

Interview debriefing is important for obtaining feedback on the interview process. Debriefing can be held either in person or via telephone. The debriefing process generally involves asking all interviewers to fill out a questionnaire composed of both open-ended and closed-ended questions. A group meeting is subsequently held to discuss the group experiences. The debriefing session provides valuable insight on problematic issues that require correction before the next survey administration.

Interviewer Monitoring and Supervision

To ensure quality control, interviewers should be supervised and monitored throughout the study. Effective monitoring helps to ensure that unforeseen problems are handled promptly, acts as a deterrent to interview falsification, and assists with reducing interviewer-related measurement error. Good monitoring focuses on four main areas: operational execution, interview quality, interviewer falsification, and survey design. In general, different types of monitoring are required for different interview techniques. For example, with face-to-face interviews, interviewers might be required to report to the principal investigator after the execution of every 25 interviews, to turn in their data and discuss any special problems encountered. In the case of telephone interviews, the monitoring process is generally simplified because interviews are recorded electronically, and supervisors also have an opportunity to listen to the actual interviews as they are being conducted. This permits quick feedback to the entire group on specific problems associated with issues such as (a) voice quality (e.g., enunciation, pace, and volume) and (b) adherence to interview protocol (e.g., reading verbatim scripts, using probes effectively, and maintaining neutrality).

Types of Interviewing Techniques

Prior to the 1960s, paper-and-pencil (i.e., face-to-face) interviewing was the predominant type of interviewing technique. However, by the 1960s, telephone interviewing started to gain popularity. This was followed by computer-assisted telephone interviewing (CATI) in the 1970s, and computer-assisted personal interviewing (CAPI) and computer-assisted self-interviewing (CASI) in the 1980s. In CATI, an automated computer randomly dials a telephone number. All prompts for introduction and the interview questions are displayed on a computer screen. Once the respondent agrees to participate, the interviewer records the answers directly onto the computer. CAPI and CASI are quite similar to CATI but are used in face-to-face interviews. However, although CAPI is performed by the interviewer, with CASI, respondents either can be allowed to type all the survey responses onto the computer or can type the responses to sensitive questions and allow the interviewer to complete all other questions. Computer-assisted interviewing offers several advantages, including faster recording and elimination of bulky storage; however, these systems can be quite expensive to set up, and data can be lost if the system crashes and the data were not backed up. Other modern-day interviewing techniques include videophone interviews, which closely resemble a face-to-face interview, except that the interviewer is remotely located, and e-mail interviews, which allow respondents to complete the interview at their convenience.

Advantages and Disadvantages of Face-to-Face and Telephone Interviews

The administration of a questionnaire by an interviewer has several advantages compared with administration by a respondent. First of all, interviewer-administered surveys have a much higher response rate than self-administered surveys. The response rate for face-to-face interviews is approximately 80% to 85%, whereas for telephone interviews, it is approximately 60%. This might be largely attributable to the normal dynamics of human behavior. Many people generally feel embarrassed in being discourteous to an interviewer who is standing on their doorstep or is on the phone; however, they generally do not feel guilty about throwing out a mail survey as soon as it is received. Second, interviewing might help to
reduce "do not know" responses because the interviewer can probe to get a more specific answer. Third, an interviewer can clarify confusing questions. Finally, when face-to-face interviews are conducted, the interviewer can obtain other useful information, such as the quality of the dwelling (if conducted in the respondent's home), respondent's race, and respondent reactions.

Notwithstanding, interviewer-administered surveys also have several disadvantages, namely, (a) respondents have to give real-time answers, which means that their responses might not be as accurate; (b) interviewers must have good social skills to gain respondents' cooperation and trust; (c) improper administration and interviewer characteristics can lead to interviewer-related effects, which can result in measurement error; and (d) the cost of administration is considerably higher (particularly for face-to-face interviews) compared with self-administered surveys.

Cost Considerations

The different interviewing techniques that can be used in research all have different cost implications. Face-to-face interviews are undoubtedly the most expensive of all techniques because this procedure requires more interviewers (the ratio of face-to-face to telephone is approximately 4:1), more interview time per interview (approximately 1 hour), more detailed training of interviewers, and greater supervision and coordination. Transportation costs are also incurred with this technique. Telephone interviews are considerably cheaper—generally about half the cost. With this procedure, coordination and supervision are much easier—interviewers are generally all located in one room, printing costs are reduced, and sampling selection cost is less because samples can be selected using random-digit dialing. These cost reductions greatly outweigh the cost associated with telephone calls.

Interviewer-Related Errors

The manner in which interviews are administered, as well as an interviewer's characteristics, can often affect respondents' answers, which can lead to measurement error. Such errors are problematic, particularly if they are systematic, that is, when an interviewer makes similar mistakes across many interviews. Interviewer-related errors can be decreased through carefully worded questions, interviewer–respondent matching, proper training, continuous supervision or monitoring, and prompt ongoing feedback.

Nadini Persaud

See also Debriefing; Ethnography; Planning Research; Protocol; Qualitative Research; Survey; Systematic Error

Further Readings

Babbie, E. (2005). The basics of social research (3rd ed.). Belmont, CA: Thomson/Wadsworth.
Erlandson, D. A., Harris, E. L., Skipper, B. L., & Allen, S. D. (1993). Doing naturalistic inquiry: A guide to methods. Newbury Park, CA: Sage.
Ruane, J. M. (2005). Essentials of research methods: A guide to social sciences research. Oxford, UK: Blackwell Publishing.
Schutt, R. K. (2001). Investigating the social world: The process and practice of research (3rd ed.). Thousand Oaks, CA: Pine Forge.
Weisberg, H. F., Krosnick, J. A., & Bowen, B. D. (1996). An introduction to survey research, polling, and data analysis (3rd ed.). Thousand Oaks, CA: Sage.

INTRACLASS CORRELATION

The words intraclass correlation (ICC) refer to a set of coefficients representing the relationship between variables of the same class. Variables of
Despite the significantly higher costs of face-to- the same class share a common metric and vari-
face interviews, this method might still be pre- ance, which generally means that they measure the
ferred for some types of research because response same thing. Examples include twin studies and
rates are generally higher and the quality of the two or more raters evaluating the same targets.
information obtained might be of a substantially ICCs are used frequently to assess the reliability of
higher quality compared with a telephone inter- raters. The Pearson correlation coefficient usually
view, which is quite impersonal. relates measures of different classes, such as height
Intraclass Correlation 637
and weight or stress and depression, and is an an ICC can be obtained for agreement. Within
interclass correlation. a couple, partner is a fixed variable—someone’s
Most articles on ICC focus on the computation partner can not be randomly select. Finally, there
of different ICCs and their tests and confidence is no question about averaging across partners,
limits. This entry focuses more on the uses of sev- so the reliability of an average is not relevant.
eral different ICCs. (In fact, ‘‘reliability’’ is not really the intent.)
The different ICCs can be distinguished along Table 2 gives the expected mean squares for
several dimensions: a one-way analysis of variance. The partner effect
can not be estimated separately from random
• One-way or two-way designs error.
• Consistency of order of rankings by different If each member of a couple had nearly the same
judges, or agreement on the levels of the score, there would be little within-couple variance,
behavior being rated and most of the variance in the experiment would
• Judges as a fixed variable or as a random be a result of differences between couples. If mem-
variable bers of a dyad differed considerably, the within-
• The reliability of individual ratings versus the
couple variance would be large and predominate.
reliability of mean ratings over several judges
A measure of the degree of relationship represents
the proportion (ρÞ of the variance that is between
One-Way Model couple variance. Therefore,
Although most ICCs involve two or more judges
σ 2C
rating n objects, the one-way models are differ- ρICC ¼ :
ent. A theorist hypothesizing that twins or gay σ 2C þ σ 2e
partners share roughly the same level of sociabil-
ity would obtain sociability data on both mem- The appropriate estimate for ρICC, using the
bers of 15 gay couples from a basic sociability obtained mean squares (MS), would be
index. A Pearson correlation coefficient is not
appropriate for these data because the data are MScouple MSw=in
exchangeable within couples—there is no logical rICC ¼ :
MScouple þ ðk 1ÞMSw=in
reason to identify one person as the first member
of the couple and the other as the second. The
For this sample data, the analysis of variance
design is best viewed as a one-way analysis of
summary table is shown in Table 3.
variance with ‘‘couple’’ as the independent vari-
able and the two measurements within each cou-
ple as the observations. Possible data are Table 2 Expected Mean Squares for One-Way Design
presented in Table 1. With respect to the dimen- Source df E(MS)
sions outlined previously, this is a one-way
Between couple n 1 kσ 2C þ σ 2e
design. Partners within a couple are exchange-
Within couple n(k 1) σ 2e
able, and thus a partner effect would have no
Partner error k —
meaning. Because there is no partners effect, an
(n 1)(k 1) —
ICC for consistency cannot be obtained, but only
Table 3  Summary Table for One-Way Design
Source            df    Sum Sq     Mean Sq    F value
Between couple    14    2304.87    164.63     8.74
Within couple     15    282.50     18.83
Total             29    2587.37

ICC_1 = \frac{MS_{couple} - MS_{w/in}}{MS_{couple} + (k - 1)\,MS_{w/in}} = \frac{164.63 - 18.83}{164.63 + 18.83} = \frac{145.80}{183.46} = .795.

A test of the null hypothesis that ρ_ICC = 0 can be taken directly from the F for couples, which is 8.74 on (n − 1) = 14 and n(k − 1) = 15 degrees of freedom (df).

This F can then be used to create confidence limits on ρ_ICC by defining

F_L = F_{obs}/F_{.975} = 8.74/2.891 = 3.023
F_U = F_{obs} \times F_{.975} = 8.74 \times 2.949 = 25.774.

For F_L, the critical value is taken at α = .975 for (n − 1) and n(k − 1) degrees of freedom, but for F_U, the degrees of freedom are reversed to obtain the critical value at α = .975 for n(k − 1) and (n − 1).

The confidence interval is now given by

\frac{F_L - 1}{F_L + (k - 1)} \le \rho \le \frac{F_U - 1}{F_U + (k - 1)}

\frac{3.023 - 1}{3.023 + 1} \le \rho \le \frac{25.774 - 1}{25.774 + 1}

.503 \le \rho \le .925.

Not only are members of the same couple similar in sociability, but the ICC is large given the nature of the dependent variable.

Two-Way Models

The previous example pertains primarily to the situation with two (or more) exchangeable measurements of each class. Two-way models usually involve different raters rating the same targets, and it might make sense to take rater variance into account.

A generic set of data can be used to illustrate the different forms of ICC. Suppose that four raters rate the compatibility of 15 married couples based on observations of a session in which couples are asked to come to a decision over a question of importance to both of them. Sample data are shown in Table 4.

Table 4  Ratings of Compatibility for 15 Couples by Four Raters
                Raters
Couples    A    B    C    D
1         15   18   15   18
2         22   25   20   26
3         18   15   10   23
4         10    7   12   18
5         25   22   20   30
6         23   28   21   30
7         30   25   20   27
8         19   21   14   26
9         10   12   14   16
10        16   19   15   12
11        14   18   11   19
12        23   28   25   22
13        29   21   23   32
14        18   18   12   17
15        17   23   14   15

Factors to Consider Before Computation

Mixed Versus Random Models

As indicated earlier, there are several decisions to make before computing an ICC. The first is whether raters are a fixed or a random variable. Raters would be a fixed variable if they are the graduate assistants who are being trained to rate couple compatibility in a subsequent experiment. These are the only raters of concern. (The model would be a mixed model because we always assume that the targets of the ratings are sampled randomly.) Raters would be a random variable if they have been drawn at random to assess whether a rating scale we have developed can be used reliably by subsequent users. Although the interpretation of the resulting ICC will differ for mixed and random models (one can only generalize to subsequent raters if the raters one uses are sampled at random), the calculated value will not be affected by this distinction.
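Before turning to the two-way analysis, the one-way estimate and its F-based limits reduce to a few lines of arithmetic. A minimal sketch in Python, using the Table 3 values; the two F critical values are the printed ones, which in practice would come from an F table or statistical software:

```python
# One-way intraclass correlation and F-based confidence limits,
# using the mean squares printed in Table 3 (k = 2 scores per couple).
ms_couple, ms_within = 164.63, 18.83
k = 2

icc1 = (ms_couple - ms_within) / (ms_couple + (k - 1) * ms_within)

f_obs = 8.74          # F for couples on (n - 1) = 14 and n(k - 1) = 15 df
f_crit_14_15 = 2.891  # F.975 with (14, 15) df, from an F table
f_crit_15_14 = 2.949  # F.975 with (15, 14) df (degrees of freedom reversed)

f_l = f_obs / f_crit_14_15
f_u = f_obs * f_crit_15_14

lower = (f_l - 1) / (f_l + (k - 1))
upper = (f_u - 1) / (f_u + (k - 1))

print(round(icc1, 3), round(lower, 3), round(upper, 3))  # 0.795 0.503 0.925
```

The same arithmetic reproduces the entry's .795 estimate and the .503 to .925 interval.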
The analysis of variance for the data in Table 4 is shown in Table 5.

Table 5  Analysis of Variance Summary Table for Two-Way Design
Source          df    Sum Sq      Mean Sq    F value
Row (Couple)    14    1373.233    98.088     10.225
Rater            3    274.850     91.617     9.551
Error           42    402.900     9.593
Total           59    2050.983

With either a random or mixed-effects model, the reliability of ratings in a two-way model for consistency is defined as

ICC_{C,1} = \frac{MS_{row} - MS_{error}}{MS_{row} + (k - 1)\,MS_{error}}.

The notation ICC_{A,1} represents a measure of agreement for the reliability of individual ratings. With consistency, individual raters could use different anchor points, and rater differences were not involved in the computation. With agreement, rater differences matter, which is why MS_rater appears in the denominator for ICC_{A,1}. Using the results presented in Table 5 gives

ICC_{A,1} = \frac{MS_{row} - MS_{error}}{MS_{row} + (k - 1)\,MS_{error} + \frac{k}{n}(MS_{rater} - MS_{error})} = \frac{98.088 - 9.593}{98.088 + (4 - 1)9.593 + \frac{4}{15}(91.617 - 9.593)} = .595.

The test on H₀: ρ = 0 is given by the F for rows in the summary table and is again 10.225, which is significant on (n − 1) and (n − 1)(k − 1) df. However, the calculation of confidence limits in this situation is complex, and the reader is referred to McGraw and Wong (1996) for the formulas involved.

If interest is, instead, the reliability of mean ratings, then

ICC_{A,4} = \frac{MS_{row} - MS_{error}}{MS_{row} + \frac{MS_{rater} - MS_{error}}{n}}.

The F for ρ = 0 is still the F for rows, with F_L and F_U defined as for the one-way. The confidence limits on ρ become

CI_L = 1 - \frac{1}{F_L} = 1 - \frac{1}{4.656} = .785

and

CI_U = 1 - \frac{1}{F_U} = 1 - \frac{1}{27.280} = .963.

It is important to think here about the implications of a measure of consistency. In this situation, ICC(C,1) is reasonably high. It would be unchanged (at .698) if a rater E was created by subtracting 15 points from rater D and then substituting rater E for rater D. If one were to take any of these judges (or perhaps other judges chosen at random) to a high school diving competition, their rankings should more or less agree. Then, the winners would be the same regardless of the judge. But suppose that one of these judges, either rater D or rater E, was sent to that same high school but asked to rate the reading ability of each student and make a judgment of whether that school met state standards in reading. Even though each judge would have roughly the same ranking of children, using rater E instead of rater D could make a major difference in whether the school met state standards for reading. Consistency is not enough in this case, whereas it
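The two-way coefficients can be computed the same way from the Table 5 mean squares. A minimal sketch (variable names are illustrative, not from the entry):

```python
# Two-way ICCs from the Table 5 mean squares (n = 15 couples, k = 4 raters).
ms_row, ms_rater, ms_error = 98.088, 91.617, 9.593
n, k = 15, 4

# Consistency, single rater: rater mean differences are ignored.
icc_c1 = (ms_row - ms_error) / (ms_row + (k - 1) * ms_error)

# Agreement, single rater: rater variance enters the denominator.
icc_a1 = (ms_row - ms_error) / (
    ms_row + (k - 1) * ms_error + k * (ms_rater - ms_error) / n
)

# Agreement, mean of the k = 4 ratings.
icc_a4 = (ms_row - ms_error) / (ms_row + (ms_rater - ms_error) / n)

print(round(icc_c1, 3), round(icc_a1, 3))  # 0.698 0.595
```

The consistency value .698 and the agreement value .595 match the entry's worked results; the mean-rating coefficient comes out higher, as averaging over raters always improves reliability.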
and improve the test by identifying items that will increase the ability of the test to discriminate among the scores of those who take the test. In norm-referenced interpretations, the better an item discriminates among examinees within the range of interest, the more information that item provides. Discrimination among examinees is the most crucial characteristic desired in an item used for a norm-referenced purpose. The discriminating power is determined by the magnitude of the item discrimination index that will be discussed next.

Discrimination Index

The item discrimination index, which is usually designated by the uppercase letter D (also net D, U-L, ULI, and ULD), shows the difference between upper and lower scorers answering the item correctly. It is the degree to which high scorers were inclined to get each item right and low scorers were inclined to get that item wrong. D is a measure of the relationship between the item and the total score, where the total score is used as a substitute for a criterion of success on the measure being assessed by the test. For norm-referenced interpretations, there is usually no available criterion of success because that criterion is what the test is supposed to represent.

Upper (U) and lower (L) groups can be decided by dividing the arranged descending scores into three groups: upper, middle, and lower. When there is a sufficient number of normally distributed scores, Truman Kelley demonstrated that using the top and bottom 27% (1.225 standard deviation units from the mean) of the scores as upper and lower groups would be the best to provide a wide difference between the groups and to have an adequate number of scores in each group. When the total number of scores is between 20 and 40, it is advised to select the top 10 and the bottom 10 scores. When the number of scores is less than or equal to 20, two groups are used without any middle group.

One way of computing the item discrimination index is finding the difference of percentages of correct responses of U and L groups by computing U_p − L_p (U percentage minus L percentage). Typically, this percentage difference is multiplied by 100 to remove the decimals; the result is D, which yields values within the range of +100 to −100. The maximum value of D (+100) occurs when all examinees in the upper group get an item right and all examinees in the lower group fail. The value of D equals 0 when the item is correctly answered by all, none, or any other same percentage of examinees in both upper and lower groups. D has a negative value when the percentage of students answering the item correctly in the lower group is greater than the percentage of correct responses in the upper group.

A test with items having high D values produces more spread in scores, therefore contributing to the discrimination in ability among examinees. The item discrimination index is the main factor that directly affects item selection in norm-referenced tests. Ideally, items selected for norm-referenced interpretations are considered good with D values above 30 and very good with values above 40. The reliability of the test will be higher by selecting items that have higher item discrimination indices.

Warren Findley demonstrated that the index of item discrimination is absolutely proportional to the difference between the numbers of correct and incorrect discriminations (bits of information) of the item. Assuming 50 individuals in the upper and 50 individuals in the lower groups, the ideal item would distinguish each of the 50 individuals in the U group from each in the L group. Thus, 50 × 50 = 2,500 possible correct discriminations or bits of information. Considering an item on which 45 individuals of the upper group but only 20 individuals of the lower group answer the item correctly, the item would distinguish the 45 individuals who answered correctly from the upper group from the 30 individuals who answered incorrectly in the lower group, generating a total of 45 × 30 = 1,350 correct discriminations. Consequently, the 5 individuals answering the item incorrectly in the upper group are distinguished incorrectly from the 20 individuals who answered correctly in the lower group, generating 5 × 20 = 100 incorrect discriminations. The net amount of effective discriminations of the item is 1,350 − 100 = 1,250 discriminations, which is 50% of the 2,500 total maximum possible correct discriminations. Note that the difference between the number of examinees who answer the item correctly in the upper group and in the lower group is 45 − 20 = 25, which is half of the maximum possible difference with 50 in each group.
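The D index and Findley's discrimination counts for a worked case (45 of 50 upper and 20 of 50 lower answering correctly) can be verified in a few lines. A minimal sketch (the function name is illustrative, not from the entry):

```python
# Upper-lower discrimination index D and Findley's discrimination counts.
def d_index(upper_correct, upper_n, lower_correct, lower_n):
    """D = 100 * (proportion correct in U minus proportion correct in L)."""
    return 100 * (upper_correct / upper_n - lower_correct / lower_n)

d = d_index(45, 50, 20, 50)                 # D = 50

correct_disc = 45 * (50 - 20)               # 45 upper-correct vs 30 lower-incorrect
incorrect_disc = (50 - 45) * 20             # 5 upper-incorrect vs 20 lower-correct
net = correct_disc - incorrect_disc         # net effective discriminations
share = net / (50 * 50)                     # fraction of the 2,500 possible

print(round(d), net, share)  # 50 1250 0.5
```

D = 50 and a net of 1,250 out of 2,500 possible discriminations (50%) agree, illustrating the proportionality Findley demonstrated.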
Another statistic used to estimate the item discrimination is the point-biserial correlation coefficient. The point-biserial correlation is obtained when the Pearson product-moment correlation is computed from a set of paired values where one of the variables is dichotomous and the other is continuous. Correlation of passing or failing an individual item with the overall test scores is the common example for the point-biserial correlation in item analysis. The Pearson correlation can be calculated using SPSS or SAS. One advantage cited for using the point-biserial as a measure of the relationship between the item and the total score is that it uses all the scores in the data rather than just the scores of selected groups of students.

Difficulty Index

The item difficulty index, which is denoted as p, is the proportion of the number of examinees answering the item correctly to the total number of examinees. The item difficulty index ranges from 0 to 1. Item difficulty also can be presented as a whole number by multiplying the resulting decimal by 100. The difficulty index shows the percentage of examinees who answer the item correctly, so the easier the item, the greater the value. Because the difficulty index increases as the difficulty of an item decreases, some have suggested that it be called the easiness index. Robert Ebel suggested computing item difficulty by finding the proportion of examinees that answered the item incorrectly to the total number of examinees. In that case, a lower percentage would mean an easier item.

It is sometimes desirable to select items with a moderate spread of difficulty and with an average difficulty index near 50. A difficulty index of 50 means that half the examinees answer an item correctly and the other half answer incorrectly. However, one needs to consider the purpose of the test when selecting the appropriate difficulty level of items. The desired average item difficulty index might be increased with multiple-choice items, where an examinee has a chance to guess. The item difficulty index is useful when arranging the items in the test. Usually, items are arranged from the easiest to the most difficult for the benefit of examinees.

In norm-referenced interpretations, an item does not contribute any information regarding the difference in abilities if it has a difficulty index of 0, which means none of the examinees answer correctly, or 100, which means all of the examinees answer correctly. Thus, items selected for norm-referenced interpretations usually do not have difficulty indices near the extreme values of 0 or 100.

Criterion-Referenced Interpretations

Criterion-referenced interpretations help to interpret the scores in terms of specified performance standards. In psychometric terms, these interpretations are concerned with absolute rather than relative measurement. The term absolute is used to indicate an interest in assessing whether a student has a certain performance level, whereas relative indicates how a student compares with other students. For criterion-referenced purposes, there is little interest in a student's relative standing within a group.

Item statistics such as item discrimination and item difficulty as defined previously are not used in the same way for criterion-referenced interpretations. Although validity is an important consideration in all test construction, the content validity of the items in tests used for criterion-referenced interpretations is essential. Because in many instances of criterion-referenced testing the examinees are expected to succeed, the bell curve is generally negatively skewed. However, the test results vary greatly depending on the amount of instruction the examinees have had on the content being tested. Item analysis must focus on group differences and might be more helpful in identifying problems with instruction and learning rather than guiding the item selection process.

Discrimination Index

A discrimination index for criterion-referenced interpretations is usually based on a different criterion than the total test score. Rather than using an item–total score relationship as in norm-referenced analysis, the item–criterion relationship is more relevant for a criterion-referenced analysis. Thus, the upper and lower groups for a discrimination index should be selected based on their performance on the criterion for the standards or curriculum of interest. A group of students who have mastered the skills of interest could comprise the upper group, whereas those who have not yet learned those skills could be in the lower group. Or a group of instructed students could comprise the upper group, with those who have had no instruction on the content of interest comprising the lower group. In either of these examples, the D index would represent a useful measure to help discriminate the masters versus the nonmasters of the topic.

After adequate instruction, a test on specified content might result in very little score variation with many high scores. All of the examinees might even get perfect scores, which will result in a discrimination index of zero. But all these students would be in the upper group and not negatively impact the discrimination index if calculated properly.

Similarly, before instruction, a group might all score very low, again with very little variation in the scores. These students would be the lower group in the discrimination index calculation. Within each group, there might be no variation at all, but between the groups, there will be evidence of the relationship between instruction and success. Thus, item discrimination can be useful in showing which items measure the relevance to instruction, but only if the correct index is used.

As opposed to norm-referenced interpretations, a large variance in performance of an instructed group in criterion-referenced interpretation would probably indicate an instructional flaw or a learning problem on the content being tested by the test item. But variance between the instructed group and the not-instructed group is desired to demonstrate the relevance of the instruction and the sensitivity of the test for detecting instructional success.

Difficulty Indexes

The difficulty index of the item in criterion-referenced interpretations might be a relatively more useful statistic than the discrimination index in identifying what concepts were difficult to master by students. In criterion-referenced testing, items should probably have an average difficulty index around 80 or 90 within instructed groups. It is important to consider the level of instruction a group has had before interpreting the difficulty index.

In criterion-referenced interpretations, a difficulty index of 0 or 100 is not as meaningless as in norm-referenced interpretations, because it must be known for what group the index was obtained. The difficulty index of zero would mean that none of the examinees answered the item correctly, which is informative regarding the "nonmastered" content and could be expected for the group that was not instructed on this content. The difficulty index of 100 would mean that all the examinees answered an item correctly, which would confirm the mastery of content by examinees who had been well taught. Items with difficulties of 0 or 100 would be rejected for norm-referenced purposes because they do not contribute any information about the examinee's relative standing. Thus, the purpose of the test is important in using item indices.

Effectiveness of Distractors

One common procedure during item analysis is determining the performance of the distractors (incorrect options) in multiple-choice items. Distractors are expected to enhance the measurement properties of the item by being acceptable options for the examinees with incomplete knowledge of the content assessed by the item. The discrimination index is desired to be negative for distractors and intended to be positive for the correct options. Distractors that are positively correlated with the test total jeopardize the reliability of the test; therefore, they should be replaced by more appropriate ones. Also, there is percent marked, (percentage upper + percentage lower)/2, for each option as a measure of "attractiveness." If too few examinees select an option, its inclusion in the test might not be contributing to good measurement unless the item is the one indicating mastery of the subject. Also, if a distractor is relatively more preferred among the upper group examinees, there might be two possible correct answers.

A successful distractor is one that is attractive to the members of the low-scoring group and not attractive to the members of the high-scoring group. When constructing a distractor, one should try to find the misconceptions related to the concept being tested. Some methods of obtaining acceptable options might be use of context terminology, use of true statements for different arguments, and inclusion of options of similar difficulty and complexity.
the number of dimensions assumed to underlie performance, the number of item characteristics assumed to influence responses, and the mathematical form of the model relating the person and item characteristics to the observed response. The item score might be dichotomous (correct/incorrect), polytomous as in multiple-choice response or graded performance scoring, or continuous as in a measured response. Dichotomous models have been the most widely used models in educational contexts because of their suitability for multiple-choice tests. Polytomous models are becoming more established as performance assessment becomes more common in education. Polytomous and continuous response models are appropriate for personality or affective measurement. Continuous response models are not well known and are not discussed here.

The models that are currently used most widely assume that there is a single trait or dimension underlying performance; these are referred to as unidimensional models. Multidimensional models, although well developed theoretically, have not been widely applied. Whereas the underlying dimension is often referred to as "ability," there is no assumption that the characteristic is inherent or unchangeable.

Models for dichotomous responses incorporate one, two, or three parameters related to item characteristics. The simplest model, the one-parameter model, is based on the assumption that the only item characteristic influencing an individual's response is the difficulty of the item. A model known as the Rasch model has the same form as the one-parameter model but is based on different measurement principles. The Rasch theory of measurement was popularized in the United States by Benjamin Wright. The two-parameter model adds a parameter for item discrimination, reflecting the extent to which the item discriminates among individuals with differing levels of the trait. The three-parameter model adds a lower asymptote or pseudo-guessing parameter, which gives the probability of a correct response for an individual with an infinitely low level of the trait.

The earliest IRT models used a normal ogive function to relate the probability of a correct response to the person and item characteristics. Although the normal ogive model is intuitively appealing and provides a connection between IRT item parameters and classic item indices, an integral must be calculated to obtain the probability of a correct response. Allen Birnbaum proposed a more mathematically tractable cumulative logistic function. With an appropriate scaling factor, the normal ogive and logistic functions differ by less than .05 over the entire trait continuum. The logistic model has become widely accepted as the basic item response model for dichotomous and polytomous responses.

The unidimensional three-parameter logistic model for dichotomous responses is given by

P(u_j = 1 \mid \theta) = c_j + (1 - c_j)\,\frac{e^{1.7 a_j(\theta - b_j)}}{1 + e^{1.7 a_j(\theta - b_j)}},

where u_j is the individual's response to item j, scored 1 for correct and 0 for incorrect; θ is the individual's value on the trait being measured; P(u_j = 1 | θ) is the probability of a correct response to item j given θ; c_j is the lower asymptote parameter; a_j is the item discrimination parameter; b_j is the item difficulty parameter; and 1.7 is the scaling factor required to scale the logistic function to the normal ogive. The curve produced by the model is called the item characteristic curve (ICC). ICCs for several items with differing values of the item parameters are shown in Figure 1. The two-parameter model is obtained by omitting the lower asymptote parameter, and the one-parameter model is obtained by subsequently omitting the discrimination parameter. The two-parameter model assumes that low-performing individuals have no chance of answering the item correctly through guessing, whereas the one-parameter model assumes that all items are equally discriminating.

The lower asymptote parameter is bounded by 0 and 1 and is usually less than .3 in practice. The discrimination parameter is proportional to the slope of the curve at its point of inflection; the steeper the slope, the greater the difference in probability of correct response for individuals of different trait levels and, hence, the more discriminating the item. Discrimination parameters must be positive for valid measurement. Under the one- and two-parameter models, the difficulty parameter represents the point on the trait continuum where the probability of a correct response is 0.5;
Figure 1  Item Characteristic Curves for Three Items Under the Three-Parameter Logistic Item Response Theory Model
(The figure plots probability of correct response against trait value from −3.0 to 3.0; one of the plotted items has a = 1.0, b = −1.0, c = 0.2.)
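Curves like those in Figure 1 can be generated directly from the model. A minimal sketch (the function name is illustrative, and the code uses an algebraically equivalent form of the 3PL expression):

```python
from math import exp

def p_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response.

    Equivalent to c + (1 - c) * e^{1.7a(theta - b)} / (1 + e^{1.7a(theta - b)}).
    """
    return c + (1 - c) / (1 + exp(-scale * a * (theta - b)))

# At theta = b the probability is (1 + c)/2; with c = .2 that is .6,
# as for the Figure 1 item with a = 1.0, b = -1.0, c = 0.2.
p = p_3pl(-1.0, a=1.0, b=-1.0, c=0.2)
print(round(p, 3))  # 0.6
```

Setting c = 0 recovers the two-parameter model (probability .5 at θ = b), and fixing a constant across items as well gives the one-parameter form.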
under the three-parameter model, the probability of a correct response at θ = b_j is (1 + c_j)/2.

Note the indeterminacy in the model just given: The model does not specify a scale for a, b, and θ. A linear transformation of parameters will produce the same probability of a correct response; that is, if θ* = Aθ + B, b* = Ab + B, and a* = a/A, then a*(θ* − b*) = a(θ − b) and P(θ*) = P(θ). The scale for parameter estimates is typically fixed by standardizing on either the θ values or the b values. With this scaling, the θ and b parameter estimates generally fall in the range (−3, 3) and the a parameter estimates are generally between 0 and 2.

There are several item response models for polytomous responses. When there is no assumption that the response categories are on an ordered scale, the nominal response model might be used to model the probability that an individual will score in a particular response category. The nominal response model is not widely used because polytomously scored responses are generally ordered in practice, as, for example, in essay or partial credit scoring. There are several models for ordered polytomous responses: The most well known are the graded response model (GRM), the partial credit model (PCM), the generalized partial credit model (GPCM), and the rating scale model (RSM). Only the GRM and the GPCM are described here.

The GRM is obtained by formulating two-parameter dichotomous models for the probability that an examinee will score in each response category or higher (as opposed to a lower category), then subtracting probabilities to obtain the probability of scoring within each response category; that is,

P(u_j \ge k \mid \theta) = P_k^*(\theta) = \frac{e^{a_j(\theta - b_{jk})}}{1 + e^{a_j(\theta - b_{jk})}}, \quad k = 1, \ldots, m - 1; \quad P_0^*(\theta) = 1; \quad P_m^*(\theta) = 0;

P(u_j = k \mid \theta) = P_k^*(\theta) - P_{k+1}^*(\theta).

Here, responses are scored 0 through m − 1, where m is the number of response categories, k is the response category of interest, a_j is the discrimination parameter, interpreted as in dichotomous models, and b_jk is the category parameter.
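The GRM's subtract-the-cumulatives construction can be sketched as follows (the function name and example values are illustrative, not from the entry):

```python
from math import exp

def grm_probs(theta, a, bs):
    """Graded response model category probabilities.

    bs holds the ordered category parameters b_j1 .. b_j(m-1); returns
    P(u = 0), ..., P(u = m - 1) by differencing the cumulative curves,
    with P*_0 = 1 and P*_m = 0 as boundary conditions.
    """
    cum = [1.0] + [1 / (1 + exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

probs = grm_probs(0.0, a=1.0, bs=[-1.0, 0.0, 1.0])
print(round(sum(probs), 6), all(p >= 0 for p in probs))  # 1.0 True
```

Because the category parameters are ordered, every differenced probability is nonnegative, and the categories always sum to 1.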
Figure 2  Item Response Category Characteristic Curves for an Item Under the GPCM
(The figure plots probability of response for categories u = 0, 1, 2, and 3 against trait value from −3.5 to 3.5.)
The category parameters represent the level of the trait needed to have a 50% chance of scoring in that category or higher. Category parameters are necessarily ordered on the trait scale. It is assumed that the item is equally discriminating across category boundaries. The model provides a separate item response function for each response category; the resultant curves are called item response category characteristic curves (IRCCs).

The GPCM differs from the GRM in that it is based on a comparison of adjacent categories. The model is given by

P(u_j = k | θ) = P_jk(θ) = exp[Σ_{v=1..k} a_j(θ − b_jv)] / (1 + Σ_{c=1..m−1} exp[Σ_{v=1..c} a_j(θ − b_jv)]),   k = 1, …, m − 1;

P_j0(θ) = 1 / (1 + Σ_{c=1..m−1} exp[Σ_{v=1..c} a_j(θ − b_jv)]).

In this model, the category parameter b_jk represents the trait value at which an individual has an equal probability of scoring in category k versus category (k − 1). The category parameters need not be ordered under this model. The GPCM is a generalization of the PCM, which assumes equal discriminations across items and omits the discrimination parameter in the model. An example of IRCCs for a polytomous item under the GPCM is shown in Figure 2.

Item response theory has several advantages over classic test theory in measurement applications. First, IRT item parameters are invariant across subpopulations, whereas classic item indices change with the performance level and heterogeneity of the group taking the test. Person parameters are invariant across subsets of test items measuring the same dimension; whether the test is easy or hard, an individual's trait value remains the same. This is not the case with total test score, which depends on the difficulty of the test. The invariance property is the most powerful feature of item response models and provides a solid theoretical base for applications such as test construction, equating, and adaptive testing. Note that invariance is a property of the parameters and holds only in the population; estimates will vary across samples of persons or items.

A second advantage of IRT is individualized standard errors of measurement, rather than a group-based measure such as is calculated in
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38, II28–II42.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
Van der Linden, W. J., & Glas, C. A. W. (2000). Computerized adaptive testing: Theory and practice. Boston: Kluwer.
Wainer, H. (Ed.). (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

ITEM-TEST CORRELATION

The item-test correlation is the Pearson correlation coefficient calculated for pairs of scores where one item of each pair is an item score and the other item is the total test score. The greater the value of the coefficient, the stronger is the correlation between the item and the total test. Test developers strive to select items for a test that have a high correlation with the total score to ensure that the test is internally consistent. Because the item-test correlation is often used to support the contention that the item is a ''good'' contributor to what the test measures, it has sometimes been called an index of item validity. That term applies only to a type of evidence called internal structure validity, which is synonymous with internal consistency reliability. Because the item-test correlation is clearly an index of internal consistency, it should be considered as a measure of item functioning associated with that type of reliability. The item-test correlation is one of many item discrimination indices used in item analysis.

Because item responses are typically scored as zero when incorrect and unity (one) if correct, the item variable is binary or dichotomous (having two values). The resulting correlation is properly called a point-biserial coefficient when a binary item is correlated with a total score that has more than two values (called polytomous or continuous). However, some items, especially essay items, performance assessments, or those for inclusion in affective scales, are not usually dichotomous, and thus some item-test correlations are regular Pearson coefficients between polytomous items and total scores. The magnitude of correlations found when using polytomous items is usually greater than that observed for dichotomous items. Reliability is related to the magnitude of the correlations and to the number of items in a test, and thus with polytomous items, a lesser number of items is usually sufficient to produce a given level of reliability. Similarly, to the extent that the average of the item-test correlations for a set of items is increased, the number of items needed for a reliable test is reduced.

All correlations tend to be higher in groups that have a wide range of talent than in groups where there is a more restricted range. In that respect, the item-test correlation presents information about the group as well as about the item and the test. The range of talent in the group might be limited in some samples, for example, in a group of students who have all passed prerequisites for an advanced class. In groups where a restriction of range exists, the item-test correlations will provide a lower estimate of the relationship between the item and the test.

When the range of talent in the group being tested is not restricted, the item-test correlation is a spurious measure of item quality. The spuriousness arises from the inclusion of the particular item in the total test score, resulting in the correlation between an item and itself being added to the correlation between the item and the rest of the total test score. A preferred concept might be the item–rest correlation, which is the correlation between the item and the sum of the rest of the item scores.
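The spuriousness is easy to see numerically. The sketch below uses a tiny hypothetical set of 0/1 item scores (four persons, three items) and a hand-rolled Pearson function; it illustrates the definitions only, not any particular package's output:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical dichotomous item scores: rows = persons, columns = items
items = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in items]
item1 = [row[0] for row in items]

# Item-test correlation: item against a total that still contains the item
r_item_test = pearson(item1, totals)

# Item-rest correlation: item against the sum of the remaining items
rest = [t - i for t, i in zip(totals, item1)]
r_item_rest = pearson(item1, rest)
```

For these scores r_item_test is about .894 while r_item_rest is about .707: removing the item from the total strips out the item-with-itself component and lowers the coefficient.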
Another term for this item-rest correlation is the corrected item-test correlation, the name given to this type of index in the SPSS Scale Reliability analysis (SPSS, an IBM company).

The intended use of the test is an important factor in interpreting the magnitude of an item-test correlation. If the intention is to develop a test with high criterion-related test validity, one might seek items that have high correlations with the external criterion but relatively lower correlations with the total score. Such items presumably measure aspects of the criterion that are not adequately covered by the rest of the test and could be preferred to items correlating highly with both the criterion and the test score. However, unless item-test correlations are substantial, the tests composed of items with high item-criterion correlations might be too heterogeneous in content to provide meaningful interpretations of the test scores. Thus, the use of item-test correlations to select items for a test is based on the goal of establishing internal consistency reliability rather than direct improvement in criterion validity.

The item-test correlation resembles the loading of the item on the first principal component, or the unrotated first component, of an analysis of all the items in a test. The concepts are related, but the results are not identical in these different approaches to representing the ''loading'' or ''impact'' of an item. However, principal components analysis might help the reader to understand the rationale of measuring the relationship between an item and some related measure. Typically, using any of these methods, the researcher wants the item to represent the same trait as the total test, component, or factor of interest. This description is limited to a single-factor test or a single-component measure, just as the typical internal consistency reliability is reported for a homogeneous test measuring a single trait. If the trait being measured is not a unitary trait, then other approaches are suggested because the regular item–total correlation will underestimate the value of an item.

If a researcher has a variable that is expected to measure more than one trait, then the item–total correlation can be obtained separately for each subset of items. In this approach, each total represents the subtotal score for items of a subset rather than a total test score. Using item–subtotal correlations, the researcher might develop (or examine) a measure that was internally consistent for each subset of items. To get an overall measure of reliability for such a multitrait test, a stratified alpha coefficient would be the desired reliability estimate rather than the regular coefficient alpha as reported by SPSS Scale Reliability. In the case of multitrait tests, using a regular item–total correlation for item analysis would likely present a problem because the subset with the most items could contribute too much to the total test score. This heavier concentration of similar items in the total score would result in the items in the other subsets having lower item-test correlations and might lead to their rejection.

In analyzing multiple-choice items to determine how each option might contribute to the reliability of the total test, one can adapt the item-test correlation to become an option-test correlation. In this approach, each option is examined to determine how it correlates with the total test (or subtest). If an option is expected to be a distractor (wrong response), the option–test correlation should be negative (this assumes the option is scored 1 if selected, 0 if not selected). Distractors that are positively correlated with the test or subtest total are detracting from the reliability of the measure; these options should probably be revised or eliminated.

The item-test correlation was first associated with test analysis based on classic test theory. Another approach to test analysis is called item response theory (IRT). With IRT, the relationship between an item and the trait measured by the total set of items is usually represented by an item characteristic curve, which is a nonlinear regression of the item on the measure of ability representing the total test. An index called the point-biserial correlation is sometimes computed in an IRT item analysis statistical program, but that correlation might not be exactly the same as the item-test correlation. The IRT index might be called an item-trait or an item-theta correlation, because the item is correlated with a measure of the ability estimated differently than based on the total score. Although this technical difference might exist, there is little substantial difference between the item-trait and the item-test correlations. The explanations in this entry
656 Jackknife
T = f(X_1, …, X_n, …, X_N).   (1)

An estimation of the population parameter obtained without the nth observation is called the nth partial prediction and is denoted T_{-n}. Formally:

T_{-n} = f(X_1, …, X_{n-1}, X_{n+1}, …, X_N).   (2)

A pseudovalue estimation of the nth observation is denoted T*_n; it is computed as the difference between the parameter estimation obtained from the whole sample and the parameter estimation obtained without the nth observation. Formally:

T*_n = NT − (N − 1)T_{-n}.   (3)

The jackknife estimate of θ, denoted T*, is obtained as the mean of the pseudovalues. Formally:

T* = T̄* = (1/N) Σ_n T*_n,   (4)

where T̄* is the mean of the pseudovalues. The variance of the pseudovalues is denoted σ̂²_{T*_n} and is obtained with the usual formula:

σ̂²_{T*_n} = Σ_n (T*_n − T̄*)² / (N − 1).   (5)

The standard error of the jackknife estimate and its confidence interval follow:

σ̂_{T*} = σ̂_{T*_n} / √N,   (6)

T* ± t_{α,ν} σ̂_{T*},   (7)

with t_{α,ν} being the α-level critical value of a Student's t distribution with ν = N − 1 degrees of freedom.

Jackknife Without Pseudovalues

Pseudovalues are important for understanding the inner working of the jackknife, but they are not computationally efficient. Alternative formulas using only the partial estimates can be used in lieu of the pseudovalues. Specifically, if T̄• denotes the mean of the partial estimates and σ̂_{T_{-n}} their standard deviation, then T* (cf. Equation 4) can be computed as

T* = NT − (N − 1)T̄•   (8)

and σ̂_{T*} (cf. Equation 6) can be computed as

σ̂_{T*} = √[((N − 1)/N) Σ_n (T_{-n} − T̄•)²] = (N − 1) σ̂_{T_{-n}} / √N.   (9)

Assumptions of the Jackknife

Although the jackknife makes no assumptions about the shape of the underlying probability
distribution, it requires that the observations are independent of each other. Technically, the observations are assumed to be independent and identically distributed (i.e., in statistical jargon: i.i.d.). This means that the jackknife is not, in general, an appropriate tool for time-series data. When the independence assumption is violated, the jackknife underestimates the variance in the data set, which makes the data look more reliable than they actually are.

Because the jackknife eliminates the bias by subtraction (which is a linear operation), it works correctly only for statistics that are linear functions of the parameters or the data, and whose distribution is continuous or at least ''smooth enough'' to be considered as such. In some cases, linearity can be achieved by transforming the statistics (e.g., using a Fisher Z transform for correlations or a logarithm transform for standard deviations), but some nonlinear or noncontinuous statistics, such as the median, will give very poor results with the jackknife no matter what transformation is used.

Bias Estimation

The jackknife was originally developed by Quenouille as a nonparametric way to estimate and reduce the bias of an estimator of a population parameter. The bias of an estimator is defined as the difference between the expected value of this estimator and the true value of the population parameter. So formally, the bias, denoted β, of an estimation T of the parameter θ is defined as

β = E{T} − θ,   (10)

with E{T} being the expected value of T.

The jackknife estimate of the bias is computed by replacing the expected value of the estimator [i.e., E{T}] by the biased estimator (i.e., T) and by replacing the parameter (i.e., θ) by the ''unbiased'' jackknife estimator (i.e., T*). Specifically, the jackknife estimator of the bias, denoted β_jack, is computed as

β_jack = T − T*.   (11)

Generalizing the Performance of Predictive Models

Recall that the name jackknife refers to two related, but different, techniques (and this is sometimes a source of confusion). The first technique, presented in the preceding discussion, estimates population parameters and their standard error. The second technique evaluates the generalization performance of predictive models. In these models, predictor variables are used to predict the values of dependent variable(s). In this context, the problem is to estimate the quality of the prediction for new observations. Technically speaking, the goal is to estimate the performance of the predictive model as a random effect model. The problem of estimating the random effect performance for predictive models is becoming a crucial problem in domains such as, for example, bio-informatics and neuroimaging, because the data sets used in these domains typically comprise a very large number of variables (often a much larger number of variables than observations, a configuration called the ''small N, large P'' problem). This large number of variables makes statistical models notoriously prone to overfitting.

In this context, the goal of the jackknife is to estimate how a model would perform when applied to new observations. This is done by dropping in turn each observation and fitting the model for the remaining set of observations. The model is then used to predict the left-out observation. With this procedure, each observation has been predicted as a new observation.

In some cases a jackknife can perform both functions, thereby generalizing the predictive model as well as finding the unbiased estimate of the parameters of the model.

Example: Linear Regression

Suppose that we had performed a study examining the speech rate of children as a function of their age. The children's age (denoted X) would be used as a predictor of their speech rate (denoted Y). Dividing the number of words said by the time needed to say them would produce the speech rate (expressed in words per minute) of each child. The results of this (fictitious) experiment are shown in Table 1. We will use these data to illustrate how the jackknife can be used to (a) estimate the regression parameters and their bias and (b) evaluate the generalization performance of the regression model. As a preliminary step, the data are analyzed by
Table 1   Data From a Study Examining the Speech Rate of Children as a Function of Age

n    Xn    Yn      Ŷn        Ŷ*n       Ŷjack,n
1     4    91     95.0000   94.9986   97.3158
2     5    96     96.2500   96.1223   96.3468
3     6   103     97.5000   97.2460   95.9787
4     9    99    101.2500  100.6172  101.7411
5     9   103    101.2500  100.6172  100.8680
6    15   108    108.7500  107.3596  111.3962

Notes: The independent variable is the age of the child (X). The dependent variable is the speech rate of the child in words per minute (Y). The values of Ŷ are obtained as Ŷ = 90 + 1.25X. Xn is the value of the independent variable, Yn is the value of the dependent variable, Ŷn is the value of the dependent variable predicted from the regression, Ŷ*n is the value predicted from the jackknife-derived unbiased estimates, and Ŷjack,n is the predicted value when each observation is predicted from the corresponding jackknife partial estimates.

Following Equation 3, the pseudovalues of the intercept a and the slope b are

a*_n = Na − (N − 1)a_{-n}   and   b*_n = Nb − (N − 1)b_{-n},   (14)

and for the first observation, this equation becomes

a*_1 = 6 × 90 − 5 × 93.5789 = 72.1053   and   b*_1 = 6 × 1.25 − 5 × 0.9342 = 2.8289.   (15)

Table 2 gives the partial estimates and pseudovalues for the intercept and slope of the regression. From this table, we can find that the jackknife estimates of the regression will give the following equation for the prediction of the dependent variable (the prediction using the jackknife estimates is denoted Ŷ*n):

Ŷ*n = a* + b*X = 90.5037 + 1.1237X.   (16)

Table 2   Partial Estimates and Pseudovalues for the Regression Example of Table 1
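Equations 14 and 15 can be verified directly from the Table 1 data. The sketch below refits the regression with each observation dropped and forms the pseudovalues; ols is a bare-bones least-squares helper written only for this illustration:

```python
X = [4, 5, 6, 9, 9, 15]          # age of child (Table 1)
Y = [91, 96, 103, 99, 103, 108]  # speech rate in words per minute (Table 1)
N = len(X)

def ols(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

a, b = ols(X, Y)  # full-sample estimates: a = 90, b = 1.25

# Partial estimates a_{-n}, b_{-n}: refit with the nth observation dropped
partials = [ols(X[:i] + X[i + 1:], Y[:i] + Y[i + 1:]) for i in range(N)]

# Pseudovalues (Equation 14); their means are the jackknife estimates
a_pseudo = [N * a - (N - 1) * ai for ai, bi in partials]
b_pseudo = [N * b - (N - 1) * bi for ai, bi in partials]
a_star = sum(a_pseudo) / N   # about 90.5037
b_star = sum(b_pseudo) / N   # about 1.1237
```

The first pseudovalues reproduce Equation 15 (a*_1 ≈ 72.1053, b*_1 ≈ 2.8289), and the pseudovalue means reproduce the jackknife regression estimates of Equation 16.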
transformed back to a correlation, produces a value of the jackknife estimate for the correlation of r* = .7707. Incidentally, this value is very close to the value obtained with another classic alternative population unbiased estimate called the shrunken r, which is denoted r̃ and computed as

r̃ = √{1 − (1 − r²)[(N − 1)/(N − 2)]} = √{1 − (1 − .8333²)(5/4)} = .7862.   (18)

Confidence intervals are computed using Equation 7. For example, taking into account that the α = .05 critical value for a Student's t distribution for ν = 5 degrees of freedom is equal to t_{α,ν} = 2.57, the confidence interval for the intercept is equal to

a* ± t_{α,ν} σ̂_{a*} = 90.5037 ± 2.57 × (10.6622/√6) = 90.5037 ± 2.57 × 4.3528 = 90.5037 ± 11.1868.   (19)

The bias of the estimate is computed from Equation 11. For example, the bias of the estimation of the coefficient of correlation is equal to

β_jack(r) = r − r* = .8333 − .7707 = .0627.   (20)

The bias is positive, and this shows (as expected) that the coefficient of correlation overestimates the magnitude of the population correlation.

Estimate of the Generalization Performance of the Regression

To estimate the generalization performance of the regression, we need to evaluate the performance of the model on new data. These data are supposed to be randomly selected from the same population as the data used to build the model. The jackknife strategy here is to predict each observation as a new observation; this implies that each observation is predicted from its partial estimates of the prediction parameters. Specifically, if we denote by Ŷjack,n the jackknife predicted value of the nth observation, the jackknife regression equation becomes

Ŷjack,n = a_{-n} + b_{-n}X_n.   (21)
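Equation 21 can likewise be run on the Table 1 data. A minimal sketch (the helpers are bare-bones implementations written for this illustration) that predicts each observation from the regression fitted without it, then correlates predictions with observations:

```python
import math

X = [4, 5, 6, 9, 9, 15]          # age of child (Table 1)
Y = [91, 96, 103, 99, 103, 108]  # speech rate (Table 1)
N = len(X)

def ols(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Equation 21: predict observation n from the fit that excluded it
Y_jack = []
for n in range(N):
    a_n, b_n = ols(X[:n] + X[n + 1:], Y[:n] + Y[n + 1:])
    Y_jack.append(a_n + b_n * X[n])

r_jack = pearson(Y_jack, Y)      # about .6825
```

The first prediction reproduces Equation 22 (Ŷjack,1 ≈ 97.3158), and the correlation between the jackknife predictions and the observed values reproduces the r_jack value discussed in the text.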
660 John Henry Effect
So, for example, the first observation is predicted from the regression model built with observations 2 to 6; this gives the following predicting equation for Ŷjack,1 (cf. Tables 1 and 2):

Ŷjack,1 = a_{-1} + b_{-1}X_1 = 93.5789 + 0.9342 × 4 = 97.3158.   (22)

The jackknife predicted values are listed in Table 1. The quality of the prediction of these jackknife values can be evaluated, once again, by computing a coefficient of correlation between the predicted values (i.e., the Ŷjack,n) and the actual values (i.e., the Yn). This correlation, denoted r_jack, for this example is equal to r_jack = .6825. It is worth noting that, in general, the coefficient r_jack is not equal to the jackknife estimate of the correlation r* (which, recall, is in our example equal to r* = .7707).

statistical sciences (Vol. 4, pp. 280–287). New York: Wiley.
Manly, B. F. J. (1997). Randomization, bootstrap, and Monte Carlo methods in biology (2nd ed.). New York: Chapman & Hall.
Miller, R. G. (1974). The jackknife: A review. Biometrika, 61, 1–17.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43, 353–360.
Shao, J., & Tu, D. (1995). The jackknife and the bootstrap. New York: Springer-Verlag.
Tukey, J. W. (1958). Bias and confidence in not quite large samples (abstract). Annals of Mathematical Statistics, 29, 614.
Tukey, J. W. (1986). The future of processes of data analysis. In The collected works of John W. Tukey (Vol. IV, pp. 517–549). New York: Wadsworth.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274–290.
teaching. Heinich noted that many of these studies demonstrated insignificant differences between control and experimental groups and often included results in which the control group outperformed the experimental condition. He was one of the first researchers to acknowledge that the validity of these experiments was compromised by the control groups' knowledge of their role as a control or baseline comparison group. Comparing the control groups with the title character from the ''Ballad of John Henry,'' Heinich described how a control group might exert extra effort to compete with or even outperform its comparison group.

In the ''Ballad,'' title character John Henry works as a rail driver whose occupation involves hammering spikes and drill bits into railroad ties to lay new tracks. John Henry's occupation is threatened by the invention of the steam drill, a machine designed to do the same job in less time. The ''Ballad of John Henry'' describes an evening competition in which Henry competes with the steam drill one on one and defeats it by laying more track. Henry's effort to outperform the steam drill causes a misleading result, however, because although he did in fact win the competition, his overexertion causes his death the next day.

Heinich's use of the folktale compares the performance by a control group with the performance of Henry, in which an unexpected result develops from the group's overexertion or out-of-the-ordinary performance. With both Henry and the control group, the fear of being replaced incites the spirit of competition and leads to an inaccurate depiction of the differences in performance by the control and experimental groups.

The effect was later studied extensively by Gary Saretsky, who expanded on the term's definition by pointing out the roles that both competition and fear play in producing the effect. In most cases, the John Henry effect is perceived as a control group's resultant behavior to the fear of being outperformed or replaced by the new strategy or novel technology.

Samantha John

See also Control Group; Hawthorne Effect

Further Readings

Heinich, R. (1970). Technology & the management of instruction. Washington, DC: Association for Educational Communications and Technology.
Saretsky, G. (1975). The John Henry Effect: Potential confounder of experimental vs. control group approaches to the evaluation of educational innovations. Presented at the annual meeting of the American Educational Research Association, Washington, DC.
K
KOLMOGOROV–SMIRNOV TEST

goodness-of-fit tests. An example illustrating the application and evaluation of a KS test is also provided.
test that is used to test whether two separate sets of data have the same distribution. As an example, one could have a set of scores for males and a set of scores for females. The two-sample KS test could be used to determine whether the distribution of male scores is the same as the distribution of female scores. The two-sample KS test does not require that the form of the hypothesized common distribution be specified. One does not need to specify whether the distribution is normal, exponential, and so on, and no parameters are estimated. The two-sample KS test is distribution free, so just one table of critical values suffices. The KS test has been extended further to test the equality of distributions when the number of samples exceeds two. For example, one could have scores from several different cities.

Goodness-of-Fit Tests

In all goodness-of-fit tests, there is a null hypothesis that states that the data have some distribution (e.g., normal). The alternative hypothesis states that the data do not have that distribution (e.g., not normal). In most empirical research, one hopes to conclude that the data have the hypothesized distribution. But in empirical research, one customarily sets up the research hypothesis as the alternative hypothesis. In goodness-of-fit tests, this custom is reversed. The result of the reversal is that a goodness-of-fit test can provide only a weak endorsement of the hypothesized distribution. The best one can hope for is a conclusion that the hypothesized distribution cannot be rejected, or that the data are consistent with the hypothesized distribution. Why not follow custom and set up the hypothesized distribution as the alternative hypothesis? Then rejection of the null hypothesis would be a strong endorsement of the hypothesized distribution at a specified low Type I error probability. The answer is that it is too hard to disprove a negative. For example, if the hypothesized distribution were standard normal, then the test would have to disprove all other distributions. The Type I error probability would be 100%.

Like all goodness-of-fit tests, the KS test is based on a measure of disparity between the empirical data and the hypothesized distribution. If the disparity exceeds a critical cutoff value, the hypothesized distribution is rejected. Each goodness-of-fit test uses a different measure of disparity. The KS test uses the maximum distance between the empirical distribution function of the data and the hypothesized distribution. The following example clarifies ideas.

Example

Suppose that one wishes to test whether the 50 randomly sampled data in Table 1 have a standard normal distribution (i.e., normal with mean = 0 and variance = 1). The smooth curve in Figure 1 shows the cumulative distribution function (CDF) of the standard normal distribution. That is, for each x on the horizontal axis, the smooth curve shows the standard normal probability less than or equal to x. These are the values displayed in every table of standard normal probabilities found in statistics textbooks. For example, if x = 2, the smooth curve has a value of 0.9772, indicating that 97.72% of the values in the distribution are found below 2 standard deviations above the mean, and only 2.28% of the values are more than 2 standard deviations above the mean. The jagged line in the figure shows the empirical distribution function (EDF; i.e., the proportion of the 50 data less than or equal to each x on the horizontal axis). The EDF is 0 below the minimum data value, is 1 above the largest data value, and steps up by 1/50 = 0.02 at each data value from left to right.

If the 50 data come from a standard normal distribution, then the smooth curve and the jagged line should be close together because the empirical proportion of data less than or equal to each x should be close to the proportion of the true distribution less than or equal to each x. If the true distribution is standard normal, then any disparity or gap between the two lines should be attributable to the random variation of sampling, or to the discreteness of the 0.02 jumps at each data value. However, if the true data distribution is not standard normal, then the true CDF and the standard normal smooth curve of Figure 1 will differ. The EDF will be closer to the true CDF than to the smooth curve. And a persistent gap will open between the two curves of Figure 1, provided the sample size is sufficiently large. Thus, if there is only a small gap between the two curves of the figure, then it is plausible that the data come from
Table 1   Example of KS Test (50 Randomly Sampled Data, Arranged in Increasing Order)

−1.7182   −0.9339   −0.2804   0.2694   1.0030
−1.7144   −0.9326   −0.2543   0.3230   1.0478
−1.6501   −0.8866   −0.2093   0.4695   1.0930
−1.5493   −0.6599   −0.1931   0.5368   1.2615
−1.4843   −0.5801   −0.1757   0.5686   1.2953
−1.3246   −0.5559   −0.1464   0.7948   1.4225
−1.3173   −0.4403   −0.0647   0.7959   1.6038
−1.2435   −0.4367   −0.0594   0.8801   1.6379
−1.1507   −0.3205    0.0786   0.8829   1.6757
−0.9391   −0.3179    0.1874   0.9588   1.6792

[Figure 1: CDF of the hypothesized standard normal distribution (smooth curve) and EDF of the 50 sampled data (jagged line); vertical axis CDF/EDF from 0.0 to 1.0, horizontal axis x from −3.0 to 3.0.]

to measure the gap. The area between the lines and the square of the area between the lines are among the measures that are used in other goodness-of-fit tests. In most cases, statistical software running on a computer will probably calculate both the value of the KS test and the critical value or p value. If the computation of the KS test is done by hand, it is necessary to exercise some care because of the discontinuities or jumps in the EDF at each data value.

Calculation of the KS Statistic

The KS statistic that measures the gap can be calculated in general by the formula

D_n = max(D_n⁺, D_n⁻),

in which

D_n⁺ = max_{1≤i≤n} [i/n − F_0(x_i)]

and

D_n⁻ = max_{1≤i≤n} [F_0(x_i) − (i − 1)/n].
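The D_n formula above translates directly into code. A minimal sketch (F_0 for the standard normal is built from the error function in Python's math module; the two-point check data at the bottom are hypothetical):

```python
import math

def std_normal_cdf(x):
    """F0(x) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(data, f0=std_normal_cdf):
    """One-sample KS statistic D_n = max(D_n+, D_n-) against the CDF f0."""
    x = sorted(data)
    n = len(x)
    # i runs 0..n-1 here, so (i + 1)/n and i/n are the 1-based i/n, (i-1)/n
    d_plus = max((i + 1) / n - f0(xi) for i, xi in enumerate(x))
    d_minus = max(f0(xi) - i / n for i, xi in enumerate(x))
    return max(d_plus, d_minus)

def ks_critical(n, c=1.224):
    """Approximate large-sample critical value c / sqrt(n);
    c = 1.224 at the 0.10 level, c = 1.358 at the 0.05 level."""
    return c / math.sqrt(n)

# Hypothetical two-point check against a uniform(0, 1) CDF: D_n = 0.25
d = ks_statistic([0.25, 0.75], f0=lambda u: u)
```

Applied to the 50 values of Table 1, ks_statistic gives the D_n that the text below compares with the approximate 0.10-level critical value ks_critical(50) ≈ 0.1731.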
Either the hypothesized distribution is the true distribution at all x, or it is not. Statisticians have produced tables of critical values for small sample sizes. If the sample size is at least moderately large (say, n ≥ 35), then asymptotic approximations might be used. For example, P(D_n > 1.358/√n) = 0.05 and P(D_n > 1.224/√n) = 0.10 for large n. Thus, 1.358/√n is an approximate critical value for the KS test at the 0.05 level of significance, and 1.224/√n is an approximate critical value for the KS test at the 0.10 level of significance.

In the example, the maximum gap between the EDF and the hypothesized standard normal distribution of Figure 1 is 0.0866. That is, the value of the KS statistic D_n is 0.0866. Because n = 50, the critical value for a test at the 0.10 significance level is approximately 1.224/√50 = 0.1731. Because D_n = 0.0866 < 0.1731, it could be concluded that the data are consistent with the hypothesis of a standard normal distribution at the 0.10 significance level. In fact, the p value is about 0.8472.

However, the data in Table 1 are in fact drawn from a true distribution that is not standard normal. The KS test incorrectly accepts (fails to reject) the null hypothesis that the true distribution is standard normal. The KS test commits a Type II error. The true distribution is uniform on the range (−1.7321, +1.7321). This uniform distribution has a mean of 0 and a variance of 1, just like the standard normal. The sample size of 50 is insufficient to distinguish between the hypothesized standard normal distribution and the true uniform distribution with the same mean and variance. The maximum KS gap between the true uniform distribution and the hypothesized standard normal distribution is only 0.0572. Because the critical value for the KS test using 50 data and a 0.10 significance level is about three times the true gap, it is very unlikely that an empirical gap of the magnitude required to reject the hypothesized standard normal distribution can be obtained. A much larger data set is required. The critical value (1.224/√n) for a test at the 0.10 significance level must be substantially smaller than the true gap (0.0572) for the KS test to have much power.

Evaluation

Several general lessons can be drawn from this example. First, it is difficult for the KS test to distinguish small differences between the hypothesized distribution and the true distribution unless a substantially larger sample size is used than is common in much empirical research. Second, failure to reject the hypothesized distribution might be only a weak endorsement of the hypothesized distribution. For example, if one tests the hypothesis that a set of regression residuals has a normal distribution, one should perhaps not take too much comfort in the failure of the KS test to reject the normal hypothesis. Third, if the research objective can be satisfied by testing a small set of specific parameters, it might be overkill to test the entire distribution. For example, one might want to know whether the mean of male scores differs from the mean of female scores. Then it would probably be better to test equality of means (e.g., by a t test) than to test the equality of distributions (by the KS test). The hypothesis of identical distributions is much stronger than the hypothesis of equal means and/or variances. As in the example, two distributions can have equal means and variances but not be identical. To have identical distributions means that all possible corresponding pairs of parameters are equal. The KS test has some sensitivity to all differences between distributions, but it achieves that breadth of sensitivity by sacrificing sensitivity to differences in specific parameters. Fourth, if it is not really important to distinguish distributions that differ by small gaps (such as 0.0572)—if only large gaps really matter—then the KS test might be quite satisfactory. In the example, this line of thought would imply researcher indifference to the shapes of the distributions (uniform vs. normal)—the uniform distribution on the range (−1.7321, +1.7321) would be considered "close enough" to normal for the intended purpose.

Thomas W. Sager

See also Distribution; Nonparametric Statistics

Further Readings

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Hollander, M., & Wolfe, D. A. (1999). Nonparametric statistical methods (2nd ed.). New York: Wiley-Interscience.
Khamis, H. J. (2000). The two-stage δ-corrected Kolmogorov-Smirnov test. Journal of Applied Statistics, 27, 439–450.
Lilliefors, H. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399–402.
Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730–737.
Stephens, M. A. (1986). Tests based on EDF statistics. In R. B. D'Agostino & M. A. Stephens (Eds.), Goodness-of-fit techniques. New York: Marcel Dekker.

KR-20

KR-20 (Kuder–Richardson Formula 20) is an index of the internal consistency reliability of a measurement instrument, such as a test, questionnaire, or inventory. Although it can be applied to any test item responses that are dichotomously scored, it is most often used in classical psychometric analysis of psychoeducational tests and, as such, is discussed with this perspective.

Values of KR-20 generally range from 0.0 to 1.0, with higher values representing a more internally consistent instrument. In very rare cases, typically with very small samples, values less than 0.0 can occur, which indicates an extremely unreliable measurement. A rule of thumb commonly applied in practice is that 0.7 is an acceptable value, or 0.8 for longer tests of 50 items or more. Squaring KR-20 provides an estimate of the proportion of score variance not resulting from error. Measurements with KR-20 < 0.7 have the majority of score variance resulting from error, which is unacceptable in most situations.

Internal consistency reliability is defined as the consistency, repeatability, or homogeneity of measurement given a set of item responses. Several approaches to reliability exist, and the approach relevant to a specific application depends on the sources of error that are of interest, with internal consistency being appropriate for error resulting from differing items.

KR-20 is calculated as

KR20 = [K/(K − 1)] [1 − (Σ_{i=1}^{K} p_i q_i)/σ_X²],   (1)

where K is the number of items or observations, p_i is the proportion of responses in the keyed direction for item i, q_i = 1 − p_i, and σ_X² is the variance of the raw summed scores. Therefore, KR-20 is a function of the number of items, item difficulty, and the variance of examinee raw scores. It is also a function of the item-total correlations (classical discrimination statistics) and increases as the average item-total correlation increases.

KR-20 produces results equivalent to coefficient α, which is another index of internal consistency, and can be considered a special case of α. KR-20 can be calculated only on dichotomous data, where each item in the measurement instrument is scored into only two categories. Examples of this include true/false, correct/incorrect, yes/no, and present/absent. Coefficient α also can be calculated on polytomous data, that is, data with more than two levels. A common example of polytomous data is a Likert-type rating scale.

Like α, KR-20 can be described as the mean of all possible split-half reliability coefficients based on the Flanagan–Rulon approach of split-half reliability. An additional interpretation is derived from Formula 1: The term p_i q_i represents the variance of each item. If this is considered error variance, then the sum of the item variances divided by the total variance in scores presents the proportion of variance resulting from error. Subtracting this quantity from 1 translates it into the proportion of variance not resulting from error, assuming there is no source of error other than the random error present in the process of an examinee responding to each item.

G. Frederic Kuder and Marion Richardson also developed a simplification of KR-20 called KR-21, which assumes that the item difficulties are equivalent. KR-21 allows us to substitute the mean of the p_i and q_i into Formula 1 for p_i and q_i, which simplifies the calculation of the reliability.

An important application of KR-20 is the calculation of the classical standard error of measurement (SEM) for a measurement. The SEM is

SEM = s_X √(1 − KR20),   (2)

where s_X is the standard deviation of the raw scores in the sample. Note that this relationship is inverse; with s_X held constant, as KR-20 increases,
the SEM decreases. Within classical test theory, having a more reliable measurement implies a smaller SEM for all examinees.

The following data set is an example of the calculation of KR-20 with 5 examinees and 10 items.

                         Item
Person     1    2    3    4    5    6    7    8    9    10     X
1          1    1    1    1    0    1    1    1    1    1      9
2          1    0    1    1    1    0    1    1    0    1      7
3          1    0    1    0    0    1    1    0    1    1      6
4          0    0    1    1    0    1    0    0    1    1      5
5          1    1    1    1    1    1    1    0    1    0      8
p_i:      0.8  0.2  1.0  0.8  0.4  0.8  0.8  0.4  0.8  0.8
q_i:      0.2  0.8  0.0  0.2  0.6  0.2  0.2  0.6  0.2  0.2
p_i × q_i: 0.16 0.16 0.0  0.16 0.24 0.16 0.16 0.24 0.16 0.16

The variance of the raw scores (X) in the final column is 2.5, resulting in

KR-20 = [10/(10 − 1)] [1 − 1.6/2.5] = 0.4

as the KR-20 estimate of reliability.

Nathan A. Thompson

See also Classical Test Theory; Coefficient Alpha; Internal Consistency Reliability; Reliability

Further Readings

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98–104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
de Gruijter, D. N. M., & van der Kamp, L. J. T. (2007). Statistical test theory for the behavioral sciences. Boca Raton, FL: CRC Press.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160.

KRIPPENDORFF'S ALPHA

Krippendorff's alpha (α) is a statistical measure of the agreement among observers, measuring devices, or coders of data, designed to indicate their reliability. As a general measure, it is applicable to data on various levels of measurement (metrics) and includes some known coefficients as special cases. As a statistical measure, it maps samples from a population of data into a single chance-corrected coefficient, a scale, indicating the extent to which the population of data can be relied on or trusted in subsequent analyses. Alpha equates reliability with the reproducibility of the data-generating process, measured by the agreement on what the data in question refer to or mean. Typical applications of α are content analyses where volumes of text need to be read and categorized, interview responses that require scaling or ranking before they can be treated statistically, or estimates of political or economic variables.

Reliability Data

Data are considered reliable when researchers have reasons to be confident that their data represent real phenomena in the world outside their project, or are not polluted by circumstances that are extraneous to the process designed to generate them. This confidence erodes with the emergence of disagreements, for example, among human coders regarding how they judge, categorize, or score given units of analysis, in the extreme, when their accounts of what they see or read are random. To establish reliability requires duplications of the data-making efforts by an ideally large number of coders. Figure 1 represents reliability data in their most basic or canonical form, as a matrix of m coders by r units, containing the values c_iu assigned by coder i to unit u.

Figure 1 (reliability data in canonical form):

Units:       1     2     3   . .   u   . . . . . . .   r
Coders:  1  c_11   .     .   . .  c_1u . . . . . . .  c_1r
         :   :                     :                    :
         i  c_i1   .     .   . .  c_iu . . . . . . .  c_ir
         :   :                     :                    :
         m  c_m1   .     .   . .  c_mu . . . . . . .  c_mr
            m_1    .     .   . .  m_u  . . . . . . .  m_r
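Formula 1 of the KR-20 entry lends itself to a direct implementation. The following sketch is an illustration only (the small score matrix here is hypothetical, not the entry's data); like the worked example, it uses the sample variance of the raw summed scores, with an n − 1 denominator:

```python
def kr20(scores):
    """KR-20 (Formula 1) for a persons-by-items matrix of 0/1 scores."""
    n = len(scores)      # number of examinees
    k = len(scores[0])   # number of items, K
    # p_i: proportion answering item i in the keyed direction; q_i = 1 - p_i
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    sum_pq = sum(p_j * (1 - p_j) for p_j in p)   # sum of the item variances
    totals = [sum(row) for row in scores]        # raw summed scores X
    mean = sum(totals) / n
    var_x = sum((t - mean) ** 2 for t in totals) / (n - 1)
    return (k / (k - 1)) * (1 - sum_pq / var_x)

# Hypothetical 4-examinee, 3-item data set (not from the entry)
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
```

For this matrix, Σ p_i q_i = 0.625 and the variance of the totals is 5/3, so KR-20 = (3/2)(1 − 0.375) = 0.9375.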
α_ratio in ratio data such as proportions or absolute numbers, α_polar in data recorded in bipolar opposite scales, and α_circular in data whose values constitute a closed circle or recursions.

The earlier expression for α in terms of Figure 1 is computationally inefficient and can be simplified in terms of a conceptually more convenient coincidence matrix representation, shown in Figure 2, summing the reliability data in Figure 1 as indicated. Coincidence matrices take advantage of the fact that reliability does not depend on the identity of coders, only on whether pairs of values match and what it means if they do not match, and on estimates of how these values are distributed in the population of data whose reliability is in question. Therefore, they tabulate coincidences without reference to coders. Coincidence matrices should not be confused with the more familiar contingency matrices that cross-tabulate units of analysis as judged or responded to by two coders (not the values they jointly generate).

Figure 2 (coincidence matrix):

Categories:  1     .    k     .  .
        1   n_11   .   n_1k   .  .   n_1·
        .    .     .    .     .  .    .
        c   n_c1   .   n_ck   .  .   n_c·
        .    .     .    .     .  .    .
            n_·1   .   n_·k   .  .   n_·· = number of values used by all coders

The matrix tabulates all pairs of values that are pairable within units. The distribution of the marginal sums, n_c· and n_·c, is the best estimate of the otherwise unknown distribution of values in the population of data whose reliability is in question. In coincidence matrix terms, α becomes

α_metric = 1 − (Σ_c Σ_k n_ck δ²_ck) / (Σ_c Σ_k [n_c· n_·k/(n_·· − 1)] δ²_ck),

where δ²_ck is the difference function appropriate to the metric of the data. This expression might be simplified with reference to particular metrics, for example, for nominal and binary data:

α_nominal = 1 − (Σ_c Σ_{k≠c} n_ck) / (Σ_c Σ_{k≠c} n_c· n_·k/(n_·· − 1))
          = [Σ_c n_cc − Σ_c n_c·(n_·c − 1)/(n_·· − 1)] / [n_·· − Σ_c n_c·(n_·c − 1)/(n_·· − 1)],

α_binary = 1 − (n_·· − 1) n_{c≠k}/(n_c· n_·k).

It should be mentioned that the family of α coefficients also includes versions for coding units with multiple values as well as for unitizing continua, for example, of texts taken as character strings or tape recordings. These are not discussed here.

Figure 3 (example): Three coders, h, i, and j, assign the categories 1 through 4 to 11 units; the number of coders valuing unit u, m_u, is 3 for most units, 2 for two units, and 1 for one unit, so that n_·· = 28 values are pairable within units. The resulting coincidence matrix has the marginal sums n_c· = 4, 8, 7, and 9 for categories 1 through 4.
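The nominal-metric α above can be computed from raw reliability data in a few lines. The sketch below is an illustration, not part of the entry; it assumes the standard construction of the coincidence matrix, in which each ordered pair of values within a unit contributes 1/(m_u − 1) to the cell count:

```python
from itertools import permutations

def alpha_nominal(units):
    """Krippendorff's alpha, nominal metric.

    `units` corresponds to Figure 1 read column by column: one list of
    coder-assigned values per unit. Units with fewer than two values are
    not pairable within units and are skipped.
    """
    o = {}      # coincidence counts o[(c, k)]
    n_c = {}    # marginal sums n_c
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(range(m), 2):
            pair = (values[a], values[b])
            o[pair] = o.get(pair, 0.0) + 1.0 / (m - 1)
        for v in values:
            n_c[v] = n_c.get(v, 0.0) + 1.0
    n = sum(n_c.values())  # n_..
    d_observed = sum(count for (c, k), count in o.items() if c != k)
    d_expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_observed / d_expected
```

With perfect agreement all off-diagonal coincidences vanish and α = 1; for two coders and large samples the value approaches Scott's π.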
chance. The remaining values exhibit some agreement but also much uncertainty as to what the coded units are. The researcher would not know. In the coincidence matrix, one might notice disagreements to follow a pattern. They occur exclusively near the diagonal of perfect agreements. There are no disagreements between extreme values, 1-4, or of the 1-3 and 2-4 kind. This pattern of disagreement would be expected in interval data. When the reliability data in Figure 3 are treated as interval data, α_interval = 0.877. The interval α takes into account the proximity of the mismatching scale values—irrelevant in nominal data and appropriately ignored by α_nominal. Hence, for the data in Figure 3: α_nominal < α_interval. Had the disagreements been scattered randomly throughout the off-diagonal cells of the coincidence matrix, α_nominal and α_interval would not differ. Had disagreement been predominantly between the extreme values of the scale, e.g., 1-4, α_nominal would have exceeded α_interval. This property of α provides the researcher with a diagnostic device to establish how coders use the given values.

Statistical Properties of Alpha

A common mistake is to accept data as reliable when the null hypothesis that agreement results from chance fails. This test is seriously flawed as far as reliability is concerned. Reliable data need to contain no or only statistically insignificant disagreements. Acknowledging this requirement, a distribution of α offers two statistical indices: (1) α's confidence limits, low α and high α, at a chosen level of statistical significance p; and, more importantly, (2) the probability q that the measured α fails to exceed the min α required for data to be taken as sufficiently reliable. The distribution of α becomes narrower not only with increasing numbers of units sampled but also with increasing numbers of coders employed in the coding process.

The choice of min α depends on the validity requirements of research undertaken with imperfect data. In academic research, it is common to aim for α ≥ 0.9 but require α ≥ 0.8 and accept data with α between 0.666 and 0.800 only to draw tentative conclusions. When human lives or valuable resources are at stake, min α must be set higher.

To obtain the confidence limits for α at p and probabilities q for a chosen min α, the distributions of α_metric are obtained by bootstrapping in preference to mere mathematical approximations.

Agreement Coefficients Embraced by Alpha

Alpha generalizes several known coefficients. It is defined to bring coefficients for different metrics but of the same makeup under the same roof. Alpha is applicable to any number of coders, which includes coefficients defined only for two. Alpha has no problem with missing data, as provided in Figure 1, which includes complete m × r data as a special case. Alpha corrects for small sample sizes, which includes the extreme of very large samples of data.

When data are nominal, generated by two coders, and consist of large sample sizes, α_nominal equals Scott's π. Scott's π, a popular and widely used coefficient in content analysis and survey research, conforms to α's conception of chance. When data are ordinal, generated by two coders, and very large, α_ordinal equals Spearman's rank
correlation coefficient ρ without ties in ranks. When data are interval, generated by two coders, and numerically large, α_interval equals Pearson's intraclass correlation coefficient r_ii. The intraclass correlation is the product moment correlation coefficient applied to symmetrical coincidence matrices rather than to asymmetrical contingency matrices. There is a generalization of Scott's π to larger numbers of coders by Joseph Fleiss, who thought he was generalizing Cohen's κ, renamed K by Sidney Siegel and John Castellan. K equals α_nominal for a fixed number of coders with complete nominal data and very large, theoretically infinite sample sizes. Recently, there have been two close reinventions of α, one by Kenneth Berry and Paul Mielke and one by Michael Fay.

Agreement Coefficients Unsuitable as Indices of Data Reliability

Correlation coefficients for interval data and association coefficients for nominal data measure dependencies, statistical associations, between variables or coders, not agreements, and therefore cannot serve as measures of data reliability. In systems of correlations among many variables, the correlations among the same variables are often called reliabilities. They do not measure agreement, however, and cannot assess data reliability.

Percent agreement, limited to nominal data generated by two coders, varies from 0% to 100%, is the more difficult to achieve the more values are available for coding, and provides no indication about when reliability is absent. Percent agreement is not interpretable as a reliability scale—unless corrected for chance, which is what Scott's π does.

Cohen's 1960 κ (kappa), which also is limited to nominal data, two coders, and large sample sizes, has the undesirable property of counting systematic disagreement among coders as agreement. This is evident in unequal marginal distributions of categories in contingency matrices, which rewards coders who disagree on their use of categories with higher values. Figure 4 shows two numerical examples of reliability data, tabulated in the contingency matrices between two coders i and j in whose terms κ is originally defined.

Figure 4 Example of Kappa Adding Systematic Disagreements to the Reliability It Claims to Measure

  (with systematic disagreement)      (without systematic disagreement)
             Coder i                             Coder i
           c_i    k_i                          c_i    k_i
Coder j
  c_j      100    200    300          c_j      100    100    200
  k_j        0    100    100          k_j      100    100    200
           100    300    400                   200    200    400

Both examples show 50% agreement. They differ in their marginal distributions of categories. In the left example, data show coder i to prefer category k to category c at a ratio of 3:1, whereas coder j exhibits the opposite preference for c over k at the rate of 3:1—a systematic disagreement, absent in the data in the right example. The two examples have the same number of disagreements. Yet, the example with systematic disagreements measures κ = 0.200, whereas the one without that systematic disagreement measures κ = 0.000. In both examples, α = 0.001. When sample sizes become large, α for two coders converges to Scott's π, at which point α = π = 0.000. Evidently, Cohen's κ gives "agreement credit" for this systematic disagreement, whereas π and α do not. The reason for κ's mistaken account of these systematic disagreements lies in Cohen's adoption of statistical independence between two coders as its conception of chance. This is customary when measuring
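The contrast that the Figure 4 examples demonstrate can be checked directly. The following sketch is an illustration, not part of the entry; the cell layout of the two tables is reconstructed here from the marginal ratios and the 50% agreement described in the text:

```python
def cohens_kappa(table):
    """Cohen's kappa for a 2x2 contingency table [[n_cc, n_ck], [n_kc, n_kk]]."""
    n = sum(sum(row) for row in table)
    p_o = (table[0][0] + table[1][1]) / n                 # observed agreement
    row = [sum(r) / n for r in table]                     # coder j's marginals
    col = [sum(table[i][j] for i in range(2)) / n for j in range(2)]  # coder i's
    p_e = row[0] * col[0] + row[1] * col[1]               # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Left example: 3:1 marginal preferences in opposite directions (reconstructed)
with_systematic = [[100, 200], [0, 100]]
# Right example: same 50% agreement, equal marginals
without_systematic = [[100, 100], [100, 100]]
```

Kappa rewards the systematic disagreement (κ = 0.200) while the balanced table yields κ = 0.000, which is exactly the behavior the entry objects to.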
correlations or associations but has nothing to do with assigning units to categories. By contrast, in π and α, chance is defined as the statistical independence between the set of units coded and the values used to describe them. The margins of coincidence matrices estimate the distribution of values occurring in the population whose reliability is in question. The two marginal distributions of values in contingency matrices, by contrast, refer to coder preferences, not to population estimates. Notwithstanding its popularity, Cohen's κ is inappropriate when the reliability of data is to be assessed.

Finally, Cronbach's alpha for interval data and Kuder and Richardson's Formula-20 (KR-20) for binary data, which are widely used in psychometric and educational research, aim to measure the reliability of psychological tests by correlating the test results among multiple subjects. As Jum Nunnally and Ira Bernstein have observed, systematic errors are unimportant when studying individual differences. However, systematically biased coders reduce the reliability of the data they generate, and such disagreement must not be ignored. Perhaps for this reason, Cronbach's alpha is increasingly interpreted as a measure of the internal consistency of tests. It is not interpretable as an index of the reliability of coded data.

Klaus Krippendorff

See also Coefficient Alpha; Cohen's Kappa; Content Analysis; Interrater Reliability; KR-20; Replication; "Validity"

Further Readings

Berry, K. J., & Mielke, P. W., Jr. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48, 921–933.
Fay, M. P. (2005). Random marginal agreement coefficients: Rethinking the adjustments of chance when measuring agreement. Biostatistics, 6, 171–180.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77–89.
Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). Boston: McGraw-Hill.

KRUSKAL–WALLIS TEST

The Kruskal–Wallis test is a nonparametric test to decide whether k independent samples are from different populations. Different samples almost always show variation regarding their sample values. This might be a result of chance (i.e., sampling error) if the samples are drawn from the same population, or it might be a result of a genuine population difference (e.g., as a result of a different treatment of the samples). Usually the decision between these alternatives is made by a one-way analysis of variance (ANOVA). But in cases where the conditions of an ANOVA are not fulfilled, the Kruskal–Wallis test is an alternative approach because it is a nonparametric method; that is, it does not rely on the assumption that the data are drawn from a particular probability distribution (e.g., the normal distribution).

Related nonparametric tests are the Mann–Whitney U test for only k = 2 independent samples, the Wilcoxon signed rank test for k = 2 paired samples, and the Friedman test for k > 2 paired samples (repeated measurement), as shown in Table 1.

The test is named after William H. Kruskal and W. Allen Wallis and was first published in the Journal of the American Statistical Association in 1952. Kruskal and Wallis termed the test the H test; sometimes the test is also named one-way analysis of variance by ranks.

Table 1 Nonparametric Tests to Decide Whether k Samples Are From Different Populations

                        k = 2                        k > 2
Independent samples     Mann–Whitney U test          Kruskal–Wallis test
Paired samples          Wilcoxon signed rank test    Friedman test
Table 2 Example Data, N = 15 Observations From k = 3 Different Samples (Groups)

      Group 1               Group 2               Group 3
Observation  Rank      Observation  Rank      Observation  Rank
   2400       14          1280       4.5         1280       4.5
   1860       10          1690       9           1090       1
   2240       13          1890      11           2220      12
   1310        7          1100       2           1310       7
   2700       15          1210       3           1310       7
   R1 = 59                R2 = 29.5              R3 = 31.5

Notes: Next to each observation, the individual rank for this observation can be seen. Ranks are added up per sample to a rank sum termed R_i.

In the case of shared ranks (ties), the according mean rank value has to be assigned. Next a sum of ranks R_i for all k samples (from i = 1 to i = k) has to be computed. The according number of observations in each sample is denoted by N_i, the number of all observations by N. The Kruskal–Wallis test statistic H is computed according to

H = [12/(N(N + 1))] Σ_{i=1}^{k} R_i²/N_i − 3(N + 1).

For larger samples, H is approximately chi-square distributed with k − 1 degrees of freedom. For smaller samples, an exact test has to be performed, and the test statistic H has to be compared with critical values in tables, which can be found in statistics books and on the Internet. (The tables provided by Kruskal and Wallis in the Journal of the American Statistical Association, 1952, 47, 614–617, contain some errors; errata can be found in the Journal of the American Statistical Association, 1953, 48, 910.) These tables are based on a full permutation of all possible rank distributions for a certain case. By this technique, an empirical distribution of the test criterion H is obtained by a full permutation. Next the position of the obtained H within this distribution can be determined. The according p value reflects the cumulative probability of H to obtain this or even a larger value by chance alone. There is no consistent opinion on what exactly forms a small sample. Most authors recommend performing the exact test for k = 3 and N_i ≤ 8 observations per sample, for k = 4 and N_i ≤ 4, and for k = 5 and N_i ≤ 3.

In the example of Table 2, two observations share the value 1280 and would receive ranks 4 and 5, so both receive rank 4.5. Furthermore, three samples had a value of 1310 and would have received ranks from 6 to 8. Now they get rank 7 as the mean of 6, 7, and 8. In the next step, the sum of ranks (R_1, R_2, R_3) for each group is calculated. The overall sum of ranks is N(N + 1)/2. In the example case, this is 15 × 16/2 = 120. As a first control, the sums of ranks for all groups should add up to this same value: R_1 + R_2 + R_3 = 59 + 29.5 + 31.5 = 120. Distributing the ranks among the three groups randomly, each rank sum would be about 120/3 = 40. The idea is to measure and add the squared deviations from this expectancy value. Applying the formula for H to the example data yields

H = [12/(15 × 16)] (59²/5 + 29.5²/5 + 31.5²/5) − 3(15 + 1) = 5.435.

Correction for Ties

When ties (i.e., shared ranks) are involved in the data, there is a possibility of correcting for this fact when computing H:

H_corr = H/C,

thereby C is computed by

C = 1 − [Σ_{i=1}^{m} (t_i³ − t_i)]/(N³ − N).

In this formula, m stands for the number of ties that occurred, and t_i stands for the number of tied ranks in a specific tie i. In the preceding example, there were m = 2 ties, one for the ranks 4 to 5, and another one for the ranks ranging from 6 to 8. For the first tie, t_1 = 2, since two observations are tied; for the second tie, t_2 = 3, because three observations are identical. Thus, we compute the following correction coefficient:

C = 1 − [(2³ − 2) + (3³ − 3)]/(15³ − 15) = 0.991;

with this coefficient, H_corr calculates to

H_corr = 5.435/0.991 = 5.484.

Several issues might be noted as seen in this example. The correction coefficient will always be smaller than one, and thus, H will always increase by this correction formula. If the null hypothesis is already rejected by an uncorrected H, any further correction might strengthen the significance of the result, but it will never result in not rejecting the null hypothesis. Furthermore, the correction of H resulting from this computation is very small, even though 5 out of 15 (33%) of the observations in the example were tied. Even in this case, where the uncorrected H was very close to significance, the correction was negligible. From this perspective, it is only necessary to apply this correction if N is very small or if the number of tied observations is relatively large compared with N; some authors recommend here a ratio of 25%.

Assumptions

The Kruskal–Wallis test does not assume a normal distribution of the data. Thus, whenever the requirement for a one-way ANOVA to have normally distributed data is not met, the Kruskal–Wallis test can be applied instead. Compared with the F test, the Kruskal–Wallis test is reported to have an asymptotic efficiency of 95.5%.

However, several other assumptions have to be met for the Kruskal–Wallis test:

1. Variables must have at least an ordinal level (i.e., rank-ordered data).
2. All samples have to be random samples, and the samples have to be mutually independent.

3. Variables need to be continuous, although a moderate number of ties is tolerable, as shown previously.

4. The populations from which the different samples are drawn should only differ regarding their central tendencies but not regarding their overall shape. This means that the populations might differ, for example, in their medians or means, but not in their dispersions or distributional shape (such as, e.g., skewness). If this assumption is violated by populations of dramatically different shapes, the test loses its consistency (i.e., a rejection of the null hypothesis by increasing N is not anymore guaranteed in a case where the null hypothesis is not valid).

Stefan Schmidt

See also Analysis of Variance (ANOVA); Friedman Test; Mann–Whitney U Test; Nonparametric Statistics; Wilcoxon Rank Sum Test

Further Readings

Conover, W. J. (1971). Practical nonparametric statistics. New York: Wiley.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583–621.
Kruskal, W. H., & Wallis, W. A. (1953). Errata: Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 48, 907–911.

KURTOSIS

Density functions are used to describe the distribution of quantitative variables. Kurtosis is a characteristic of the shape of the density function related to both the center and the tails. Distributions with density functions that have significantly more mass toward the center and in the tails than the normal distribution are said to have high kurtosis. Kurtosis is invariant under changes in location and scale; thus, kurtosis remains the same after a change in units or the standardization of data.

There are several alternative ways of measuring kurtosis; they differ in their sensitivity to the tails of the distribution and to the presence of outliers. Some tests of normality are based on the comparison of the skewness and kurtosis of the data with the values corresponding to a normal distribution. Tools to do inference about means and variances, many of them developed under the assumption of normality, see their performance affected when applied to data from a distribution with high kurtosis.

The next two sections focus on kurtosis of theoretical distributions, and the last two deal with kurtosis in the data analysis context.

Comparing Distributions in Terms of Kurtosis

A distribution such as the Laplace is said to have higher kurtosis than the normal distribution because it has more mass toward the center and heavier tails [see Figure 1(a)]. To visually compare the density functions of two symmetric distributions in terms of kurtosis, these should have the same center and variance. Figure 1(b) displays the corresponding cumulative versions or distribution functions (CDFs). Willem R. van Zwet defined in 1964 a criterion to compare and order symmetric distributions based on their CDFs. According to this criterion, the normal has indeed no larger kurtosis than the Laplace distribution. However, not all the symmetric distributions are ordered.

Measuring Kurtosis in Distributions

In Greek, kurtos means convex; the mathematician Heron in the first century used the word kurtosis to mean curvature. Kurtosis was defined, as a statistical term, by Karl Pearson around 1905 as the measure

β₂ = E(x − μ)⁴/σ⁴

to compare other distributions with the normal distribution in terms of the frequency toward the mean μ (σ is the standard deviation, and β₂ = 3 for the normal distribution). It was later that ordering criteria based on the distribution functions were defined; in addition, more flexible definitions acknowledging that kurtosis is related
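The full procedure of the Kruskal–Wallis entry above (mean ranks for ties, the H statistic, and the tie correction) can be sketched in a few lines. This is an illustration, not part of the entry; the data are the three groups of Table 2:

```python
from collections import Counter

def midranks(values):
    """Rank all N observations from 1 to N; tied values get the mean rank."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]

def kruskal_wallis(groups):
    """Return (H, H_corr), the H statistic and its tie-corrected version."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = midranks(pooled)
    h, start = 0.0, 0
    for g in groups:  # accumulate R_i^2 / N_i for each sample
        r_i = sum(ranks[start:start + len(g)])
        h += r_i * r_i / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    # correction coefficient C = 1 - sum(t_i^3 - t_i) / (N^3 - N)
    c = 1 - sum(t**3 - t for t in Counter(pooled).values()) / (n**3 - n)
    return h, h / c

group1 = [2400, 1860, 2240, 1310, 2700]
group2 = [1280, 1690, 1890, 1100, 1210]
group3 = [1280, 1090, 2220, 1310, 1310]
```

For these data the functions reproduce the entry's values, H = 5.435 and H_corr ≈ 5.484.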
Figure 1 Density Functions and CDFs for Normal and Laplace Distributions
Notes: (a) Density functions. (b) Cumulative distribution functions. [Curves for the normal and Laplace distributions, plotted over −5.0 ≤ x ≤ 5.0.]
Figure 2 Histogram and Normal Probability Plot for a Sample From a Laplace Distribution
Notes: (a) Histogram of data. (b) Probability plot of data. Normal = 95% CI. [Panel legend: Mean = −0.1063, StDev = 0.9480, n = 200, p value < 0.005.]
to both peakedness and tail weight were proposed, and it was accepted that kurtosis could be measured in several ways. New measures of kurtosis, to be considered valid, have to agree with the orderings defined over distributions by the criteria based on distribution functions. It is said that some kurtosis measures, such as β2, naturally have an averaging effect that prevents them from being as informative as the CDFs. Two distributions can have the same value of β2 and still look different. Because kurtosis is related to the peak and tails of a distribution, in the case of nonsymmetric distributions, kurtosis and skewness tend to be associated, particularly if they are represented by measures that are highly sensitive to the tails.

Two of the several kurtosis measures that have been defined as alternatives to β2 are as follows:

1. L-kurtosis, defined by J. R. M. Hosking in 1990 and widely used in the field of hydrology; τ4 = L4/L2 is a ratio of L moments, which are linear combinations of expected values of order statistics.

2. Quantile kurtosis, γ2(p), defined by Richard Groeneveld in 1998 for symmetric distributions only, is based on distances between certain quantiles. Other kurtosis measures defined in terms of quantiles, quartiles, and octiles also exist.

As an illustration, the values of these kurtosis measures are displayed in Table 1 for some distributions.

Studying Kurtosis From Data

Histograms and probability plots help to explore sample data. Figure 2 indicates that the data might come from a distribution with higher kurtosis than the normal. Statistical software generally calculates the excess β̂2 − 3 (for the data in Figure 2, β̂2 − 3 = 3.09). If the sample size n is small,

b2 = [Σ(x − x̄)^4 / n] / s^4

might take a small value even if the kurtosis of the population is very high [the upper bound for b2 is n − 2 + 1/(n − 1)]. Adjusted estimators exist to reduce the bias, at least in the case of nearly normal distributions. A commonly used adjusted estimator of excess is

[n(n + 1) / ((n − 1)(n − 2)(n − 3))] · [Σ(x − x̄)^4 / s^4] − [3(n − 1)^2 / ((n − 2)(n − 3))].

A single distant outlier can dramatically change the value of β̂2.

Effects of High Kurtosis

The variance of the sample variance is related to β2. The power of some tests for the equality of variances is affected by high kurtosis. For example, when testing the hypothesis of equal variances for two populations based on two independent samples, the power of the Levene test is lower if the distribution of the populations is symmetric but with higher kurtosis than if the samples come from normal distributions. The performance of the t test for the population mean is also affected under situations of high kurtosis. Van Zwet proved that, when working with symmetric distributions, the median is more efficient than the mean as estimator of the center of the distribution if the latter has very high kurtosis.

Edith Seier
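As a concrete illustration (a sketch, not part of the original entry), the estimators discussed above can be written in plain Python. The function names are ours: `sample_kurtosis` is b2, `adjusted_excess_kurtosis` is the adjusted estimator of excess quoted above, and `sample_l_kurtosis` is Hosking's sample version of τ4 computed from probability-weighted moments.

```python
def sample_kurtosis(xs):
    """b2 = [sum((x - mean)^4)/n] / s^4, with s^2 the divide-by-n variance.
    Its upper bound is n - 2 + 1/(n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / s2 ** 2

def adjusted_excess_kurtosis(xs):
    """Commonly used adjusted estimator of excess kurtosis, with s^2 the
    divide-by-(n - 1) variance; requires n > 3."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    quartic = sum((x - m) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * quartic / s2 ** 2 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

def sample_l_kurtosis(xs):
    """Sample L-kurtosis t4 = l4/l2 (Hosking, 1990), computed from the
    unbiased probability-weighted moments b0..b3 of the ordered sample."""
    x = sorted(xs)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    b3 = (sum(i * (i - 1) * (i - 2) * x[i] for i in range(n))
          / (n * (n - 1) * (n - 2) * (n - 3)))
    l2 = 2 * b1 - b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l4 / l2
```

For the five points −2, −1, 0, 1, 2, b2 = 1.7 and the sample L-kurtosis is 0; for [0, 0, 0, 0, 1], b2 equals its upper bound n − 2 + 1/(n − 1) = 3.25, showing how a single distant point dominates the moment-based estimate.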
See also Distribution; Median; Normal Distribution; Student's t Test; Variance

Further Readings

Balanda, K. P., & MacGillivray, H. L. (1988). Kurtosis: A critical review. American Statistician, 42, 111–119.
Bonett, D. G., & Seier, E. (2002). A test of normality with high uniform power. Computational Statistics and Data Analysis, 40, 435–445.
Byers, R. H. (2000). On the maximum of the standardized fourth moment. InterStat, January.
Hosking, J. R. M. (1992). Moments or L moments? An example comparing two measures of distributional shape. American Statistician, 46, 186–199.
Ruppert, D. (1987). What is kurtosis? An influence function approach. American Statistician, 41, 1–5.
Seier, E., & Bonett, D. G. (2003). Two families of kurtosis measures. Metrika, 58, 59–70.
L
L'ABBÉ PLOT

The L'Abbé plot is one of several graphs commonly used to display data visually in a meta-analysis of clinical trials that compare a treatment and a control intervention. It is basically a scatterplot of the results of individual studies, with the risk in the treatment group on the vertical axis and the risk in the control group on the horizontal axis. This plot was advocated in 1987 by Kristin L'Abbé and colleagues for visually showing variations in observed results across individual trials in meta-analysis. This entry briefly discusses meta-analysis before addressing the usefulness, limitations, and inappropriate uses of the L'Abbé plot.

Meta-Analysis

To understand what the L'Abbé plot is, it is necessary to have a discussion about meta-analysis. Briefly, meta-analysis is a statistical method to provide a summary estimate by combining the results of many similar studies. A hypothesized meta-analysis of 10 clinical trials is used here to illustrate the use of the L'Abbé plot. The most commonly used graph in meta-analysis is the forest plot (as shown in Figure 1) to display data from individual trials and the summary estimate (including point estimates and 95% confidence intervals). The precision or statistical power of the summary estimate will be much improved by combining the results of many small studies. In this hypothesized meta-analysis, the pooled estimate of relative risk is 0.72 (95% confidence interval: 0.53–0.97), which suggests that the risk of lack of clinical improvement in the treatment group is statistically significantly lower than that in the control group. However, the results from the 10 trials vary considerably (Figure 1), and it is important to investigate why similar trials of the same intervention might yield different results.

Figure 2 is the L'Abbé plot for the hypothesized meta-analysis. The vertical axis shows the event rate (or risk) of a lack of clinical improvement in the treatment group, and the horizontal axis shows the event rate of a lack of clinical improvement in the control group. Each point represents the result of a trial, according to the corresponding event rates in the treatment and the control group. The size of the points is proportionate to the trial size or the precision of the result. The larger the sample size, the larger the point in Figure 2. However, it should be mentioned that smaller points might represent larger trials in a L'Abbé plot produced by some meta-analysis software.

The diagonal line (line A) in Figure 2 is called the equal line, indicating the same event rate between the two arms within a trial. That is, a trial point will lie on the equal line when the event rate in the treatment group equals that in the control group. Points below the equal line indicate that the risk in the treatment group is lower than that
Figure 1 A Hypothesized Meta-Analysis of 10 Clinical Trials Comparing a Treatment and a Control Intervention
Note: Outcome: lack of clinical improvement.
Figure 2 [L'Abbé plot of the hypothesized meta-analysis: Treatment Group Risk plotted against the control group event rate, with line A as the equal line.]

in meta-analysis. This overall RR line corresponds to a pooled relative risk of 0.72 in Figure 2. It would be expected that the points of most trials
patient and/or intervention characteristics might also be the causes of variation in results across trials. The effect of a treatment might be associated with the severity of illness, age, gender, or other patient characteristics. Trial results might vary because of different doses of medications, different intensity of interventions, different levels of training and experience of doctors, and other differences in settings or interventions.

The variation in results across studies is termed heterogeneity in the meta-analysis. Several graphical methods can be used for the investigation of heterogeneity in meta-analysis. The commonly used graphical methods are the forest plot, funnel plot, Galbraith plot, and the L'Abbé plot. Only estimates of relative effects (including relative risk, odds ratio, or risk difference) between the treatment and control group are displayed in the forest plot, funnel plot, and Galbraith plot. As compared with other graphical methods, an advantage of the L'Abbé plot in meta-analysis is that it can reveal not only the variations in estimated relative effects across individual studies but also the trial arms that are responsible for such differences. This advantage of the L'Abbé plot might help researchers and clinicians to identify the focus of the investigation of heterogeneity in meta-analysis.

In the hypothesized meta-analysis (Figure 2), the event rate varies greatly in both the control group (from 4.0% to 48.0%) and the treatment group (from 5.0% to 28.0%). The points of trials with relatively low event rates in the control group tend to lie above the overall RR line, and the points of trials with relatively high event rates in the control group tend to lie below the overall RR line. This suggests that variations in relative risk across trials might be mainly a result of different event rates in the control group. Therefore, the event rate in the control group might be associated with treatment effect in meta-analysis. In a real meta-analysis, this pattern of graphical distribution should be interpreted by considering other patient and/or intervention characteristics to investigate the possible causes of variations in results across trials.

Limitations

A shortcoming of many graphical methods is that the visual interpretation of data is subjective, and the same plot might be interpreted differently by different people. In addition, the same pattern of variations across studies revealed in a L'Abbé plot might have very different causes. The usefulness of the L'Abbé plot is also restricted by the available data reported in the primary studies in the meta-analysis. When the number of available studies in a meta-analysis is small and when data on important variables are not reported, the investigation of heterogeneity is unlikely to be fruitful.

The visual perception of variations across studies in a L'Abbé plot might be misleading because random variation in the distance between a study point and the overall RR line is associated with both the sample size of a trial and the event rate in the control group. Points of small trials are more likely to lie farther away from the overall RR line purely by chance. In addition, trials with a control event rate close to 50% will have greater random variation in the distance from the overall RR line. It is possible to adjust the distances between trial points and the overall RR line in a L'Abbé plot for the corresponding sample sizes and event rates in the control group, using a stochastic simulation approach. However, the stochastic simulation method is complex and cannot be used routinely in meta-analysis.

Inappropriate Uses

L'Abbé plots have been used in some meta-analyses to identify visually the trial results that are outliers according to the distance between a trial point and the overall RR line. Then, the identified outliers are excluded from the meta-analysis one by one until heterogeneity across studies is no longer statistically significant. However, this use of the L'Abbé plot is inappropriate for the following reasons. First, the exclusion of studies according to their results, not their design and other study characteristics, might introduce bias into the meta-analysis and reduce the power of statistical tests of heterogeneity. Second, the chance of revealing clinically important causes of heterogeneity might be missed simply by excluding studies from the meta-analysis without efforts to investigate reasons for the observed heterogeneity. In addition, different methods might identify different trials as outliers, and the exclusion of different studies might lead to different results of the same meta-analysis.

Another inappropriate use of the L'Abbé plot is to conduct a regression analysis of the event rate
in the treatment group against the event rate in the control group. If the result of such a regression analysis is used to examine the relation between treatment effect and the event rate in the control group, then misleading conclusions could be obtained. This is because of the problem of regression to the mean and random error in the estimated event rates.

Further Readings

Song, F. (1999). Exploring heterogeneity in meta-analysis: Is the L'Abbé plot useful? Journal of Clinical Epidemiology, 52, 725–730.
Song, F., Sheldon, T. A., Sutton, A. J., Abrams, K. R., & Jones, D. R. (2001). Methods for exploring heterogeneity in meta-analysis. Evaluation and the Health Professions, 24, 126–151.
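As a sketch of the quantities a L'Abbé plot displays (not from the entry itself), the helpers below compute arm-specific event rates, a Mantel–Haenszel pooled relative risk (one common pooling choice; the entry does not say how its 0.72 was obtained), and each trial's position relative to the equal line and the overall RR line. The trial counts and function names are hypothetical.

```python
def event_rates(trials):
    """(treatment rate, control rate) per trial; each trial is a dict with
    event counts (et, ec) and arm sizes (nt, nc)."""
    return [(t["et"] / t["nt"], t["ec"] / t["nc"]) for t in trials]

def mantel_haenszel_rr(trials):
    """Mantel-Haenszel pooled relative risk, treatment vs. control."""
    num = sum(t["et"] * t["nc"] / (t["nt"] + t["nc"]) for t in trials)
    den = sum(t["ec"] * t["nt"] / (t["nt"] + t["nc"]) for t in trials)
    return num / den

def labbe_positions(trials, pooled_rr):
    """Where each trial's point falls relative to the equal line
    (treatment rate = control rate) and the overall RR line
    (treatment rate = pooled_rr * control rate)."""
    return [{"below_equal_line": rt < rc,
             "below_rr_line": rt < pooled_rr * rc}
            for rt, rc in event_rates(trials)]

# Hypothetical trials (counts of lack of clinical improvement):
trials = [{"et": 5, "nt": 100, "ec": 10, "nc": 100},
          {"et": 20, "nt": 50, "ec": 10, "nc": 50}]
rr = mantel_haenszel_rr(trials)       # 1.25 for these made-up counts
flags = labbe_positions(trials, rr)
```

The first made-up trial lies below both lines (treatment looks better); the second lies above both, the kind of scatter the entry warns may reflect chance rather than real heterogeneity.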
concrete structures such as the U.S. Army can be studied experimentally. The results of laboratory research, with proper interpretation, can then be applied in businesses, armies, or other situations meeting the conditions of the theory being tested by the experiment.

Laboratories as Created Situations

The essential character of a laboratory experiment is that it creates an invented social situation that isolates theoretically important processes. Such a situation is usually unlike any naturally occurring situation so that complicated relationships can be disentangled. In an experiment, team partners might have to resolve many disagreements over a collective task, or they might be asked to decide whether to offer a small gift to people who always (or never) reciprocate. In such cases, an investigator is efficiently studying things that occur naturally only occasionally, or that are hard to observe in the complexity of normal social interaction.

Bringing research into a laboratory allows an investigator to simplify the complexity of social interaction to focus on the effects of one or a few social processes at a time. It also offers an opportunity to improve data collection greatly using video and sound recordings, introduce questionnaires at various points, and interview participants about their interpretations of the situation and of people's behavior in it.

Every element of the social structure, the interaction conditions, and the independent variables is included in the laboratory conditions because an investigator put it there. The same is true for the measurement operations used for the dependent variables. Well-designed experiments result from a thorough understanding of the theoretical principles to be tested, long-term planning, and careful attention to detail. Casually designed experiments often produce results that are difficult to interpret, either because it is not clear exactly what happened in them or because measurements seem to be affected by unanticipated, perhaps inconsistent, factors.

Strong Tests and Inferences

All laboratories simplify nature. Just as chemistry laboratories contain pure chemicals existing nowhere in nature, social science laboratories contain situations isolating one or a few social processes for detailed study. The gain from this focus is that experimental results are among the strongest for testing hypotheses.

Most hypotheses, whether derived from general theoretical principles or simply formulated ad hoc, have the form "If A then B." To test such a sentence, treat "A" as an independent variable and "B" as a dependent variable. Finding that B is present when A is also present gives some confidence that the hypothesis is correct, but of course the concern is that something else besides A better accounts for the presence of B. But when the copresence of A and B occurs in a laboratory, an investigator has had the opportunity to remove other possible candidates besides A from the situation. In a laboratory test, the experimental situation creates A and then measures to determine the existence or the extent of B.

Another concern in natural settings is the direction of causality. Finding A and B together is consistent with (a) A causes B; (b) B causes A; or (c) some other factor C causes both A and B. Although we cannot ever observe causation, even with laboratory data, such data can lend greater confidence in hypothesis (a). That is because an experimenter can introduce A before B occurs, thus making interpretation (b) unlikely. An experimenter can also simplify the laboratory situation to eliminate plausible Cs, even if some unknown factor might still be present that is affecting B. In general, the results from laboratory experiments help in assessing the directionality of causation and eliminating potential alternative explanations of observed outcomes.

Laboratories Abstract From Reality

Experimental research requires conceptualizing problems abstractly and generally. Many important questions in social science are concrete or unique, and therefore these questions do not lend themselves readily to laboratory experimentation. For instance, the number of homeless in the United States in a particular year is not a question for laboratory methods. However, effects of network structures, altruism, and motivational processes that might affect homelessness
Needless to say, the longer the trial and the more follow-up visits or interviews that are required, the worse the problem of attrition becomes. In some clinical trials, drop-out rates approach 50% of those who began the study.

LOCF is a method of data imputation, or "filling in the blanks," for data that are missing because of attrition. This allows the data for all participants to be used, ostensibly solving the two problems of reduced sample size and biased results. The method is quite simple and consists of replacing all missing values of the dependent variable with the last value that was recorded for that particular participant. The justification for using this technique is shown in Figure 1, where the left axis represents symptoms, and lower scores are better. If the effect of the treatment is to reduce symptoms, then LOCF assumes that the person will not improve any more after dropping out of the trial. Indeed, if the person discontinues very early, then there might not be any improvement noted at all. This most probably underestimates the actual degree of improvement experienced by the patient and, thus, is a conservative bias; that is, it works against the hypothesis that the intervention works. If the findings of the study are that the treatment does work, then the researcher can be even more confident of the results. The same logic applies if the goal of treatment is to increase the score on some scale; LOCF carries forward a smaller improvement.

Figure 1 [Plot of a participant's scores over time; legend: Recorded Values; vertical axis: Score.]

Problems

Counterbalancing these advantages of LOCF are several disadvantages. First, because all the missing values for an individual are replaced with the same number, the within-subject variability is artificially reduced. In turn, this reduces the estimate of the error and, because the within-person error contributes to the denominator of any statistical test, it increases the likelihood of finding significance. Thus, rather than being conservative, LOCF actually might have a liberal bias and might lead to erroneously significant results.

Second, just as LOCF assumes no additional improvement for patients in the treatment condition, it also assumes that those in the comparison group will not change after they drop out of the trial. However, for many conditions, there is a very powerful placebo effect. In trials involving patients suffering from depression, up to 40% of those in the placebo arm of the study show significant improvement; and in studies of pain, this effect can be even stronger. Consequently, in underestimating the amount of change in the control group, LOCF again might have a positive bias, favoring rejection of the null hypothesis.

Finally, LOCF should never be used when the purpose of the intervention is to slow the rate of decline. For example, the so-called memory-enhancing drugs slow the rate of memory loss for patients suffering from mild or moderate dementia. In Figure 1, the left axis would now represent memory functioning, and thus lower scores are worse. If a person drops out of the study, then LOCF assumes no additional loss of functioning, which biases the results in favor of the treatment. In fact, the more people who drop out of the study, and the earlier the drop-outs occur, the better the drug looks. Consequently, LOCF introduces a very strong liberal bias, which significantly overestimates
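The imputation rule described in this entry can be sketched in a few lines of Python; `None` stands for a missed visit, and the function name is illustrative.

```python
def locf(scores):
    """Last observation carried forward: replace each missing value (None)
    with the most recent recorded value for that participant. Values
    missing before the first recorded visit stay missing."""
    imputed, last = [], None
    for v in scores:
        if v is not None:
            last = v
        imputed.append(last)
    return imputed

# A participant whose symptom scores were recorded at the first two
# visits and who then dropped out of the trial:
completed = locf([20, 15, None, None, None])  # -> [20, 15, 15, 15, 15]
```

Note how every imputed visit repeats the value 15, which is exactly the source of the artificially reduced within-subject variability discussed above.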
at least three data points. In essence, a regression line is fitted to each person's data, and the slope and intercept of the line become the predictor variables in another regression. This allows one to determine whether the average slope of the line differs between groups. This does not preserve as many cases as LOCF, because those who drop out with fewer than three data points cannot be analyzed, but latent growth modeling does not introduce the same biases as does LOCF.

David L. Streiner

See also Bias; Latent Growth Modeling; Missing Data, Imputation of

Further Readings

Molnar, F. J., Hutton, B., & Fergusson, D. (2008). Does analysis using "last observation carried forward" introduce bias in dementia research? Canadian Medical Association Journal, 179, 751–753.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.

LATENT GROWTH MODELING

Latent growth modeling refers to a set of procedures for conducting longitudinal analysis. Statisticians refer to these procedures as mixed models. Many social scientists label these methods as multilevel analyses, and the label of hierarchical linear models is used in education and related disciplines. These procedures can be useful with static data where an individual response might be nested in a family. Thus, a response might be explained by individual characteristics, such as personality traits, or by a family-level characteristic, such as family income.

Longitudinal applications differ from static applications in that there are repeated measurements of a variable for each individual. The repeated measurements are nested in the individual. Just as individuals in a family tend to be similar, repeated measurements for the same individual tend to be similar. This lack of independence is handled by mixed models.

This entry begins with discussions of fixed and random effects and of time-varying and time-invariant predictors. Next, approaches are described and an example of the modeling process is provided. Last, additional extensions of latent growth modeling and its use in future research are examined.

Fixed Versus Random Effects

To understand growth modeling, one needs to understand the concepts of fixed effects and random effects. In ordinary least-squares regression, a fixed intercept and a slope for each predictor are estimated. In growth modeling, it is often the case that each person has a different intercept and slope, which are called random effects. Consider a growth model of marital conflict reported by the mother across the first 12 months after the birth of a couple's first child. Conflict might be measured on a 0 to 10 scale right after the birth and then every 2 months for the first year. There are 7 time points (0, 2, . . . , 12), and a regression of the conflict scores on time might be done. Hypothetical results appear in Figure 1 in the graph labeled Both Intercept and Slope Are Fixed. The results reflect an intercept β0 = 2.5 and a slope β1 = 0.2. Thus, the conflict starts with an initial level of 2.5 and increases by 0.2 each month. By the 12th month, the conflict would be moderate, 2.5 + 0.2 × 12 = 4.9. These results are fixed effects. However, women might vary in both their intercept and their slope.

In contrast, the graph in Figure 1 labeled Random Intercept and Fixed Slope allows for differences in the initial level and results in parallel lines. Mother A has the same intercept and slope as the fixed-effects model, 2.5 and 0.2, respectively. All three have a slope of 0.2, but they vary in their intercept (starting point). This random intercept model, by providing for individual differences in the intercept, should fit the data for all the mothers better than the fixed model, but the requirement that all lines are parallel might be unreasonable. An alternative approach is illustrated in the graph labeled Fixed Intercept and Random Slope. Here, all the mothers have a fixed initial level of conflict, but they are allowed to have different slopes (growth rates).
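The fixed- and random-effects trajectories can be illustrated with a short Python sketch. Only the fixed effects 2.5 and 0.2 come from the entry's example; the variance components in `simulate_mothers` are invented for illustration.

```python
import random

def conflict_trajectory(intercept, slope, months=(0, 2, 4, 6, 8, 10, 12)):
    """Predicted conflict at each bimonthly visit: intercept + slope * month,
    with the slope per month so that 2.5 + 0.2 * 12 = 4.9 at month 12."""
    return [intercept + slope * m for m in months]

def simulate_mothers(n, seed=0):
    """Random-intercept, random-slope model: each mother draws her own
    intercept and slope around the fixed effects. The standard
    deviations 0.8 and 0.05 are illustrative, not from the entry."""
    rng = random.Random(seed)
    return [conflict_trajectory(2.5 + rng.gauss(0, 0.8),
                                0.2 + rng.gauss(0, 0.05))
            for _ in range(n)]
```

Calling `conflict_trajectory(2.5, 0.2)` reproduces the fully fixed line; setting one of the two perturbations in `simulate_mothers` to zero gives the random-intercept-only or random-slope-only variants described above.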
Figure 1 Hypothetical Growth Curves for Marital Conflict in First 12 Months After Birth of First Child
[Four panels of conflict (vertical axis) plotted against months since birth of first child (horizontal axis), including panels labeled Fixed Intercept and Random Slope and Random Intercept and Fixed Slope, with trajectories for Mothers A, B, and C.]
Finally, the graph labeled Random Intercept and Random Slope allows each mother to have her own intercept and her own slope. This graph is more complicated than the fully fixed graph, but it might be a better fit to the data and seems realistic. This model with a random intercept and a random slope allows each mother to have a different initial level and a different growth trajectory.

Time-Varying and Time-Invariant Predictors

When either or both the intercept and slope are treated as random effects, they are outcome variables that call for an explanation. Why do some mothers start with a high or low level of conflict? Why do mothers vary in their trajectory? The intercept and slope become dependent variables. Traditional regression analysis just estimates an intercept and slope; here, these estimates, when they are random effects, are treated as outcomes. Therefore, change in a variable is explained rather than just a static level of a variable.

Latent growth modeling allows researchers to explain such differences using two types of predictors (independent variables). These are called time-invariant covariates and time-varying covariates. First, the intercept or slope might depend on time-invariant covariates. These covariates are a constant value for each individual. For example, the mother's age at the birth and whether she was married at the time of birth are time invariant. Time-varying covariates, by contrast, can have different values from one time to the next. Suppose 6 months after the birth of a child, the father attends a childcare workshop and becomes a more engaged parent. This would not predict her
The next step is developing a simple latent changes for each unit change in the independent
growth model. The simplest is a linear growth variable. How is this translated to identify the
curve (called a curve, but a linear growth curve is latent slope growth factor? There are BMI mea-
actually a straight line requiring just an intercept surements for 7 consecutive years, 19972003.
factor and a linear slope factor). Figure 3 presents Because each of these is a 1-year change, load-
how a linear growth curve model can be drawn. ings of 0, 1, 2, 3, 4, 5, and 6 can be used, as
This figure is simpler than it might appear. The illustrated by the solid lines with an arrow going
oval labeled ‘‘intercept’’ is the latent intercept from the latent slope growth factor to each
growth factor. It represents the initial level of the year’s measurement of BMI. Other fixed loadings
growth curve. Based on the sample of 10 youth, might be appropriate. If no data were collected
one might guess this will have a value of just more in 2000 and 2002, there would be five waves
than 20. This value is the estimated initial and loadings of 0, 1, 2, 4, and 6 could be used,
Mintercept. The other oval, labeled ‘‘slope,’’ is the simply dropping the missing years. One might
latent slope growth factor. It represents how much want the intercept to represent the final level
the BMI increases (or decreases) each year. Using of the variable and use loadings of 6, 5, 4,
the sample of 10 youth, one might guess that this 3, 2, 1, and 0, or put the intercept in the
will be a small positive number, perhaps around middle using loadings of 3, 2, 1, 0, 1, 2,
0.5. This value is the Mslope. and 3. In the Mplus program, it is also possible
The sample of 10 youth indicates that there is for each participant to have a different time span
variation around both the mean latent intercept between measurements. John’s BMI98 (BMI
growth factor and the mean latent slope growth factor. This variance is the random-effect component and is represented by the circles above the ovals, labeled Ri (residual variance of the latent intercept growth factor) and Rs (residual variance of the latent slope growth factor). If one of these variances is not significantly greater than zero, then that factor could be treated as a fixed effect. The curved line with an arrow at each end connecting Ri and Rs is the correlation of the latent intercept growth factor and the latent slope growth factor. A positive correlation would indicate that people who start with a high BMI (intercept) have a more rapidly increasing BMI (slope) than people who start with a low BMI. Such a positive correlation is unfortunate both for youth who have a very high initial BMI (for whom a bigger slope is extremely problematic) and for youth with a very low BMI (who have a low or even negative slope).

How is the intercept identified? The intercept is often referred to as the constant. It represents the base value to which some amount is added or subtracted for each unit increase in the predictor. This constant base value is identified by having a fixed loading of 1.0 from it to each year's measurement of BMI. These lines are the dashed lines with an arrow going from the latent intercept to the individual BMI scores. The traditional meaning of a slope is how much a variable

measurement in 1998) might be 14 months after his first measurement if there was some delay in data collection. His BMI99 might be only 10 months after his second wave.

The observed measurements appear in the rectangular boxes and are labeled BMI97 to BMI03. Figure 3 has circles at the bottom labeled E97 to E03. These represent measurement error. SEM software varies widely in how it programs a figure like this. The key part of the program in Mplus is

intercept slope | BMI97@0 BMI98@1 BMI99@2 BMI00@3 BMI01@4 BMI02@5 BMI03@6;

The first name, intercept, which could be anything such as i or alpha, will always be the intercept. The second name, slope, which could be anything such as s or beta, will always be the linear slope. The logical "or bar," |, tells the program this is a growth curve. Each path from the intercept to the observed scores must be 1.0. One only needs to specify the loadings for the slope (BMI97 is set at 0, BMI98 is set at 1, etc.). The Mplus program reads this single line and knows that the model in Figure 3 is being estimated. It is possible to override the assumptions of the program, such as specifying that the residuals are not correlated or that some of the measurement errors are correlated. These assumptions depend on research goals and hypotheses.
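The model that the Mplus line specifies can be mimicked outside SEM software. The sketch below is not Mplus and not this entry's analysis; the population values in the comments are invented for illustration. It simulates BMI trajectories with a random intercept and slope, then recovers the mean intercept and slope by fitting one ordinary least squares line per person:

```python
import numpy as np

rng = np.random.default_rng(0)
n, waves = 500, 7                 # youths measured at 7 waves, 1997-2003
times = np.arange(waves)          # slope loadings 0..6, as in the Mplus line

# hypothetical population values, chosen only for illustration
mean_i, mean_s = 22.0, 0.8        # mean latent intercept and slope
sd_i, sd_s, r_is = 4.0, 0.3, 0.2  # random-effect SDs and their correlation

cov = np.array([[sd_i**2,            r_is * sd_i * sd_s],
                [r_is * sd_i * sd_s, sd_s**2           ]])
intercept, slope = rng.multivariate_normal([mean_i, mean_s], cov, size=n).T

# observed BMI = latent intercept + latent slope * time + error (E97..E03)
bmi = intercept[:, None] + slope[:, None] * times + rng.normal(0, 1.0, (n, waves))

# one OLS line per person recovers the latent factors up to error;
# averaging them estimates the fixed effects
per_person = np.polyfit(times, bmi.T, deg=1)   # row 0: slopes, row 1: intercepts
print(per_person[1].mean(), per_person[0].mean())
```

The averages come out near the simulated values of 22 and 0.8, and the spread of the individual intercepts and slopes reflects the random effects Ri and Rs; an SEM program such as Mplus estimates the same quantities simultaneously along with the error variances.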
decrease in BMI. Clearly, there is an important random effect for both the intercept and the slope. The covariance between the intercept and slope is .408, z = 5.559, p < .001, and the correlation is r = .208, z = 5.195, p < .001. [Mplus has a different test of significance for unstandardized and standardized coefficients; see Muthén (2007).] This correlation between the intercept and slope indicates exactly the area of concern, namely, that the youth who had the highest initial BMI scores also had the highest rate of growth in their BMI.

A plot of the linear latent growth curve can also be examined. In Figure 4, it can be observed that a straight line slightly overestimates the initial mean and slightly underestimates the final mean. This suggests that a quadratic can be added to the growth curve to capture the curvature in the observed data. When this is done, an excellent fit to the data is obtained.

Figure 4   Plot of BMI Growth Trajectory for Actual and Estimated Means

How Many Waves of Data Are Needed?

A latent growth model tries to reproduce the summary statistics describing the data. If there are just three waves of data, then the researcher would have three means, M97, M98, M99; three variances, Var(BMI97), Var(BMI98), and Var(BMI99); and three covariances, Cov(BMI97, BMI98), Cov(BMI97, BMI99), and Cov(BMI98, BMI99), for a total of nine pieces of information he or she is trying to reproduce. How many parameters are being estimated? With three waves, eight parameters are estimated (the intercept and slope means and variances, their covariance, and the three error variances E97 to E99), leaving 9 − 8 = 1 degree of freedom. This does not provide a very rigorous test of a model, but it demonstrates that it is possible to estimate a linear growth curve with just three waves of data.

What if there were four waves? If one counted the means, variances, and covariances, there would be 14 pieces of information instead of 9. However, only one more parameter, E00, would be estimated, so there would be 14 − 9 = 5 degrees of freedom. This gives a better test of a linear model. It also allows a quadratic term to be estimated to fit a curve. Adding a quadratic adds four parameters: Mquadratic, RQ, and the covariances of the quadratic with both the intercept and the linear growth factors. It is good to have four waves of data for a linear growth curve, although three is the minimum, and it is good to have at least five waves of data for a nonlinear growth curve, although four is the minimum.

Time-Invariant Covariates

Whenever there is a significant variance in the intercept or slope, these random effects should be explained. For example, whites and nonwhites might be compared on their BMI. In this example, after dropping Asians and Pacific Islanders, the nonwhites are primarily African Americans and Latinos. Whites might have a lower intercept and a flatter slope than nonwhites in their BMI. If this were true, then race/ethnicity would explain a portion of the random effects. Race/ethnicity is a time-invariant covariate.

Consider emotional problems as a covariate that might explain some of the variance in the random effects. If this is measured at age 12 and not again, it would be treated as a time-invariant covariate. Children who have a high score on emotional problems at age 12 might have a different growth trajectory than children who have a low score on emotional problems. Alternatively, if emotional problems are measured at each wave, it would be a time-varying covariate. In this section, race/ethnicity and age 12 emotional problems are considered as time-invariant covariates. There are other time-invariant covariates that should be considered, all of which would be measured just one time when the youth was 12 years old. For example, mothers' and fathers' BMI, knowledge of food choices, proximity of home to a fast-food restaurant, and many other time-invariant covariates could be measured.

Figure 5 shows the example model with these two covariates added. This has been called conditional latent trajectory modeling because the initial level and trajectory (slope) are conditional on other variables. White is a binary variable coded 0 for nonwhite and 1 for white. It is in a rectangle

Figure 5   Latent Growth Curve With Two Time-Invariant Covariates

Time-Varying Covariates

In Figure 6, time-invariant covariates are represented by the rectangle labeled W. This represents a vector of possible time-invariant covariates that will influence the growth trajectory. It is possible to extend this to include time-varying covariates. Time-varying covariates either are measured after the process has started or have a value that changes (hours of nutrition education or level of program fidelity) from wave to wave. Although output is not shown, Figure 6 illustrates the use of time-varying covariates. In Figure 6, the time-varying covariates A1 to A6 might be the number of hours of curriculum devoted to nutrition education each year.

Time-varying covariates do not have a direct influence on the intercept or slope. For example, the amount of nutrition education youth received in 2003 could influence neither their initial BMI in 1997 nor their growth rate in earlier years. Instead, the hours of curriculum devoted to nutrition education each year would provide a direct effect on their BMI that year. A year with a strong
have as much effect on the number of cigarettes smoked (count component). Finding such distinctions can provide a much better understanding of an intervention's effectiveness and can give ideas for how to improve the intervention.

The last of many possibilities to be mentioned here involves applications of growth mixture models. A sample can be treated as representing a single population when there might be multiple populations represented, and these multiple populations might have sharp differences. If one were to create a growth curve of abusive drinking behavior from age 18 to 37, one will find a growth curve that generally increases from age 18 to 23 and then decreases after that. However, this overall growth model might not fit the population very well. Why? What about people who never drink? These people have an intercept at or close to zero and a flat slope that is near zero. This is an identifiable population for which the overall growth curve is inapplicable. What about alcoholics? They might be similar to the overall pattern up to age 23, but then they do not decrease. Mixture models seek to find clusters of people who have homogeneous growth trajectories. Bengt Muthén and Linda Muthén applied a growth mixture model and were able to identify three clusters of people that we can label the normative group, the nondrinkers, and the likely alcoholics. Once group membership is identified, a profile analysis can be performed to evaluate how these groups differ on other variables. The same intervention that works for the normative group would not work for the alcoholics, and it is not cost effective to have an intervention on the group of nondrinkers.

Future Directions

Latent growth curve modeling is one of the most important advances in the treasure chest of research methods that has developed in the last 20 years. It allows researchers to focus on change, what explains the rate of change, and the consequences of change. It is applicable across a wide range of subject areas and can be applied to data of all levels of measurement. It is also an area of rapid development and will likely continue to change the way researchers work with longitudinal data.

Alan C. Acock

See also Growth Curve; Structural Equation Modeling

Further Readings

Barrett, P. (2006). Structural equation modeling: Adjudging model fit. Personality and Individual Differences, 42, 815–824.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Hoboken, NJ: John Wiley.
Center for Human Resources. (2007). NLSY97 user's guide: A guide to rounds 1–9 data. Washington, DC: U.S. Department of Labor.
Curran, P. J., & Hussong, A. M. (2003). The use of latent trajectory models in psychopathology research. Journal of Abnormal Psychology, 112, 526–544.
Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An introduction to latent variable growth curve modeling: Concepts, issues, and applications (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Latent growth curves. Retrieved July 2, 2009, from http://oregonstate.edu/~acock/growth
Mplus homepage. Retrieved January 27, 2010, from http://www.statmodel.com
Muthén, B. (2007). Mplus technical appendices: Standardized coefficients and their standard errors. Los Angeles. Retrieved January 27, 2010, from http://statmodel.com/download/techappen.pdf
Muthén, B., & Muthén, L. (2000). The development of heavy drinking and alcohol-related problems from ages 18 to 37 in a U.S. national sample. Journal of Studies of Alcohol, 70, 290–300.
Schreiber, J., Stage, R., King, J., Nora, A., & Barlow, E. (2006). Reporting structural equation modeling and confirmatory factor analysis results: A review. The Journal of Educational Research, 99, 323–337.
Yu, C. Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Unpublished dissertation). University of California, Los Angeles.

LATENT VARIABLE

A latent variable is a variable that cannot be observed. The presence of latent variables, however, can be detected by their effects on variables that are observable. Most constructs in research are latent variables. Consider the psychological construct of anxiety, for example. Any single observable measure of anxiety, whether it is a self-report measure or an observational scale, cannot
provide a pure measure of anxiety. Observable variables are affected by measurement error. Measurement error refers to the fact that scores often will not be identical if the same measure is given on two occasions or if equivalent forms of the measure are given on a single occasion. In addition, most observable variables are affected by method variance, with the results obtained using a method such as self-report often differing from the results obtained using a different method such as an observational rating scale. Latent variable methodologies provide a means of extracting a relatively pure measure of a construct from observed variables, one that is uncontaminated by measurement error and method variance. The basic idea is to capture the common or shared variance among multiple observable variables or indicators of a construct. Because measurement error is by definition unique variance, it is not captured in the latent variable. Technically, this is true only when the observed indicators (a) are obtained on different measurement occasions, (b) have different content, and (c) have different raters if subjective scoring is involved. Otherwise, they will share a source of measurement error that can be captured by a latent variable. When the observed indicators represent multiple methods, the latent variables also can be measured relatively free of method variance. This entry discusses two types of methods for obtaining latent variables: exploratory and confirmatory. In addition, this entry explores the use of latent variables in future research.

Exploratory Methods

Latent variables are linear composites of observed variables. They can be obtained by exploratory or confirmatory methods. Two common exploratory methods for obtaining latent variables are factor analysis and principal components analysis. Both approaches are exploratory in that no hypotheses typically are proposed in advance about the number of latent variables or which indicators will be associated with which latent variables. In fact, the full solutions of factor analyses and principal components analyses have as many latent variables as there are observed indicators and allow all indicators to be associated with all latent variables. What makes exploratory methods useful is when most of the shared variance among observed indicators can be accounted for by a relatively small number of latent variables.

The measure of the degree to which an indicator is associated with a latent variable is the indicator's loading on the latent variable. An inspection of the pattern of loadings and other statistics is used to identify latent variables and the observed variables that are associated with them. Principal components are latent variables that are obtained from an analysis of a typical correlation matrix with 1s on the diagonal. Because the variance on the diagonal of a correlation matrix is a composite of common variance and unique variance including measurement error, principal components differ from factors in that they capture unique as well as shared variance among the indicators. Because all variance is included in the analysis and exact scores are available, principal components analysis primarily is useful for "boiling down" a large number of observed variables into a manageable number of principal components.

In contrast, the factors that result from factor analysis are latent variables obtained from an analysis of a correlation matrix after replacing the 1s on the diagonal with estimates of each observed variable's shared variance with the other variables in the analysis. Consequently, factors capture only the common variance among observed variables and exclude measurement error. Because of this, principal factor analysis is better for exploring the underlying factor structure of a set of observed variables.

Confirmatory Methods

Latent variables can also be identified using confirmatory methods such as confirmatory factor analysis and structural equation models with latent variables, and this is where the real power of latent variables is unleashed. Similar to exploratory factor analysis, confirmatory factor analysis captures the common variance among observed variables. However, predictions about the number of latent variables and about which observed indicators are associated with them are made a priori (i.e., prior to looking at the results) based on theory and prior research. Typically, observed indicators are only associated with a single latent variable when confirmatory methods are used. The a priori predictions about the number of latent variables and which indicators are associated with them can be
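The contrast between components (analyzing the correlation matrix with 1s on the diagonal) and factors (replacing the diagonal with communality estimates) can be seen numerically. The sketch below is an illustration with invented loadings, not an example from this entry; it uses squared multiple correlations (SMCs), one common communality estimate, for a single principal-axis step:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=n)                   # one latent variable
loadings = np.array([0.8, 0.7, 0.6, 0.5])     # assumed loadings
x = np.outer(latent, loadings) + rng.normal(0, 0.5, (n, 4))  # 4 indicators

r = np.corrcoef(x, rowvar=False)              # correlation matrix, 1s on diagonal

# principal components analyze r as is, so unique variance is included
pc_eigs = np.linalg.eigvalsh(r)[::-1]

# principal-axis factoring replaces the diagonal with SMC communality estimates
smc = 1 - 1 / np.diag(np.linalg.inv(r))
r_reduced = r.copy()
np.fill_diagonal(r_reduced, smc)
fa_eigs = np.linalg.eigvalsh(r_reduced)[::-1]

# the first factor eigenvalue is smaller than the first component eigenvalue
# because the unique variance has been removed from the analysis
print(pc_eigs[0], fa_eigs[0])
```

The first component absorbs shared and unique variance alike, while the reduced matrix credits the factor only with the variance the indicators have in common, which is the distinction the text draws between the two exploratory methods.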
   1 2 3 4         1 2 3 4
1  A C D B      1  1 3 4 2

This table tells us that the plant at row 1 and column 1 will use fertilizer A, the plant at row 1 and column 2 will use fertilizer C, and so on. If we rename A to 1, B to 2, C to 3, and D to 4, then we obtain the square in the (b) portion of the table, which is identical to the square in the chess schedule.

In mathematics, all three squares in the previous section (without the row and column names) are called a Latin square of order four. The name Latin square originates from mathematicians of the 19th century like Leonhard Euler, who used Latin characters as symbols.

Various Definitions

The left-cancellation law states that no symbol appears in any column more than once, and the right-cancellation law states that no symbol appears in any row more than once. The unique-image property states that each cell of the square can hold at most one symbol.

A Latin square can also be defined by a set of triples. Let us look at the first Latin square [i.e., square (a)]:

1 2 3     1 2 3
2 3 1     3 1 2
3 1 2     2 3 1
 (a)       (b)

The numbers of nonisotopic and reduced Latin squares grow rapidly with the order n:

n =           1  2  3  4   5      6           7               8                    9                       10
Nonisotopic   1  1  1  2   2     22         563       1,676,267      115,618,721,533      208,904,371,354,363,006
Reduced       1  1  1  4  56  9,408   16,942,080   535,281,401,856        3.77 × 10^18            7.58 × 10^25
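The cancellation laws amount to a simple mechanical check: every symbol must occur exactly once in each row and each column. A minimal sketch (the function name and the hard-coded order-3 squares (a) and (b) are this sketch's own choices):

```python
def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(range(1, n + 1))
    rows_ok = all(set(row) == symbols for row in square)       # right cancellation
    cols_ok = all({square[i][j] for i in range(n)} == symbols  # left cancellation
                  for j in range(n))
    return rows_ok and cols_ok

a = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
b = [[1, 2, 3], [3, 1, 2], [2, 3, 1]]
print(is_latin_square(a), is_latin_square(b))              # True True
print(is_latin_square([[1, 1, 2], [2, 3, 1], [3, 2, 3]]))  # False: repeated symbols
```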
Similarly, we might have more than one hole. For instance, (c) is a holey Latin square of order 6, with holes {1,2}, {3,4}, and {5,6}. Holey Latin squares with multiple holes (not necessarily mutually disjoint or of the same size) can be defined similarly using the triple representation. Obviously, holey Latin squares are a special case of partial Latin squares. A necessary condition for the existence of a holey Latin square is that the hole size cannot exceed half of the order. There are few results concerning the maximal number of holey Latin squares for various orders.

Orthogonal Latin Squares

Given a Latin square of order n, there are n² entry positions. Given a set of n symbols, there are n² distinct pairs of symbols. If we overlap two Latin squares of order n, we obtain a pair of symbols at each entry position. If the pair at each entry position is distinct compared to the other entry positions, we say the two Latin squares are orthogonal. The following are some pairs of orthogonal Latin squares of small orders. There is a pair of numbers in each entry; the first of these comes from the first square and the second from the other square.

1,1  2,3  3,2        1,1  2,3  3,4  4,2
2,2  3,1  1,3        2,2  1,4  4,3  3,1
3,3  1,2  2,1        3,3  4,1  1,2  2,4
     (a)             4,4  3,2  2,1  1,3
                          (b)

The orthogonality of Latin squares is perhaps the most important property in the study of Latin squares. One problem of great interest is to prove the existence of a set of mutually orthogonal Latin squares (MOLS) of a certain order. This can be demonstrated by Euler's 36 Officers Problem, in which one attempts to arrange 36 officers of 6 different ranks and 6 different regiments into a square so that each line contains 6 officers of different ranks and regiments. If the ranks and regiments of these 36 officers arranged in a square are represented, respectively, by two Latin squares of order six, then Euler's 36 Officers Problem asks whether two orthogonal Latin squares of order 6 exist. Euler went on to conjecture that such an n × n array does not exist for n = 6, and one does not exist whenever n ≡ 2 (mod 4). This was known as the Euler conjecture until its disproof in 1959. At first, Raj C. Bose and Sharadchandra S. Shrikhande found some counterexamples; the next year, Ernest Tilden Parker, Bose, and Shrikhande were able to construct a pair of orthogonal order 10 Latin squares, and they provided a construction for the remaining even values of n that are not divisible by 4 (of course, excepting n = 2 and n = 6). Today's computer software can find a pair of such Latin squares in no time. However, it remains a great challenge to find a set of three mutually orthogonal Latin squares of order 10.

Let M(n) be the maximum number of Latin squares in a set of MOLS of order n. The following results are known: If n is a prime power, that is, n = p^e, where p is a prime, then M(n) = n − 1. For small n > 6 that is not a prime power, we do not know the exact value of M(n) except for the lower bounds given in the following table:

n      6   10   12   14   15   18   20   21   22   24
M(n)   1  ≥ 2  ≥ 5  ≥ 3  ≥ 4  ≥ 3  ≥ 4  ≥ 5  ≥ 3  ≥ 5
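The prime case of M(n) = n − 1 can be illustrated with the classical cyclic construction Lk(i, j) = ki + j (mod n), which yields n − 1 mutually orthogonal squares when n is prime. The helper names below are this sketch's own, and 0-based symbols are used for convenience:

```python
def cyclic_square(n, k):
    """L_k(i, j) = (k*i + j) mod n, a Latin square whenever gcd(k, n) = 1."""
    return [[(k * i + j) % n for j in range(n)] for i in range(n)]

def are_orthogonal(a, b):
    """Overlaying orthogonal squares yields n^2 distinct pairs of symbols."""
    n = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n

n = 5                                   # a prime, so M(5) = 4
mols = [cyclic_square(n, k) for k in range(1, n)]
print(all(are_orthogonal(mols[u], mols[v])
          for u in range(len(mols)) for v in range(u + 1, len(mols))))  # True
```

The same overlap test applied to order 6 reflects Tarry's verdict on the 36 Officers Problem: no pair of order-6 Latin squares passes it.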
MOLS can be used to design experiments. Suppose a drug company has four types of headache drugs, four types of fever drugs, and four types of cough drugs. To design a new cold medicine, the company wants to test the combinations of these three kinds of drugs. In a test, three drugs (not of the same type) will be used simultaneously. Can we design only 16 tests so that every pair of drugs (not of the same type) will be tested? The answer is yes, as we have a set of three MOLS of order four. A pair of MOLS is equivalent to a transversal design of index one.

People are also interested in whether a Latin square is orthogonal to its conjugate and in the existence of mutually orthogonal holey Latin squares. For instance, for the two orthogonal squares of order 5 in the beginning of this section [i.e., (c)], one is the other's (2,1,3) conjugate. It has been known that a Latin square exists that is orthogonal to its (2,1,3), (1,3,2), and (3,2,1) conjugates for all orders except 2, 3, and 6; a Latin square exists that is orthogonal to its (2,3,1) and (3,1,2) conjugates for all orders except 2, 3, 4, 6, and 10. For holey Latin squares, the result is less conclusive.

Applications

The two examples in the beginning of this entry show that Latin squares can be used for tournament scheduling and experiment design. This strategy has also been used for designing puzzles and tests. As a matching procedure, Latin squares relate to problems in graph theory, job assignment (or the Marriage Problem), and, more recently, processor scheduling for massively parallel computer systems. Algorithms for solving the Marriage Problem are also used in linear algebra to reduce matrices to block diagonal form.

Latin squares have rich connections with many fields of design theory. A Latin square is also equivalent to a (3,n) net, an orthogonal array of strength two and index one, a 1-factorization of the complete bipartite graph Kn,n, an edge-partition of the complete tripartite graph Kn,n,n into triangles, a set of n² mutually nonattacking rooks on an n × n × n board, and a single error-detecting code of word length 3 with n² words from an n-symbol alphabet.

Hantao Zhang

Note: Partially supported by the National Science Foundation under Grant CCR-0604205.

Further Readings

Bennett, F., & Zhu, L. (1992). Conjugate-orthogonal Latin squares and related structures. In J. Dinitz & D. Stinson (Eds.), Contemporary design theory: A collection of surveys. New York: John Wiley.
Bose, R. C., & Shrikhande, S. S. (1960). On the construction of sets of mutually orthogonal Latin squares and the falsity of a conjecture of Euler. Transactions of the American Mathematical Society, 95, 191–209.
Cayley, A. (1890). On Latin squares. Oxford Cambridge Dublin Messenger of Mathematics, 19, 135–137.
Colbourn, C. J. (1984). The complexity of completing partial Latin squares. Discrete Applied Mathematics, 8, 25–30.
Colbourn, C. J., & Dinitz, J. H. (Eds.). (1996). The CRC handbook of combinatorial designs. Boca Raton, FL: CRC Press.
Euler, L. (1849). Recherches sur une nouvelle espèce de carrés magiques. Commentationes Arithmeticae Collectae, 2, 302–361.
Evans, T. (1975). Algebraic structures associated with Latin squares and orthogonal arrays. Proceedings of Conf. on Algebraic Aspects of Combinatorics, Congressus Numerantium, 13, 31–52.
Hedayat, A. S., Sloane, N. J. A., & Stufken, J. (1999). Orthogonal arrays: Theory and applications. New York: Springer-Verlag.
Mandl, R. (1985). Orthogonal Latin squares: An application of experiment design to compiler testing. Communications of the ACM, 28, 1054–1058.
McKay, B. D., & McLeod, J. C. (2006). The number of transversals in a Latin square. Designs, Codes and Cryptography, 40, 269–284.
Royle, G. (n.d.). Minimum Sudoku. Retrieved January 27, 2010, from http://people.csse.uwa.edu.au/gordon/sudokumin.php
Tarry, G. (1900). Le problème des 36 officiers. Compte Rendu de l'Association Française pour l'Avancement des Sciences, 1, 122–123.
Zhang, H. (1997). Specifying Latin squares in propositional logic. In R. Veroff (Ed.), Automated reasoning and its applications: Essays in honor of Larry Wos. Cambridge: MIT Press.

LAW OF LARGE NUMBERS

The Law of Large Numbers states that larger samples provide better estimates of a population's
parameters than do smaller samples. As the size of a sample increases, the sample statistics approach the value of the population parameters. In its simplest form, the Law of Large Numbers is sometimes stated as the idea that bigger samples are better. After a brief discussion of the history of the Law of Large Numbers, the entry discusses related concepts and provides a demonstration and the mathematical formula.

History

Jakob Bernoulli first proposed the Law of Large Numbers in 1713 as his "Golden Theorem." Since that time, numerous other mathematicians (including Siméon-Denis Poisson, who first coined the term Law of Large Numbers in 1837) have proven the theorem and considered its application in games of chance, sampling, and statistical tests. Understanding the Law of Large Numbers is fundamental to understanding the essence of inferential statistics, that is, why one can use samples to estimate population parameters. Despite its primary importance, it is often not fully understood. Consequently, the understanding of the concept has been the topic of numerous studies in mathematics education and cognitive psychology.

samples of only 10 randomly selected men, it is easily possible to get an unusually tall group of 10 men or an unusually short group of men. Additionally, in such a small group, one outlier, for example a man who is 85 inches tall, can have a large effect on the sample mean. However, if samples of 100 men were drawn from the population, the means of those samples would vary less than the means from the samples of 10 men. It is much more difficult to select 100 tall men randomly from the population than it is to select 10 tall men randomly. Furthermore, if samples of 1,000 men are drawn, it is extremely unlikely that 1,000 tall men will be randomly selected. The mean heights for those samples would vary even less than the means from the samples of 100 men. Thus, as sample sizes increase, the variability between sample statistics decreases. The sample statistics from larger samples are, therefore, better estimates of the true population parameters.

Demonstration

If a fair coin is flipped a million times, we expect that 50% of the flips will result in heads and 50% in tails. Imagine having five people flip a coin 10 times so that we have five samples of 10 flips. Suppose that the five flippers yield the following results:

samples will have 10% or 20% heads. In fact, we would quickly observe that although the mean of the sample statistics will be equal to the population mean of 50% heads, the sample statistics will vary much less than did the statistics for samples of 10 flips.

Further Readings

Ferguson, T. S. (1996). A course in large sample theory. New York: Chapman & Hall.
Hald, A. (2007). A history of parametric statistical inference from Bernoulli to Fisher. New York: Springer-Verlag.
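The shrinking sample-to-sample variability described in this entry is easy to reproduce by simulation (a sketch; the sample sizes and replication count are arbitrary choices of this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 replications of fair-coin samples of increasing size: the mean
# proportion of heads stays near .50 while its sample-to-sample spread shrinks
sds = []
for flips in (10, 100, 1000):
    heads = rng.binomial(flips, 0.5, size=10_000) / flips
    sds.append(heads.std())
    print(f"{flips:5d} flips: mean {heads.mean():.3f}, SD {heads.std():.4f}")
```

The standard deviations come out near .158, .050, and .016, matching the theoretical value sqrt(.5 × .5 / n) for each sample size.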
which, however, did not diminish the popularity of this technique.

The use of LSM in a modern statistical framework can be traced to Sir Francis Galton, who used it in his work on the heritability of size, which laid down the foundations of correlation and (also gave the name to) regression analysis. The two antagonistic giants of statistics, Karl Pearson and Ronald Fisher, who did so much in the early development of statistics, used and developed it in different contexts (factor analysis for Pearson and experimental design for Fisher). Nowadays, the LSM exists with several variations: Its simpler version is called ordinary least squares (OLS), and a more sophisticated version is called weighted least squares (WLS), which often performs better than OLS because it can modulate the importance of each observation in the final solution. Recent variations of the least squares method are alternating least squares (ALS) and partial least squares (PLS).

Functional Fit Example: Regression

The oldest (and still the most frequent) use of OLS was linear regression, which corresponds to the problem of finding a line (or curve) that best fits a set of data points. In the standard formulation, a set of N pairs of observations {Yi, Xi} is used to find a function relating the value of the dependent variable (Y) to the values of an independent variable (X). With one variable and a linear function, the prediction is given by the following equation:

Ŷ = a + bX.  (1)

This equation involves two free parameters that specify the intercept (a) and the slope (b) of the regression line. The least-squares method defines the estimate of these parameters as the values that minimize the sum of the squares (hence the name least squares) between the measurements and the model (i.e., the predicted values). This amounts to minimizing the expression

E = Σi (Yi − Ŷi)² = Σi [Yi − (a + bXi)]²,  (2)

where E stands for "error," which is the quantity to be minimized. The estimation of the parameters is obtained using basic results from calculus and, specifically, uses the property that a quadratic expression reaches its minimum value when its derivatives vanish. Taking the derivatives of E with respect to a and b and setting them to zero gives the following set of equations (called the normal equations):

∂E/∂a = 2Na + 2b ΣXi − 2 ΣYi = 0  (3)

and

∂E/∂b = 2b ΣXi² + 2a ΣXi − 2 ΣYiXi = 0.  (4)

Solving the normal equations gives the following least-squares estimates of a and b:

a = MY − bMX  (5)

with MY and MX denoting the means of Y and X, and

b = Σ(Yi − MY)(Xi − MX) / Σ(Xi − MX)².  (6)

OLS can be extended to more than one independent variable (using matrix algebra) and to nonlinear functions.

The Geometry of Least Squares

OLS can be interpreted in a geometrical framework as an orthogonal projection of the data vector onto the space defined by the independent variables. The projection is orthogonal because the predicted values and the prediction errors are uncorrelated. This is illustrated in Figure 1, which depicts the case of two independent variables (vectors x1 and x2) and the data vector (y), and it shows that the error vector (y − ŷ) is orthogonal to the least-squares estimate (ŷ), which lies in the subspace defined by the two independent variables.
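Equations 5 and 6, and the orthogonality property just described, can be checked numerically on toy data (the data below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(10, dtype=float)
y = 3.0 + 0.5 * x + rng.normal(0, 0.1, 10)   # toy data around a known line

mx, my = x.mean(), y.mean()
b = ((y - my) * (x - mx)).sum() / ((x - mx) ** 2).sum()   # Equation 6
a = my - b * mx                                           # Equation 5

slope, intercept = np.polyfit(x, y, 1)        # a library fit gives the same values
print(np.allclose([a, b], [intercept, slope]))            # True

# the geometry: the error vector is orthogonal to the predicted values
residuals = y - (a + b * x)
print(abs((residuals * (a + b * x)).sum()) < 1e-8)        # True
```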
See also Bivariate Regression; Correlation; Multiple Regression; Pearson Product-Moment Correlation Coefficient

Further Readings

Abdi, H., Valentin, D., & Edelman, B. E. (1999). Neural networks. Thousand Oaks, CA: Sage.
Bates, D. M., & Watts, D. G. (1988). Nonlinear regression analysis and its applications. New York: John Wiley.
Greene, W. H. (2002). Econometric analysis. New York: Prentice Hall.
Harper, H. L. (1974). The method of least squares and some alternatives. Part I, II, III, IV, V, VI. International Statistical Review, 42, 147–174; 235–264.
Harper, H. L. (1975). The method of least squares and some alternatives. Part I, II, III, IV, V, VI. International Statistical Review, 43, 1–44; 125–190; 269–272.
Harper, H. L. (1976). The method of least squares and some alternatives. Part I, II, III, IV, V, VI. International Statistical Review, 44, 113–159.
Nocedal, J., & Wright, S. (1999). Numerical optimization. New York: Springer-Verlag.
Plackett, R. L. (1972). The discovery of the method of least squares. Biometrika, 59, 239–251.
Seal, H. L. (1967). The historical development of the Gauss linear model. Biometrika, 54, 1–23.

LEVELS OF MEASUREMENT

How things are measured is of great importance, because the method used for measuring the qualities of a variable gives researchers information about how one should be interpreting those measurements. Similarly, the precision or accuracy of the measurement used can lead to differing outcomes of research findings, and it could potentially limit the statistical analyses that could be performed on the data collected.

Measurement is generally described as the assignment of numbers or labels to qualities of a variable or outcome by following a set of rules. There are a few important items to note in this definition. First, measurement is described as an assignment because the researcher decides what values to assign to each quality. For instance, on a football team, the coach might assign each team member a number. The actual number assigned does not necessarily have any significance, as player #12 could just as easily have been assigned #20 instead. The important point is that each player was assigned a number. Second, it is also important to notice that the number or label is assigned to a quality of the variable or outcome. Each thing that is measured generally measures only one aspect of that variable. So one could measure an individual's weight, height, intelligence, or shoe size, and one would discover potentially important information about an aspect of that individual. However, just knowing a person's shoe size does not tell everything there is to know about that individual. Only one piece of the puzzle is known. Finally, it is important to note that the numbers or labels are not assigned willy-nilly but rather according to a set of rules. Following these rules keeps the assignments constant, and it allows other researchers to feel confident that their variables are measured using a similar scale to other researchers, which makes the measurements of the same qualities of variables comparable.

These scales (or levels) of measurement were first introduced by Stanley Stevens in 1946. As a psychologist who had been debating with other scientists and mathematicians on the subject of measurement, he proposed what is referred to today as the levels of measurement to bring all interested parties to an agreement. Stevens wanted researchers to recognize that different varieties of measurement exist and that types of measurement fall into four proposed classes. He selected the four levels through determining what was required to measure each level as well as what statistical processes could reasonably be performed with variables measured at those levels. Although much debate has ensued on the acceptable statistical processes (which are explored later), the four levels of measurement have essentially remained the same since their proposal so many years ago.

The Four Levels of Measurement

Nominal

The first level of measurement is called nominal. Nominal-level measurements are names or category labels. The name of the level, nominal, is said to derive from the word nomin-, which is a Latin prefix meaning name. This fits the level very well, as the goal of the first level of measurement is
to assign classifications or names to qualities of variables. If "type of fruit" was the variable of interest, the labels assigned might be bananas, apples, pears, and so on. If numbers are used as labels, they are significant only in that their numbers are different but not in amount. For example, for the variable of gender, one might code males = 1 and females = 2. This does not signify that there are more females than males, or that females have more of any given quality than males. The numbers assigned as labels have no inherent meaning at the nominal level. Every individual or item that has been assigned the same label is treated as if they are equivalent, even if they might differ on other variables. Note also from the previous examples that the categories at the nominal level of measurement are discrete, which means mutually exclusive. A variable cannot be both male and female in this example, only one or the other, much as one cannot be both an apple and a banana. The categories must not only be discrete, but also they must be exhaustive. That is, all participants must fit into one (and only one) category. If participants do not fit into one of the existing categories, then a new category must be created for them. Nominal-level measurements are the least precise level of measurement and as such, tell us the least about the variable being measured. If two items are measured on a nominal scale, then it would be possible to determine whether they are the same (do they have the same label?) or different (do they have different labels?), but it would not be possible to identify whether one is different from the other in any quantitative way. Nominal-level measurements are used primarily for the purposes of classification.

Ordinal

The second level of measurement is called ordinal. Ordinal-level measurements are in some form of order. The name, ordinal, is said to derive from the word ordin-, which is a Latin prefix meaning order. The purpose of this second level of measurement is to rank the size or magnitude of the qualities of the variables. For example, the order of finish of the Kentucky Derby might be Big Brown #1, Eight Belles #2, and Denis of Cork #3. In the ordinal level of measurement, there is not only category information, as in the nominal level, but also rank information. At this level, it is known that the participants are different and which participant is better (or worse) than another participant. Ordinal measurements convey information about order but still do not speak to amount. It is not possible to determine how much better or worse participants are at this level. So, Big Brown might have come in first, the second place finisher came in 4 3/4 lengths behind, and the third place finisher came in 3 1/2 lengths behind the second place finisher. However, because the time between finishers is different, one cannot determine from the rankings alone (1st, 2nd, and 3rd) how much faster one horse is than another horse. Because the differences between rankings do not have a constant meaning at this level of measurement, researchers might determine that one participant is greater than another but not how much greater he or she is. Ordinal measurements are often used in educational research when examining percentile ranks or when using Likert scales, which are commonly used for measuring opinions and beliefs on what is usually a 5-point scale.

Interval

The third level of measurement is called interval. Interval-level measurements are created with each interval exactly the same distance apart. The name, interval, is said to derive from the words inter-, which is a Latin prefix meaning between, and vallum, which is a Latin word meaning ramparts. The purpose of this third level of measurement is to allow researchers to compare how much greater participants are than each other. For example, a hot day that is measured at 96 °F is 20° hotter than a cooler day measured at 76 °F, and that is the same distance as an increase in temperature from 43 °F to 63 °F. Now, it is possible to say that the first day is hotter than the second day and also how much hotter it is. Interval is the lowest level of measurement that allows one to talk about amount. A piece of information that the interval scale does not provide to researchers is a true zero point. On an interval-level scale, whereas there might be a marking for zero, it is just a placeholder. On the Fahrenheit scale, zero degrees does not mean that there is no heat, because it is possible to measure negative degrees. Because of the lack of an absolute zero at this level of measurement, although it is determinable how much
greater one score is from another, one cannot determine whether a score is twice as big as another. So if Jane scores a 6 on a vocabulary test, and John scores a 2, it does not mean that Jane knows 3 times as much as John, because the zero point on the test is not a true zero, but rather an arbitrary one. Similarly, a zero on the test does not mean that the person being tested has zero vocabulary ability. There is some controversy as to whether the variables that we measure for aptitude, intelligence, achievement, and other popular educational tests are measured at the interval or ordinal levels.

Ratio

The fourth level of measurement is called ratio. Ratio-level measurements are unique among all the other levels of measurements because they have an absolute zero. The name ratio is said to derive from the Latin word ratio, meaning calculation. The purpose of this last level of measurement is to allow researchers to discuss not only differences in magnitude but also ratios of magnitude. For example, for the ratio-level variable of weight, one can say that an object that weighs 40 pounds is twice as heavy as an object that weighs 20 pounds. This level of measurement is so precise, it can be difficult to find variables that can be measured using this level. To use the ratio level, the variable of interest must have a true zero. Many social science and education variables cannot be measured at this level because they simply do not have an absolute zero. It is fairly impossible to have zero self-esteem, zero intelligence, or zero spelling ability, and as such, none of those variables can be measured on a ratio level. In the hard sciences, more variables are measurable at this level. For example, it is possible to have no weight, no length, or no time left. Similar to interval-level scales, very few education and social science variables are measured at the ratio level. One of the few common variables at this measure is reaction time, which is the amount of time that passes between when a stimulus happens and when a reaction to it is noted. Other common occurrences of variables that are measured at this level happen when the variable is measured by counting. So the number of errors, number of items correct, number of cars in a parking lot, and number of socks in a drawer are all measured at the ratio level, because it is possible to have a true zero of each (no socks, no cars, no items correct, and no errors).

Commonalities in the Levels of Measurement

Even as each level of measurement is defined by following certain rules, the levels of measurement as a whole also have certain rules that must be followed. All possible variables or outcomes can be measured at at least one level of measurement. Levels of measurement are presented in an order from least precise (and therefore least descriptive) to most precise (and most descriptive). Within this order, each level of measurement follows all of the rules of the levels that preceded it. So, whereas the nominal level of measurement only labels categories, all the levels that follow also have the ability to label categories. Each subsequent level retains all the abilities of the level that came before. Also, each level of measurement is more precise than the ones before, so the interval level of measurement is more exact in what it can measure than the ordinal or nominal levels of measurement. Researchers generally believe that any outcome or variable should be measured at the most precise level possible, so in the case of a variable that could be measured at more than one level of measurement, it would be more desirable to measure it at the highest level possible. For example, one could measure the weight of items in a nominal scale, assigning the first item as "1," the second item as "2," and so on. Or, one could measure the same items on an ordinal scale, assigning the labels of "light" and "heavy" to the different items. One could also measure weight on an interval scale, where one might set zero at the average weight, and items would be labeled based on how their weights differed from the average. Finally, one could measure each item's weight, where zero means no weight at all, as weight is normally measured on a ratio scale. Using the highest level of measurement provides researchers with the most precise information about the actual quality of interest, weight.

What Can Researchers Do With Different Levels of Measurement?

Levels of measurement tend to be treated flexibly by researchers. Some researchers believe that there
are specific statistical analyses that can only be done at higher levels of measurement, whereas others feel that the level of measurement of a variable has no effect on the allowable statistics that can be performed. Researchers who believe in the statistical limitations of certain levels of measurement might also be fuzzy on the lines between the levels. For example, many students learn about a level of measurement called quasi-interval, on which variables are measured on an ordinal scale but treated as if they were measured at the interval scale for the purposes of statistical analysis. As many education and social science tests collect data on an ordinal scale, and because using an ordinal scale might limit the statistical analyses one could perform, many researchers prefer to treat the data from those tests as if they were measured on an interval scale, so that more advanced statistical analyses can be done.

What Statistics Are Appropriate?

Along with setting up the levels of measurement as they are currently known, Stevens also suggested appropriate statistics that should be permitted to be performed at each level of measurement. Since that time, as new statistical procedures have been developed, this list has changed and expanded. The appropriateness of some of these procedures is still under debate, so one should examine the assumptions of each statistical analysis carefully before conducting it on data of any level of measurement.

Nominal

When variables are measured at the nominal level, one might count the number of individuals or items that are classified under each label. One might also calculate central tendency in the form of the mode. Another common calculation that can be performed is the chi-square correlation, which is otherwise known as the contingency correlation. Some more qualitative analyses might also be performed with this level of data.

Ordinal

For variables measured at the ordinal level, all the calculations of the previous level, nominal, might be performed. In addition, percentiles might be calculated, although with caution, as some methods used for the calculation of percentiles assume the variables are measured at the interval level. The median might be calculated as a measure of central tendency. Quartiles might be calculated, and some additional nonparametric statistics might be used. For example, Spearman's rank-order correlation might be used to calculate the correlation between two variables measured at the ordinal level.

Interval

At the interval level of measurement, almost every statistical tool becomes available. All the previously allowed tools might be used, as well as many that can only be properly used starting at this level of measurement. The mean and standard deviation, both frequently used calculations of central tendency and variability, respectively, become available for use at this level. The only statistical tools that should not be used at this level are those that require the use of ratios, such as the coefficient of variation.

Ratio

All statistical tools are available for data at this level.

Controversy

As happens in many cases, once someone codifies a set of rules or procedures, others proceed to put forward statements about why those rules are incorrect. In this instance, Stevens approached his proposed scales of measurement with the idea that certain statistics should be allowed to be performed only on variables that had been measured at certain levels of measurement, much as has been discussed previously. This point of view has been called measurement directed, which means that the level of measurement used should guide the researcher as to which statistical analysis is appropriate. On the other side of the debate are researchers who identify as measurement independent. These researchers believe that it is possible to conduct any type of statistical analysis, regardless of the variable's level of measurement.
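The pairing of levels with admissible statistics discussed above can be sketched in a few lines of standard-library code. The data below are invented, and the Spearman computation uses the textbook shortcut formula for untied ranks rather than a library implementation:

```python
# Minimal sketch of statistics matched to levels of measurement.
# All data values are hypothetical.
from statistics import mode, median, mean, stdev

# Nominal: labels only, so count categories and report the mode.
fruit = ["apple", "banana", "apple", "pear", "apple"]
most_common = mode(fruit)  # "apple"

# Ordinal: order without distance, so use the median and rank correlation.
likert_item = [2, 3, 3, 4, 5]
med = median(likert_item)

def spearman(x, y):
    """Spearman's rho via the untied-ranks shortcut: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    rx = [sorted(x).index(v) + 1 for v in x]  # rank of each value (no ties)
    ry = [sorted(y).index(v) + 1 for v in y]
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

finish_order = [1, 2, 3, 4, 5]   # e.g., race placings
judge_ranking = [1, 3, 2, 5, 4]  # a second ordinal ranking of the same five
rho = spearman(finish_order, judge_ranking)

# Interval: equal distances but no true zero, so means and standard
# deviations are meaningful, while ratio statements are not.
temps_f = [96, 76, 63, 43]
assert temps_f[0] - temps_f[1] == temps_f[2] - temps_f[3]  # same 20-degree gap
avg, spread = mean(temps_f), stdev(temps_f)

# Ratio: a true zero makes ratio statements valid.
assert 40 / 20 == 2  # a 40-lb object is twice as heavy as a 20-lb one
```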
Likelihood Ratio Statistic
… whereas the corresponding likelihood for θ based on the observed success count yOBS is often defined as

L(θ; yOBS) = θ^yOBS (1 − θ)^(n − yOBS).

Now, consider two possible values for the parameter vector θ, say θ0 and θ1. The likelihood ratio statistic L(θ0; yOBS)/L(θ1; yOBS) might be used to determine which of these two candidate values is more "likely" (i.e., which is better supported by the data yOBS). If the ratio is less than one, θ1 is favored; if the ratio is greater than one, θ0 is preferred.

As an example, in a classroom experiment to illustrate simple Mendelian genetic traits, suppose that a student is provided with 20 seedlings that might flower either white or red. She is told to plant these seedlings and to record the colors after germination. Let θ denote the probability of a seedling flowering white. If Y denotes the number of seedlings among the 20 planted that flower white, then Y might be viewed as arising from a binomial distribution with density

f(y; θ) = (20 choose y) θ^y (1 − θ)^(n − y).

The student is told that θ is either θ0 = 0.75 or θ1 = 0.50; she must use the outcome of her experiment to determine the correct probability. After planting the 20 seedlings, she observes that yOBS = 13 flower white. In this setting, the likelihood ratio statistic

L(θ0; yOBS)/L(θ1; yOBS)

equals 1.52. Thus, the likelihood ratio implies that the value θ0 = 0.75 is the more plausible value for the probability θ. Based on the ratio, the student should choose the value θ0 = 0.75.

The likelihood ratio might also be used to formally test two competing point hypotheses H0: θ = θ0 versus H1: θ = θ1. In fact, the Neyman–Pearson Lemma establishes that the power of such a test will be at least as high as the power of any alternative test, assuming that the tests are conducted using the same levels of significance.

A generalization of the preceding test allows one to evaluate two competing composite hypotheses H0: θ ∈ Θ0 versus H1: θ ∈ Θ1. Here, Θ0 and Θ1 refer to disjoint parameter spaces where the parameter vector θ might lie. The conventional test statistic, which is often called the generalized likelihood ratio statistic, is given by

L(θ̂0; yOBS)/L(θ̂; yOBS),

where L(θ̂0; yOBS) denotes the maximum value attained by the likelihood L(θ; yOBS) as the parameter vector θ varies over the space Θ0, and L(θ̂; yOBS) represents the maximum value attained by L(θ; yOBS) as θ varies over the combined space Θ0 ∪ Θ1.

Tests based on the generalized likelihood ratio are often optimal in terms of power. The size of a test refers to the level of significance at which the test is conducted. A test is called uniformly most powerful (UMP) when it achieves a power that is greater than or equal to the power of any alternative test of comparable size. When no UMP test exists, it might be helpful to restrict attention to only those tests that can be classified as unbiased. A test is unbiased when the power of the test never falls below its size [i.e., when Pr(reject H0 | θ ∈ Θ1) ≥ Pr(reject H0 | θ ∈ Θ0)]. A test is called uniformly most powerful unbiased (UMPU) when it achieves a power that is greater than or equal to the power of any alternative unbiased test. The generalized likelihood ratio statistic can often be used to formulate UMP and UMPU tests.

The reliance on the likelihood ratio statistic in statistical inference is largely based on the likelihood principle. Informally, this principle states that all the information in the sample yOBS that is relevant for inferences on the parameter vector θ is contained within the likelihood function L(θ; yOBS). The likelihood principle is somewhat controversial and is not universally held. For instance, neglecting constants that do not involve θ, the same likelihood might result from two different experimental designs. In such instances, likelihood-based inferences would be the same under either design, although tests that incorporate the nature of the design might lead to different conclusions. For instance, consider the preceding genetics example based on simple Mendelian traits. If a student were to plant all n seedlings at one time and to count the number Y that eventually flower white, then the count Y would follow a binomial distribution. However, if the student were to plant
seedlings consecutively one at a time and continue until a prespecified number of seedlings flower red, then the number Y that flower white would follow a negative binomial distribution. Based on the kernel, each experimental design leads to the same likelihood. Thus, if the overall number of seedlings planted n and the observed number of white-flowering seedlings yOBS are the same in each design, then likelihood-based inferences such as the preceding likelihood ratio test would yield identical results. However, tests based on the probability distribution models f(y; θ) could yield different conclusions.

The likelihood ratio L(θ0; yOBS)/L(θ1; yOBS) has a simple Bayesian interpretation. Prior to the collection of data, suppose that the candidate values θ0 and θ1 are deemed equally likely, so that the prior probabilities Pr(θ0) = Pr(θ1) = 0.5 are employed. By Bayes's rule, the ratio of the posterior probabilities for the two parameter values,

Pr(θ0 | yOBS)/Pr(θ1 | yOBS),

corresponds to the likelihood ratio. As this interpretation would suggest, the concept of the likelihood function and the likelihood principle both play prominent roles in Bayesian inference.

Joseph E. Cavanaugh and Eric D. Foster

See also Bayes's Theorem; Directional Hypothesis; Hypothesis; Power; Significance Level, Concept of

Further Readings

Pawitan, Y. (2001). In all likelihood: Statistical modeling and inference using likelihood. Oxford, UK: Oxford University Press.

LIKERT SCALING

Likert (pronounced lick-ert) scaling is a method of attitude, opinion, or perception assessment of a unidimensional variable or a construct made up of multidimensions or subscales. It recognizes the contribution to attitude assessment of Rensis Likert, who published a classic paper on this topic in 1932, based on his doctoral dissertation directed by Gardner Murphy and based on work Murphy had undertaken in 1929. The use of Likert items and scaling is probably the most used survey methodology in educational and social science research and evaluation.

The Likert scale provides a score based on a series of items that have two parts. One part is the stem, which is a statement of fact or opinion to which the respondent is asked to react. The other part is the response scale. Likert was the first recognized for the use of a 5-point, ordinal scale of strongly approve—approve—undecided—disapprove—strongly disapprove. The scale is often changed to other response patterns such as strongly agree—agree—neutral—disagree—strongly disagree. This entry discusses Likert's approach and scoring methodology and examines the research conducted on Likert scaling and its modifications.

Likert's Approach

In Likert's original research, which led to Likert scaling, Likert compared four ways of structuring attitude survey items, believing that there was an alternative to the approach attributed to Louis Leon Thurstone. Although both approaches were based on equal-interval, ordinal stepped scale points, Likert considered Thurstone's methods to be a great deal of work that was not necessary. Setting up a Thurstone scale involved the use of judges to evaluate statements to be included in the survey. This included rank ordering the statements in terms of the expected degree of the attribute being assessed and then comparing and ordering each pair of item possibilities, which is an onerous task if there were many item possibilities. Originally, each item was scored as a dichotomy (agree/disagree or +/−).

A Thurstone scale was scored in a similar manner as Likert's original method using sigma values, which were z scores weighted by the responses corresponding to the assumed equal-interval categories. However, part of the problem with Thurstone's scoring method related to having a spread of judge-determined 1 to 11 scoring categories when scoring the extreme values of 0 or 1 proportions. These could not be adequately accounted for because they were considered as ± infinity z values in a sigma scoring approach and thus were dropped from the scoring. Likert felt
there was another approach that did not rely so much on the use of judges and could include the scoring of items where everyone either did not or did select the extreme score category, by using ± 3.00 z values instead of ± ∞. Thus, Likert set out to use some of the features of a Thurstone scale but simplify the process, hoping to achieve a similar level of reliability to that found with the Thurstone scale. His research met the goals he set out to meet.

A stem or statement was presented related to racial attitudes, and then respondents were asked to respond to one of several response sets. One set used "yes/no" options, another used narrative statements, and two of them used what we now know as a Likert item, using strongly approve—approve—undecided—disapprove—strongly disapprove as the response categories. The distinction between the last two types of items relates to the source of the questions: one set was developed specifically by Likert for assessing attitudes, and the other consisted of abbreviations of newspaper articles reflecting societal conflicts among race-based groups.

Likert's Scoring Methodology

Likert found that many items he used had distributions resembling a normal distribution. He concluded that if these distributions resembled a normal distribution, it was legitimate to determine a single unidimensional scale value by finding the mean or sum of the items and using that for a value that represented the attitude, opinion, or perception of the variable on a continuum.

Sigma values were z scores weighted by the use of responses to the five categories. These were then used by item to estimate score reliabilities (using split-half and test–retest approaches), which were found to be high. Likert also demonstrated a high level of concurrent validity between his approach and Thurstone's approach, even though he had used only about half the number of items that Thurstone had used.

Likert sang the praises of the sigma scoring technique. However, he also discovered that simply summing the scores resulted in about the same degree of score reliability, for both split-half and test–retest score reliabilities, as the sigma approach. Thus was born the concept of Likert scaling, involving the use of the mean or sum of scores from a set of items to represent a position on the attitude variable continuum.

Although Likert focused on developing a unidimensional scale, applications of his methodology have been used to develop multidimensional scales that include subscales intended to assess attitudes and opinions on different aspects of the construct of interest.

Modifications

Many modifications of Likert scaling use a wide variety of response sets, similar or not so similar to the ones Likert used, that are called Likert-type items. It seems that almost any item response set that includes ordered responses in a negative and positive direction gets labeled as a Likert item. More than 35 variations of response sets have been identified, even though many of them vary considerably from Likert's original response set. Some even use figures such as smiley or frowny faces instead of narrative descriptions or abbreviations. These scales have often been used with children.

Research

Controversial Issues Related to Likert Items and Scales

Research on Likert item stem and response construction, overall survey design, methods of scoring, and various biases has been extensive; it is probably one of the most researched topics in social science. There are many controversial issues and debates about using Likert items and scales, including the reading level of respondents, item reactivity, the length or number of items, the mode of delivery, the number of responses, using an odd or even number of responses, labeling of a middle response, the direction of the response categories, dealing with missing data, the lack of attending behaviors, acquiescence bias, central tendency bias, social desirability bias, the use of parametric methods or nonparametric methods when comparing scale indicators of central tendency (median or mean), and, probably most controversial, the use of negatively worded items. All of these have the potential for influencing score reliability and
validity, some more than others, and a few actually increase the estimate of reliability.

Because Likert surveys are usually in the category of "self-administered" surveys, the reading level of respondents must be considered. Typically, a reading level of at least 5th grade is often considered a minimal reading level for surveys given to most adults in the general population. Often, Edward Fry's formula is used to assess the reading level of a survey. A companion issue is when surveys are translated from one language to another. This can be a challenging activity that, if not done well, can reduce score reliability. Related to this is the potential for reducing reliability and validity when items are reactive or stir up emotions in an undesirable manner that can confound the measure of the attitudes of interest. Sometimes, Likert survey items are read to respondents in cases where reading level might be an issue or clearly when a Likert scale is used in a telephone survey. Often, when reading Likert response options over the phone, it is difficult for some respondents to keep the categories in mind, especially if they change in the middle of the survey. Other common modes of delivery now include online Likert surveys used for myriad purposes.

Survey length can also affect reliability. Even though one way of increasing score reliability is to lengthen a survey, making the survey too long and causing fatigue or frustration will have the opposite effect.

One issue that often comes up is deciding on the number of response categories. Most survey researchers feel three categories might be too few and more than seven might be too many. Related to this issue is whether to include an odd or even number of response categories. Some feel that using an even number of categories forces the respondent to choose one directional opinion or the other, even if mildly so. Others feel there should be an odd number of responses and the respondent should have a neutral or nonagree or nondisagree opinion. If there are an odd number of response categories, then care must be used in defining the middle category. It should represent a point of the continuum such as neither approve nor disapprove, neither agree nor disagree, or neutral. Responses such as does not apply or cannot respond do not fit the ordinal continuum. These options can be used, but it is advisable not to put these as midpoints on an ordinal continuum or to give them a score for scaling purposes.

Another issue is the direction of the response categories. Options are to have the negative response set on the left side of the scale moving to the right becoming more positive, or having the positive response set on the left becoming more negative as the scale moves from left to right. There does not seem to be much consensus on which is better, so often the negative left to positive right is preferred.

Missing data are as much an issue in Likert scaling as in all other types of research. Often, a decision needs to be made relative to how many items need to be completed for the survey to be considered viable for inclusion in the data set. Whereas there are no hard-and-fast rules for making this decision, most survey administrators would consider a survey with fewer than 80% of the items completed not to be a viable entry. There are a few ways of dealing with missing data when there are not a lot of missed responses. The most common is to use the mean of the respondent's responses on the completed items as a stand-in value. This is done automatically if the scored value is the mean of the answered responses. If the sum of items is the scale value, any missing items will need to have the mean of the answered items imputed into the missing data points before summing the items to get the scaled score.

Response Bias

Several recognized biased responses can occur with Likert surveys. Acquiescence bias is the tendency of the respondent to provide positive responses to all or almost all of the items. Of course, it is hard to separate acquiescence bias from reasoned opinions for these respondents. Often, negatively worded Likert stems are used to determine whether this is happening, based on the notion that if a respondent responded positively both to items worded in a positive direction as well as a negative direction, then they were more likely to be exhibiting this biased behavior rather than attending to the items. Central tendency bias is the tendency to respond to all or most of the
items with the middle response category. Using an even number of response categories is a strategy often employed to guard against this behavior. Social desirability bias is the tendency for respondents to reply to items in a way that reflects what they believe they would be expected to respond based on societal norms or values rather than their own feelings. Likert surveys on personal attitudes, or on opinions related to behaviors considered by society to be illegal, immoral, unacceptable, or personally embarrassing, are more prone to this problem. The problem is exacerbated if respondents have any feeling that their responses can be directly or even indirectly attributed to them personally. The effect of two of these behaviors on reliability is somewhat predictable. It has been demonstrated that different patterns of responses have differential effects on Cronbach's alpha coefficients. Acquiescent (or the opposite) responses inflate Cronbach's alpha. Central tendency bias has little effect on Cronbach's alpha. It is essentially impossible to determine the effect on alpha of social desirability responses, but it seems unlikely that the effect would be substantial.

Inferential Data Analysis of Likert Items and Scale Scores

One of the most controversial issues relates to how Likert scale data can be used for inferential group comparisons. At the item level, it is generally understood that the level of measurement is ordinal and that item-level comparisons should be analyzed using nonparametric methods, primarily tests based on the chi-square probability distribution. Tests for comparing frequency distributions of independent groups often use a chi-square test of independence. When comparing Likert item results in a dependent-group situation, such as a pretest–posttest arrangement, McNemar's test is recommended. The controversy arises when Likert scale data (either in the form of item means or sums) are being compared between or among groups. Some believe this scale value is still at best ordinal and recommend the use of a nonparametric test, such as the Mann–Whitney, Wilcoxon, or Kruskal–Wallis test, or the Spearman rank-order correlation when looking for variable relationships. Others are willing to accept the assumption, promulgated largely by Likert himself, that item distributions are close to normal and are thus additive, giving an approximate interval scale. This would justify the use of z tests, t tests, and analysis of variance for inferential group comparisons and the use of Pearson's r to examine variable relationships. Rasch modeling is often used as an approach for obtaining interval-scale estimates for use in inferential group comparisons if certain item characteristics are assumed. To the extent that the item score distributions depart from normality, this assumption has less viability and would tend to call for the use of nonparametric methods.

Use of Negatively Worded or Reverse-Worded Likert Stems

Although there are many controversies about the use of Likert items and scales, the one that seems most controversial is the use of reverse- or negatively worded Likert item stems. This has long been a recommended practice to guard against acquiescence, and many Likert item scholars still recommend it. It is interesting to note that Likert used some items with positive attitude stems and some with negative attitude stems in all four of his types of items. However, Likert provides no rationale for doing this in his classic work. Many researchers have challenged this practice as unnecessary in most attitude assessment settings and as a practice that actually reduces internal consistency score reliability. Several researchers have demonstrated that the practice can easily reduce Cronbach's alpha by at least 0.10. It has been suggested that reversing the Likert response sets for half of the items, while keeping all the stems worded in a positive direction, accomplishes the same purpose as using negatively worded Likert items.

Reliability

Even though Likert used split-half methods for estimating score reliability, in current practice Cronbach's alpha coefficient of internal consistency is used most of the time; for dichotomous items it reduces to the Kuder-Richardson 20 formula. Cronbach's alpha is sometimes defined as the mean of all possible split-half reliability coefficients.
[Figures from the Line Graph entry: two example line graphs, one plotting prevalence (%) against age group (55–59 to 75+) for females and males, and one plotting a count (number) against number of hours (2 to 62).]
Further Readings

Cleveland, W. S. (1985). The elements of graphing data. Pacific Grove, CA: Wadsworth.
Robbins, N. B. (2005). Creating more effective graphs. Hoboken, NJ: John Wiley.
Streiner, D. L., & Norman, G. R. (2007). Biostatistics: The bare essentials (3rd ed.). Shelton, CT: PMPH.

LISREL

The General LISREL Model

In general, LISREL estimates the unknown coefficients of a set of linear structural equations. A full LISREL model consists of two submodels: the measurement model and the structural equation
model. These models can be described by the following three equations:

Familiarity with LISREL matrices and their Greek representation is helpful to master this program fully.
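In the standard LISREL Greek-matrix notation (a conventional sketch; the entry's own display equations may use slightly different symbols), the structural equation for the latent variables and the two measurement equations for the observed indicators are:

```latex
% Structural model: latent endogenous variables (eta) regressed on
% themselves and on the latent exogenous variables (xi)
\eta = B\eta + \Gamma\xi + \zeta
% Measurement model for the endogenous indicators y
y = \Lambda_y \eta + \varepsilon
% Measurement model for the exogenous indicators x
x = \Lambda_x \xi + \delta
```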
Coverage

Coverage refers to the amount of literature on which the review is based. At one extreme of this dimension is exhaustive coverage, which uses all available literature. A similar approach is the exhaustive review with selective citation, in which the reviewer uses all available literature to draw conclusions but cites only a sample of this literature when writing the review. Moving along this dimension, a review can be representative, such that the reviewer bases conclusions on, and cites, a subset of the existing literature believed to be similar to the larger body of work. Finally, at the far end of this continuum is the literature review of the most central works.

Organization

The most common organization is conceptual, in which the reviewer organizes literature around specific sets of findings or questions. However, historical organizations are also useful, in that they provide a perspective on how knowledge or practices have changed across time. Methodological organizations, in which findings are arranged according to methodological aspects of the reviewed studies, are another possible way of organizing literature reviews.

Method of Synthesis

Literature reviews also vary in terms of how conclusions are drawn, with the endpoints of this continuum being qualitative versus quantitative. Qualitative reviews, also called narrative reviews, are those in which reviewers draw conclusions based on their subjective evaluation of the literature. Vote-counting methods, which might be considered intermediate on the qualitative versus quantitative dimension, involve tallying the number of studies that find a particular effect and basing conclusions on this tally. Quantitative reviews, sometimes also called meta-analyses, involve assigning numbers to the results of studies (representing an effect size) and then performing statistical analyses of these results to draw conclusions.

Audience

Literature reviews written to support an empirical study are often read by specialized scholars in one's own field. In contrast, many stand-alone reviews are read by those outside one's own field, so it is important that these be accessible to scholars from other fields. Reviews can also serve as a valuable resource for practitioners in one's field (e.g., psychotherapists and teachers) as well as policy makers and the general public, so it is useful if reviews are written in a manner accessible to educated laypersons. In short, the reviewer must consider the likely audiences of the review and adjust the level of specificity and technical detail accordingly.

All seven of these dimensions are important considerations when preparing a literature review. As might be expected, many reviews will have multiple levels of these dimensions (e.g., multiple goals directed toward multiple audiences). Tendencies exist for co-occurrence among dimensions; for example, quantitative reviews typically focus on research outcomes, cover the literature exhaustively, and are directed toward specialized scholars. At the same time, consideration of these dimensions suggests the wide range of possibilities available in preparing literature reviews.

Scientific Standards for Literature Reviews

Given the importance of literature reviews, it is important to follow scientific standards in preparing them. Just as empirical research follows certain practices to ensure validity, we can consider how various decisions affect the quality of conclusions drawn in a literature review. This section follows Harris Cooper's organization by describing considerations at five stages of the literature review process.

Problem Formulation

As in any scientific endeavor, the first stage of a literature review is to formulate a problem. Here, the central considerations involve the questions that the reviewer wishes to answer, the constructs of interest, and the population about which conclusions are drawn. A literature review can only answer questions about which prior work exists.
For instance, to draw conclusions about causality, the reviewer will need to rely on experimental (or perhaps longitudinal) studies; concurrent naturalistic studies would not provide answers to this question. Defining the constructs of interest poses two potential complications: The existing literature might use different terms for the same construct, or it might use similar terms to describe different constructs. The reviewer therefore needs to define the constructs of interest clearly when planning the review. Similarly, the reviewer must consider which samples will be included in the literature review, for instance, deciding whether studies of unique populations (e.g., prison or psychiatric settings) should be included. The advantages of a broad approach (in terms of constructs and samples) are that the conclusions of the review will be more generalizable and might allow for the identification of important differences among studies; the advantages of a narrow approach are that the literature will likely be more consistent and the quantity of literature that must be reviewed is smaller.

Literature Retrieval

When obtaining literature relevant to the review, it is useful to conceptualize the literature included as a sample drawn from a population of all possible works. This conceptualization highlights the importance of obtaining an unbiased sample of literature for the review. If the literature reviewed is not exhaustive, or at least representative, of the extant research, then the conclusions drawn might be biased. One common threat to all literature reviews is publication bias, or the file drawer problem: studies that fail to find significant effects (or that find counterintuitive effects) are less likely to be published and, therefore, less likely to be included in the review. Reviewers should attempt to obtain unpublished studies, which will either counter this threat or at least allow the reviewer to evaluate the magnitude of this bias (e.g., by comparing effects from published vs. unpublished studies). Another threat is that reviewers typically must rely on literature written in a language they know (e.g., English); this excludes literature written in other languages and therefore might exclude most studies conducted in other countries. Although it would be impractical for the reviewer to learn every language in which relevant literature might be written, the reviewer should be aware of this limitation and of how it affects the literature on which the review is based. To ensure the transparency of a literature review, the reviewer should report the means by which potentially relevant literature was searched and obtained.

Inclusion Criteria

Deciding which works should inform the review involves reading the literature obtained and drawing conclusions regarding relevance. Obvious reasons to exclude works include the investigation of constructs or samples that are irrelevant to the review (e.g., studies involving animals when one is interested in human behavior) or that do not provide information relevant to the review (e.g., treating the construct of interest only as a covariate). Less obvious decisions need to be made about works of questionable quality or with methodological features different from those of other studies. Including such works might improve the generalizability of the review on the one hand, but it might contaminate the literature basis or distract focus on the other. Decisions at this stage will typically involve refining the problem formulation stage of the review.

Interpretation

The most time-consuming and difficult stage is analyzing and interpreting the literature. As mentioned, several approaches to drawing conclusions exist. Qualitative approaches involve the reviewer performing some form of internal synthesis; as such, they are prone to reviewer subjectivity. At the same time, qualitative approaches are the only option when reviewing nonempirical literature (e.g., theoretical propositions), and the simplicity of qualitative decision making is adequate for many purposes. A more rigorous approach is vote counting, in which the reviewer tallies studies into different categories (e.g., significant versus nonsignificant results) and bases decisions on either the preponderance of evidence (informal vote counting) or statistical procedures
(comparing the number of studies finding significant results with the number expected by chance). Although vote-counting methods reduce subjectivity relative to qualitative approaches, they are limited in that the conclusions reached involve only whether there is an effect, rather than the magnitude of the effect. The best way to draw conclusions from empirical literature is through quantitative, or meta-analytic, approaches. Here, the reviewer codes effect sizes for the studies and then applies statistical procedures to evaluate the presence, magnitude, and sources of differences of these effects across studies.

Presentation

Although presentation formats are highly discipline specific (and, therefore, the best way to learn how to present reviews is to read reviews in one's area), a few guidelines are universal. First, the reviewer should be transparent about the review process. Just as empirical works are expected to present sufficient detail for replication, a literature review should provide sufficient detail for another scholar to find the same literature, include the same works, and draw the same conclusions. Second, it is critical that the written report answer the original questions that motivated the review, or at least describe why such answers cannot be reached and what future work is needed to provide them. A third guideline is to avoid study-by-study listing. A good review synthesizes, not merely lists, the literature (it is useful to consider that a phone book contains a lot of information but is not very informative, or interesting, to read). Reviewers should avoid "Author A found . . . Author B found . . ." writing. Effective presentation is critical in ensuring that the review has an impact on one's field.

Noel A. Card

See also Effect Size, Measures of; File Drawer Problem; Meta-Analysis

Further Readings

Bem, D. J. (1995). Writing a review article for Psychological Bulletin. Psychological Bulletin, 118, 172–177.
Card, N. A. (in press). Meta-analysis: Quantitative synthesis of social science research. New York: Guilford.
Cooper, H. (1998). Synthesizing research: A guide for literature reviews (3rd ed.). Thousand Oaks, CA: Sage.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Pan, M. L. (2008). Preparing literature reviews: Qualitative and quantitative approaches (3rd ed.). Glendale, CA: Pyrczak Publishing.
Rosenthal, R. (1995). Writing meta-analytic reviews. Psychological Bulletin, 118, 183–192.

LOGIC OF SCIENTIFIC DISCOVERY, THE

The Logic of Scientific Discovery first presented Karl Popper's main ideas on methodology, including falsifiability as a criterion for science and the representation of scientific theories as logical systems from which other results followed by pure deduction. Both ideas are qualified and extended in later works by Popper and his follower Imre Lakatos.

Popper was born in Vienna, Austria, in 1902. During the 1920s, he was an early and enthusiastic participant in the philosophical movement called the Vienna Circle. After the rise of Nazism, he fled Austria for New Zealand, where he spent World War II. In 1949, he was appointed Professor of Logic and Scientific Method at the London School of Economics (LSE), where he remained for the rest of his teaching career. He was knighted by Queen Elizabeth II in 1965. Although he retired in 1969, he continued a prodigious output of philosophical work until his death in 1994. He was succeeded at LSE by his protégé Lakatos, who extended his methodological work in important ways.

The Logic of Scientific Discovery's central methodological idea is falsifiability. The Vienna Circle philosophers, or logical positivists, had proposed, first, that all meaningful discourse was completely verifiable and, second, that science was coextensive with meaningful discourse. Originally,
they meant by this that a statement should be considered meaningful, and hence scientific, if and only if it was possible to show that it was true, either by logical means or on the basis of the evidence of the senses. Popper became the most important critic of their early work. He pointed out that scientific laws, which are represented as unrestricted or universal generalizations such as "all planets have elliptical orbits" (Kepler's First Law), are not verifiable by any finite set of sense observations and thus cannot be counted as meaningful or scientific. To escape this paradox, Popper substituted falsifiability for verifiability as the key logical relation of scientific statements. He thereby separated the question of meaning from the question of whether a claim was scientific. A statement could be considered scientific if it could, in principle, be shown to be false on the basis of sensory evidence, which in practice meant experiment or observation. "All planets have elliptical orbits" could be shown to be false by finding a planet with an orbit that was not an ellipse. This has never happened, but if it did, the law would be counted as false, and such a discovery might be made tomorrow. The law is scientific because it is falsifiable, although it has not actually been falsified. Falsifiability requires only that the conditions under which a statement would be deemed false are specifiable; it does not require that they have actually come about. However, when a statement is falsified, Popper assumed scientists would respond with a new and better conjecture. Scientific methodology should not attempt to avoid mistakes but rather, as Popper famously put it, should try to make its mistakes as quickly as possible. Scientific progress results from this sequence of conjectures and refutations, with each new conjecture required to specify precisely the grounds on which it would fail, so as to satisfy the principle of falsifiability. Popper's image of science achieved great popularity among working scientists, and he was acknowledged by several Nobel prize winners (including Peter Medawar, John Eccles, and Jacques Monod).

In The Logic of Scientific Discovery, Popper, like the logical positivists, presented the view that scientific theories ideally took the form of logically independent and consistent systems of axioms from which (with the addition of initial conditions) all other scientific statements followed by logical deduction. However, in an important later paper ("The Aim of Science," reprinted in Objective Knowledge, chapter 5), Popper pointed out that there was no deductive link between, for example, Newton's laws and the original statements of Kepler's laws of planetary motion or Galileo's law of fall. The simple logical model of science offered by Popper and later logical positivists (e.g., Carl Hempel and Paul Oppenheim) therefore failed for some of the most important intertheoretical relations in the history of science.

An additional limitation of falsifiability as presented in The Logic of Scientific Discovery was the issue of ad hoc hypotheses. Suppose, as actually happened between 1821 and 1846, a planet is observed that seems to have an orbit that is not an ellipse. The response of scientists at the time was not to treat Kepler's law, or the Newtonian Laws of Motion and Universal Gravitation from which it was derived, as falsified. Instead, they deployed a variety of auxiliary hypotheses ad hoc, which had the effect of explaining away the discrepancy between Newton's laws and the observations of Uranus; this, in turn, led to the discovery of the planet Neptune. Cases like this suggested that any claim could be permanently insulated from falsifying evidence by introducing an ad hoc hypothesis every time negative evidence appeared. Indeed, this could even be done if negative evidence appeared against the ad hoc hypothesis itself; another ad hoc hypothesis could be introduced to explain the failure, and so on ad infinitum. Arguments of this type raised the possibility that falsifiability might be an unattainable goal, just as verifiability had been for the logical positivists.

Two general responses to these difficulties appeared. In 1962, Thomas Kuhn argued in The Structure of Scientific Revolutions that falsification occurred only during periods of cumulative normal science, whereas the more important noncumulative changes, or revolutions, depended on factors that went beyond failures of observation or experiment. Kuhn made extensive use of historical evidence in his arguments. In reply, Lakatos shifted the unit of appraisal in scientific methodology from individual statements of law or theory to a historical sequence of successive theories called a research program. Such programs were to be appraised according to whether new additions, ad hoc or otherwise, increased the overall explanatory scope of the program (and especially covered previously unexplained facts) while retaining the
successful content of earlier theories.

In addition to The Logic of Scientific Discovery, Popper's main ideas are presented in the essays collected in Conjectures and Refutations and Objective Knowledge. Two books on political philosophy, The Open Society and Its Enemies and The Poverty of Historicism, were also important in establishing his reputation. A three-volume Postscript to the Logic of Scientific Discovery, covering, respectively, realism, indeterminism, and quantum theory, appeared in 1982.

Peter Barker

See also Hypothesis; Scientific Method; Significance Level, Concept of

Further Readings

Hempel, C. G., & Oppenheim, P. (1948). Studies in the logic of explanation. Philosophy of Science, 15, 135–175.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, UK: Cambridge University Press.
Popper, K. R. (1945). The open society and its enemies. London: George Routledge and Sons.
Popper, K. R. (1957). The poverty of historicism. London: Routledge and Kegan Paul.
Popper, K. R. (1959). The logic of scientific discovery (Rev. ed.). New York: Basic Books.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. Oxford, UK: Clarendon Press.
Popper, K. R. (1982). Postscript to the logic of scientific discovery (W. W. Bartley III, Ed.). Totowa, NJ: Rowman and Littlefield.

LOGISTIC REGRESSION

Logistic regression is a statistical technique used in research designs that call for analyzing the relationship of an outcome, or dependent variable, to one or more predictors, or independent variables, when the dependent variable is either (a) dichotomous, having only two categories, for example, whether one uses illicit drugs (no or yes); (b) unordered polytomous, a nominal scale variable with three or more categories, for example, political party identification (Democrat, Republican, other, or none); or (c) ordered polytomous, an ordinal scale variable with three or more categories, for example, level of education completed (e.g., less than elementary school, elementary school, high school, an undergraduate degree, or a graduate degree). Here, the basic logistic regression model for dichotomous outcomes is examined, noting its extension to polytomous outcomes and its conceptual roots in both loglinear analysis and the general linear model. Next, consideration is given to methods for assessing the goodness of fit and predictive utility of the overall model, and to the calculation and interpretation of logistic regression coefficients and associated inferential statistics used to evaluate the importance of individual predictors in the model. The discussion throughout this entry assumes an interest in prediction, regardless of whether causality is implied; hence, the language of "outcomes" and "predictors" is preferred to that of "dependent" and "independent" variables.

The equation for the logistic regression model with a dichotomous outcome is

logit(Y) = α + β1X1 + β2X2 + … + βKXK,

where Y is the dichotomous outcome; logit(Y) is the natural logarithm of the odds of Y, a transformation of Y discussed in more detail momentarily; and there are k = 1, 2, …, K predictors Xk with associated coefficients βk, plus a constant or intercept α, which represents the value of logit(Y) when all of the Xk are equal to zero. If the two categories of the outcome are coded 1 and 0, respectively, and P1 is the probability of being in the category coded 1 and P0 is the probability of being in the category coded 0, then the odds of being in category 1 are

P1/P0 = P1/(1 − P1)

(because the probability of being in one category is one minus the probability of being in the other category). Logit(Y) is the natural logarithm of the odds,

ln[P1/(1 − P1)],

where ln represents the natural logarithm transformation.
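These definitions translate directly into code. Below is a minimal sketch of the logit transformation and its inverse (the coefficient and predictor values are hypothetical, for illustration only), showing how a linear predictor is converted back into a predicted probability:

```python
import math

def logit(p):
    # Natural log of the odds P1/(1 - P1), as in the entry's equation.
    return math.log(p / (1 - p))

def inverse_logit(x):
    # Back-transform a linear predictor alpha + sum(beta_k * x_k)
    # into a probability strictly between 0 and 1.
    return 1 / (1 + math.exp(-x))

# Hypothetical intercept and coefficients for two predictors.
alpha, betas = -1.5, [0.8, 0.4]
x = [2.0, 1.0]
linear_predictor = alpha + sum(b * xi for b, xi in zip(betas, x))
p1 = inverse_logit(linear_predictor)  # predicted P(category 1)
```

Because the inverse logit is bounded between 0 and 1, predicted probabilities cannot fall outside the possible range, unlike predictions from linear regression applied directly to a 0/1 outcome.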
Polytomous Logistic Regression Models

When the outcome is polytomous, logistic regression can be implemented by splitting the outcome into a set of dichotomous variables. This is done by means of contrasts, which identify a reference category (or set of categories) with which to compare each of the other categories (or sets of categories). For a nominal outcome, the most commonly used model is the baseline category logit model. In this model, the outcome is divided into a set of dummy variables, each representing one of the categories of the outcome, with one of the categories designated as the reference category, in the same way that dummy coding is used for nominal predictors in linear regression. If there are M categories in the outcome, then

logit(Ym) = ln(Pm/P0) = αm + β1,mX1 + β2,mX2 + … + βK,mXK,

where P0 is the probability of being in the reference category and Pm is the probability of being in category m = 1, 2, …, M − 1, given that the case is either in category m or in the reference category. A total of M − 1 equations, or logit functions, are thus estimated, each with its own intercept αm and logistic regression coefficients βk,m representing the relationship of the predictors to logit(Ym).

For ordinal outcomes, the situation is more complex, and several different contrasts might be used. In the adjacent category logit model, for example, each category is contrasted only with the single category preceding it. In the cumulative logit model, (a) for the first logit function, the first category is contrasted with all of the categories following it; then (b) for the second logit function, the first two categories are contrasted with all of the categories following them; and so forth, until, for the last (M − 1)th logit function, all the categories preceding the last are contrasted with the last category. Other contrasts are also possible. The cumulative logit model is the model most commonly used in logistic regression analysis for an ordinal outcome, and it has the advantage over other contrasts that splitting or combining categories (representing more precise or cruder ordinal measurement) should not affect estimates for categories other than those actually split or combined. This property is not characteristic of other ordinal contrasts. It is commonly assumed in ordinal logistic regression that only the intercepts (or thresholds, which are similar to intercepts) differ across the logit functions. The ordinal logistic regression equation can be written (here in the format using intercepts instead of thresholds) as

logit(Ym) = αm + β1X1 + β2X2 + … + βKXK,

where αm = α1, α2, …, αM−1 are the intercepts associated with the M − 1 logit functions, but β1, β2, …, βK are assumed to be identical across the M − 1 logit functions. This assumption can be tested and, if necessary, modified.

Logistic Regression, Loglinear Analysis, and the General Linear Model

Logistic regression can be derived from two different sources: the general linear model, via linear regression, and the logit model in loglinear analysis. Linear regression is used to analyze the relationship of an outcome to one or more predictors when the outcome is a continuous interval or ratio scale variable. Linear regression is used extensively in the analysis of outcomes with a natural metric, such as kilograms, dollars, or numbers of people, where the unit of measurement is such that it makes sense to talk about larger or smaller differences between cases (the difference between the populations of France and Germany is smaller than the difference between the populations of France and China). Usually, it also makes sense to talk about one value being some number of times larger than another ($10,000 is twice as much as $5,000); these comparisons are not applicable to the categorical outcome variables for which logistic regression is used. The equation for linear regression is

Y = α + β1X1 + β2X2 + … + βKXK,
732 Logistic Regression
and the only difference from the logistic regression equation is that the outcome in linear regression is Y instead of logit(Y). The coefficients βk and intercept α in linear regression are most commonly estimated using ordinary least-squares (OLS) estimation, although other methods of estimation are possible.

For OLS estimation and for statistical inferences about the coefficients, certain assumptions are required, and if the outcome is a dichotomy (or a polytomous variable represented as a set of dichotomies) instead of a continuous interval/ratio variable, several of these assumptions are violated. For a dichotomous outcome, the predicted values might lie outside the range of possible values (suggesting probabilities greater than one or less than zero), especially when there are continuous interval or ratio scale predictors in the model. Inferential statistics are typically incorrect because of heteroscedasticity (unequal residual variances for different values of the predictors) and non-normal distribution of the residuals. It is also assumed that the relationship between the outcome and the predictors is linear; however, in the general linear model, it is often possible to linearize a nonlinear relationship by using an appropriate nonlinear transformation. For example, in research on income (measured in dollars), it is commonplace to use the natural logarithm of income as an outcome, because the relationship of income to its predictors tends to be nonlinear (specifically, logarithmic). In this context, the logit transformation is just one of many possible linearizing transformations.

An alternative to the use of linear regression to analyze dichotomous and polytomous categorical outcomes is logit analysis, which is a special case of loglinear analysis. In loglinear analysis, it is assumed that the variables are categorical and can be represented by a contingency table with as many dimensions as there are variables, with each case located in one cell of the table, corresponding to the combination of values it has on all of the variables. In loglinear analysis, no distinction is made between outcomes and predictors, but in logit analysis, one variable is designated as the outcome, and the other variables are treated as predictors. Each unique combination of values of the predictors represents a covariate pattern. Logit model equations are typically presented in a format different from that used in linear regression and logistic regression, and loglinear and logit models are commonly estimated using iterative maximum likelihood (ML) estimation, in which one begins with a set of initial values for the coefficients in the model, examines the differences between observed and predicted values produced by the model (or some similar criterion), and uses an algorithm to adjust the estimates to improve the model. This process of estimation and adjustment of estimates is repeated in a series of steps (iterations) that end when, to some predetermined degree of precision, there is no change in the fit of the model, the coefficients in the model, or some similar criterion.

Logistic regression can be viewed either as a special case of the general linear model involving the logit transformation of the outcome or as an extension of the logit model to incorporate continuous as well as categorical predictors. The basic form of the logistic regression equation is the same as for the linear regression equation, but the outcome logit(Y) has the same form as the outcome in logit analysis. The use of the logit transformation ensures that predicted values cannot exceed observed values (for an individual case, the logit of Y is either positive or negative infinity, +∞ or −∞), but it also makes it impossible to estimate the coefficients in the logistic regression equation using OLS. Estimation for logistic regression, as for logit analysis, requires an iterative technique, most often ML, but other possibilities include iteratively reweighted least squares, with roots in the general linear model, or some form of quasi-likelihood or partial likelihood estimation, which might be employed when data are clustered or nonindependent. Common instances of nonindependent data include multilevel analysis, complex sampling designs (e.g., multistage cluster sampling), and designs involving repeated measurement of the same subjects or cases, as in longitudinal research. Conditional logistic regression is a technique for analyzing related samples, for example, in matched case-control studies, in which, with some minor adjustments, the model can be estimated using ML.
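The iterative ML cycle described above (start from initial values, compare observed and predicted values, adjust the estimates, repeat until the coefficients stop changing) can be sketched with a Newton-Raphson fit of a one-predictor logistic regression. This is a minimal illustration with invented toy data, not the exact algorithm any particular package uses.

```python
import math

def fit_logistic(x, y, tol=1e-8, max_iter=25):
    """Newton-Raphson ML estimation for logit(Y) = a + b*x.

    Starts from a = b = 0, then repeatedly adjusts the estimates
    until, to a predetermined precision, they no longer change."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score (gradient) of the log likelihood: observed minus predicted
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Observed information (negative Hessian), a 2 x 2 matrix
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve the 2 x 2 Newton system for the adjustment step
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b

# Hypothetical data: the outcome is more likely at higher x
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]
a, b = fit_logistic(x, y)
```

At convergence the score equations are satisfied: the predicted probabilities reproduce the observed totals, which is the ML analog of OLS residuals summing to zero.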
Assumptions of Logistic Regression

Logistic regression assumes that the functional form of the equation is correct, and hence, the predictors Xk are linearly and additively related to logit(Y), but variables can be transformed to adjust for nonadditivity and nonlinearity (e.g., nonlinearly transformed predictors or interaction terms). It also assumes that each case is independent of all the other cases in the sample, or when cases are not independent, adjustments can be made in either the estimation procedure or the calculation of standard errors (or both) to adjust for the nonindependence. Like linear regression, logistic regression assumes that the variables are measured without error, that all relevant predictors are included in the analysis (otherwise the logistic regression coefficients might be biased), and that no irrelevant predictors are included in the analysis (otherwise standard errors of the logistic regression coefficients might be inflated). Also as in linear regression, no predictor may be perfectly collinear with one or more of the other predictors in the model. Perfect collinearity means that a predictor is completely determined by or predictable from one or more other predictors, and when perfect collinearity exists, an infinite number of solutions is available that maximize the likelihood in ML estimation or minimize errors of prediction more generally. Logistic regression also assumes that the errors in prediction have a binomial distribution, but when the number of cases is large, the binomial distribution approximates the normal distribution. Various diagnostic statistics have been developed and are readily available in existing software to detect violations of assumptions and other problems (e.g., outliers and influential cases) in logistic regression.

Goodness of Fit and Accuracy of Prediction

In logistic regression using ML (currently the most commonly used method of estimation), in place of the sum of squares statistics used in linear regression, there are log likelihood statistics, which are calculated based on observed and predicted probabilities of being in the respective categories of the outcome variable. When multiplied by −2, the difference between two log likelihood statistics has an approximate chi-square distribution for sufficiently large samples involving independent observations. One can construct −2 log likelihood statistics (here and elsewhere designated as D) for (a) a model with no predictors, D0, and (b) the tested model, the model for which the coefficients are actually estimated, DM. DM, which is sometimes called the deviance statistic, has been used as a goodness-of-fit statistic, but it has somewhat fallen out of favor because of concerns with alternative possible definitions for the saturated model (depending on whether individual cases or covariate patterns are treated as the units of analysis), and the concern that, for data in which there are few cases per covariate pattern, DM does not really have a chi-square distribution. The Hosmer-Lemeshow goodness-of-fit index is constructed by grouping the data, typically into deciles, based on predicted values of the outcome. This technique is applicable even with few cases per covariate pattern. There seems to be a trend away from concern with goodness of fit, however, to focus instead on the model chi-square statistic,

GM = D0 − DM,

which compares the tested model to the model with no predictors. GM generally does follow a chi-square distribution in large samples and it is analogous to the multivariate F statistic in linear regression and analysis of variance. GM provides a test of the statistical significance of the overall model in predicting the outcome. An alternative to GM for models not estimated using ML is the multivariate Wald statistic.

There is a substantial literature on coefficients of determination for logistic regression, in which the goal is to find a measure analogous to R² in linear regression. When the concern is with how close the predicted probabilities of category membership are to observed category membership (quantitative prediction), two promising options are the likelihood ratio R² statistic,

R²L = GM/D0,

which is applicable specifically when ML estimation is used, and the OLS R² statistic itself, which is calculated by squaring the correlation between observed values (coded zero and one) and the predicted probabilities of being in category 1. Advantages of R²L include the following: (a) it is based on the quantity actually being maximized in ML estimation, (b) it seems to be uncorrelated with the base rate (the percentage of cases in category 1), and (c) it can be calculated for polytomous as well as dichotomous outcomes.
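The relationships among D0, DM, GM, and R²L can be illustrated numerically. The outcome values and predicted probabilities below are invented for the example; in practice the probabilities would come from a fitted logistic regression.

```python
import math

def neg2_log_likelihood(y, p):
    """-2 log likelihood (D) from observed 0/1 outcomes and
    predicted probabilities of being in category 1."""
    return -2.0 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                      for yi, pi in zip(y, p))

y = [1, 1, 1, 0, 0]                   # observed outcomes (hypothetical)
p_model = [0.9, 0.8, 0.7, 0.3, 0.2]   # predictions from some fitted model

# D0: the no-predictor model predicts the base rate for every case
base = sum(y) / len(y)
d0 = neg2_log_likelihood(y, [base] * len(y))
dm = neg2_log_likelihood(y, p_model)  # deviance of the tested model

gm = d0 - dm      # model chi-square, df = number of predictors
r2_l = gm / d0    # likelihood ratio R-squared
```

Because the tested model's predictions track the outcomes better than the base rate, DM is smaller than D0, GM is positive, and R²L falls between 0 and 1.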
Other R² analogs have been proposed but have various problems that include correlation with the base rate (to the extent that the base rate itself seems to determine the calculated accuracy of prediction), having no reasonable value for perfect prediction or for perfectly incorrect prediction, or being limited to dichotomous outcomes.

Alternatively, instead of being concerned with predicted probabilities, one might be concerned with how accurately cases are qualitatively classified into the categories of the outcome by the predictors (qualitative prediction). For this purpose, there is a family of indices of predictive efficiency, designated lambda-p, tau-p, and phi-p, that are specifically applicable to qualitative prediction, classification, and selection tables (regardless of whether they were generated by logistic regression or some other technique), as opposed to contingency tables more generally. Finally, none of the aforementioned indices of predictive efficiency (or R² analogs) takes into account the ordering in an ordered polytomous outcome, for which one would naturally consider ordinal measures of association. Kendall's tau-b is an ordinal measure of association that, when squared (τb²), has a proportional reduction in error (PRE) interpretation, and it seems most promising for use with ordinal outcomes in logistic regression. Tests of statistical significance can be computed for all these coefficients of determination.

Unstandardized and Standardized Logistic Regression Coefficients

Interpretation of unstandardized logistic regression coefficients (bk, the estimated value of βk) is straightforward and parallel to the interpretation of unstandardized coefficients in linear regression: A one-unit increase in Xk is associated with a bk increase in logit(Y) (not in Y itself). If we raise the base of the natural logarithm, e = 2.718 . . . , to the power bk, we obtain the odds ratio, here designated ωk, which is sometimes presented in place of or in addition to bk and can be interpreted as indicating that a one-unit increase in Xk multiplies the odds of being in category 1 by ωk. Both bk and ωk convey exactly the same information, just in a different form. There are several possible tests of statistical significance for unstandardized logistic regression coefficients. The univariate Wald statistic can be calculated either as the ratio of the logistic regression coefficient to its standard error (SE),

bk/SE(bk),

which has an approximate normal distribution, or [bk/SE(bk)]², which has an approximate chi-square distribution. The Wald statistic, however, tends to be problematic for large bk, tending to fail to reject the null hypothesis when the null hypothesis is false (Type II error), but it might still be the best available option when ML is not used to estimate the model. Alternatives include the score statistic and the likelihood ratio statistic (the latter being the difference in DM with and without Xk in the equation). When ML estimation is used, the likelihood ratio statistic, which has a chi-square distribution and applies to both bk and ωk, is generally the preferred test of statistical significance for bk and ωk.

Unless all predictors are measured in exactly the same units, neither bk nor ωk clearly indicates whether one variable has a stronger impact on the outcome than another. Likewise, the statistical significance of bk or ωk tells us only how sure we are that a relationship exists, not how strong the relationship is. In linear regression, to compare the substantive significance (strength of relationship, which does not necessarily correspond to statistical significance) of predictors measured in different units, we often rely on standardized regression coefficients. In logistic regression, there are several alternatives for obtaining something like a standardized coefficient. A relatively quick and easy option is simply to standardize the predictors (standardizing the outcome does not matter, because it is the probability of being in a particular category of Y, not the actual value of Y, that is predicted in logistic regression). A slightly more complicated approach is to calculate

bk* = (bk)(sx)(R)/slogit(Y),

where bk* is the fully standardized logistic regression coefficient, bk is the unstandardized logistic regression coefficient, sx is the standard deviation of the predictor Xk, R is the correlation between the observed value of Y and the predicted probability of being in category 1 of Y, slogit(Y) is the
standard deviation of the predicted values of logit(Y), and the quantity slogit(Y)/R represents the estimated standard deviation in the observed values of logit(Y) (which must be estimated, because the observed values are positive or negative infinity for any single case). The advantage to this fully standardized logistic regression coefficient is that it behaves more like the standardized coefficient in linear regression, including showing promise for use in path analysis with logistic regression. This technique is currently under development. Also, parallel to the use of OLS regression or more sophisticated structural equation modeling techniques in linear panel analysis, it is possible to use logistic regression in panel analysis; once one decides on an appropriate way to measure change in the linear panel analysis, the application of logistic regression is straightforward.

Logistic Regression and Its Alternatives

Alternatives to logistic regression include probit analysis, discriminant analysis, and models practically identical to the logistic regression model but with different distributional assumptions (e.g., complementary log-log or extreme value instead of logit). Logistic regression, however, has increasingly become the method most often used in empirical research. Its broad applicability to different types of categorical outcomes and the ease with which it can be implemented in statistical software algorithms, plus its apparent consistency with realistic assumptions about real-world empirical data, have led to the widespread use of logistic regression in the biomedical, behavioral, and social sciences.

Scott Menard

Further Readings

Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: John Wiley.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.
Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician, 54, 17–24.
Menard, S. (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, CA: Sage.
Menard, S. (2004). Six approaches to calculating standardized logistic regression coefficients. The American Statistician, 58, 218–223.
Menard, S. (2008). Panel analysis with logistic regression. In S. Menard (Ed.), Handbook of longitudinal research: Design, measurement, and analysis. San Francisco: Academic Press.
O'Connell, A. A. (2006). Logistic regression models for ordinal response variables. Thousand Oaks, CA: Sage.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705–724.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Simonoff, J. S. (1998). Logistic regression, categorical predictors, and goodness of fit: It depends on who you ask. The American Statistician, 52, 10–14.

LOGLINEAR MODELS

This entry provides a nontechnical description of loglinear models, which were developed to analyze multivariate cross-tabulation tables. Although a detailed exposition is beyond its scope, the entry describes when loglinear models are necessary, what these models do, how they are tested, and the more familiar extensions of binomial and multinomial logistic regression.
Many popular statistics assume the dependent or criterion variable is numeric (e.g., years of formal education). What can the analyst investigating a nominal dependent variable do? There are several techniques for investigating a nominal dependent variable, many of which are discussed in the next section. (Those described in this entry can also be used with ordinal dependent variables. The categories of an ordinal variable can be rank ordered from highest to lowest, or most to least.)

One alternative is logistic regression. However, many analysts have learned binomial logistic regression using only dichotomous or "dummy" dependent variables scored 1 or 0. Furthermore, the uninitiated interpret logistic regression coefficients as if they were ordinary least squares (OLS) regression coefficients. A second analytic possibility uses three-way cross-tabulation tables and control variables with nonparametric statistical measures. This venerable tradition of "physical" (rather than "statistical") control presents its own problems, as follows:

• Limited inference tests for potential three-variable statistical interactions.
• Limiting the analysis to an independent, dependent, and control variable.
• There is no "system" to test whether one variable affects a second indirectly through a third variable; for example, education usually influences income indirectly through its effects on occupational level.
• The three-variable model has limited utility for researchers who want to compare several causes of a phenomenon.

A third option is the linear probability model (LPM) for a dependent dummy variable scored 1 or 0. In this straightforward, typical OLS regression model, B coefficients are interpreted as raising or lowering the probability of a score of 1 on the dependent variable.

However, the LPM, too, has several problems. The regression often suffers from heteroscedasticity, in which the dependent variable variance depends on scores of the independent variable(s). The dependent variable variance is truncated (at a maximum of 0.25). The LPM can predict impossible values for the dependent variable that are larger than 1 or less than 0.

Thus the following dilemma: Many variables researchers would like to explain are non-numeric. Using OLS statistics to analyze them can produce nonsensical or misleading results. Some common methods taught in early statistics classes (e.g., three-way cross tabulations) are overly restrictive or lack tests of statistical significance. Other techniques (e.g., LPM) have many unsatisfactory outcomes.

Loglinear models were developed to address these issues. Although these models have a relatively long history in statistical theory, their practical application awaited the use of high-speed computers.

What Is a Loglinear Model?

Technically, a loglinear model is a set of specified parameters that generates a multivariate cross-tabulation table of expected frequencies or table cell counts. In the general cell frequency (GCF) loglinear model, interest centers on the joint and simultaneous distribution of several variables in the table cells. The focus includes relationships among independent variables as well as those between an independent and a dependent variable.

Table 1 is a simple four-cell (2 × 2) table using 2008 General Social Survey data (NORC at the University of Chicago), which is an in-person representative sample of the United States. Table 1 compares 1,409 male and female adults on the percentage who did or did not complete a high-school chemistry course.

Although men reported completing high-school chemistry more than women by 8%, these results could reflect sampling error (i.e., they are a "sample accident" not a "real" population sex difference).

Table 1  Respondent Completed High-School Chemistry Course by Sex

  Completed High-School               Sex
  Chemistry Course           Male            Female
  Yes                        55.5%           47.5%
  No                         44.5            52.5
  Total                      100.0% (869)    100.0% (540)
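The LPM's out-of-range predictions noted above are easy to demonstrate with a small OLS fit. The data below are invented for the example.

```python
# Hypothetical data: a 0/1 outcome and one numeric predictor
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Ordinary least squares for y = a + b*x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# At the extremes, the LPM's fitted "probabilities" escape [0, 1]
low = a + b * 1     # below 0
high = a + b * 10   # above 1
```

Here the fitted line gives a negative "probability" at x = 1 and a "probability" above 1 at x = 10, the impossible predictions the text describes.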
The loglinear analyst compares the set of generated or expected table cell frequencies with the set of observed table cell counts. If the two sets of cell counts coincide overall within sampling error, the analyst says, "the model fits." If the deviations between the two sets exceed sampling error, the model is a "poor fit." Under the latter circumstances, the analyst must respecify the parameters to generate new expected frequencies that more closely resemble the observed table cell counts.

Even in a two-variable table, more than one outcome model is possible. One model, for example, could specify that in the American population, males and females completed a chemistry course at equal rates; thus, in this sample, we would predict that 52.5% of each sex completed high-school chemistry. This outcome is less complicated than one specifying sex differences: If females and males similarly completed high-school chemistry, then explaining a sex difference in chemistry exposure is unnecessary.

Table 2 shows expected frequency counts for this "no sex differences" model for each cell above the diagonal (with the actual observed frequencies in bold below it). Thus, when calculating expected frequencies, sample males and females were assumed to have 52.5% high-school chemistry completion rates. The table has been constrained to match the overall observed frequencies for gender and chemistry course exposure.

Table 2  Expected and Observed Frequencies for High-School Chemistry Course by Sex

  Completed High-School            Sex
  Chemistry Course         Male        Female      Total
  Yes                      456/483     284/257       740
  No                       413/386     256/283       669
  Total                    869         540          1409

  (expected/observed in each cell)

Comparing expected and observed cell counts, males have fewer expected than observed cases completing chemistry, whereas females have greater expected than observed cases completing high-school chemistry.

Statistically significant GCF coefficients increase or decrease the predicted (modeled) cell counts in a multivariate cross-tabulation table. Negative parameters mean fewer cell frequencies than would occur with a predicted no-effects model. Positive parameters mean higher cell counts than a no-effects model would predict.

Parameters in loglinear models (and by extension their cousins, logistic regression and logit models) are maximum likelihood estimators (MLEs). Unlike direct estimates such as OLS coefficients in linear regression, MLEs are solved through iterative, indirect methods. Reestimating MLEs, which can take several successively closer reestimate cycles, is why high-speed computers are needed.

A Basic Building Block of Loglinear Models: The Odds Ratio

The odds ratio is formed by the ratio of one cell count in a variable category to a second cell count for the same variable, for example, the U.S. ratio of males to females. Compared with the focus on the entire table in GCF models, this odds ratios subset of loglinear models focuses on categories of the dependent variable (categories in the entire table, which the loglinear model examines, are used to calculate the odds ratios, but the emphasis is on the dependent variable and less on the table as a whole). In Table 2, 740 adults completed a chemistry course and 669 did not, making the odds ratio or odds yes:no 740/669 or 1.11. An odds ratio of 1 would signify a 50–50 split on completing high-school chemistry for the entire sample.

In a binary odds, one category is designated as a "success" ("1"), which forms the odds numerator, and the second as a "failure" ("0"), which forms the ratio denominator. These designations do not signify any emotive meaning of "success." For example, in disease death rates, the researcher might designate death as a success and recovery as a failure. The odds can vary from zero (no successes) to infinity; they are undefined when the denominator is zero. The odds are fractional when there are more failures than successes; for example, if most people with a disease survive, then the odds would be fractional.

A first-order conditional odds considers one independent variable as well as scores on the dependent variable.
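The expected counts in Table 2 for the no-sex-difference (independence) model can be reproduced directly from the table margins: each expected cell count is the row total times the column total divided by n. A short sketch:

```python
# Observed counts from Table 2, as (male, female) pairs
observed = {"yes": (483, 257), "no": (386, 283)}
col_totals = (869, 540)              # males, females
row_totals = {"yes": 740, "no": 669}
n = 1409

# Independence model: expected count = row total * column total / n
expected = {row: tuple(row_totals[row] * col / n for col in col_totals)
            for row in observed}
```

Rounding the results gives the expected frequencies 456, 284, 413, and 256 shown above the diagonal in Table 2.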
The observed first-order chemistry conditional for males in Table 2 (yes:no) is 483/386 = 1.25, and for females it is 257/283 = 0.91. Here, the first-order conditional indicates that males more often completed chemistry than not; however, women failed to complete chemistry more often than they successfully completed it.

A second-order odds of 1 designates statistical independence; that is, changes in the distribution of the second variable are not influenced by any systematic change in the distribution of the first variable. Second-order odds ratios departing from 1 indicate two variables are associated. Here, the second-order odds (males:females) of the two first-order conditionals on the chemistry course is 1.25/0.91 or 1.37. Second-order odds greater than 1 indicate males completed a chemistry course more often than females, whereas fractional odds would signify women more often completed chemistry. By extension with more variables, third, fourth, or higher order odds ratios can be calculated.

The natural logarithm (base e, or Euler's number, abbreviated as ln) of the odds is a logit. The male first-order logit on completing chemistry is ln 1.25 or 0.223; for females, it is ln 0.91 = −0.094. Positive logits signify more "successes" than "failures," whereas negative logits indicate mostly failures. Unlike the odds ratio, logits are symmetric around zero. An overwhelming number of "failures" would produce a large negative logit. Logits of 0 indicate statistical independence.

Logits can be calculated on observed or modeled cell counts. Analysts more often work with logits when they have designated a dependent variable. Original model effects, including logits, are multiplicative and nonlinear. Because these measures were transformed through logarithms, they become additive and linear, hence the term loglinear.

Loglinear parameters for the cross-tabulation table can specify univariate distributions and two variable or higher associations. In the Table 2 independence model, parameters match the observed total case base (n) and both univariate distributions exactly. The first-order odds for females and males are set to be identical, forcing identical percentages on the chemistry question (here 52.5%) for both sexes. The second-order odds (i.e., the odds of the first-order odds) are set to 1 and its ln to zero, signifying no sex effect on high-school chemistry completion.

Testing Loglinear Models

Although several models might be possible in the same table of observed data, not all models will replicate accurately the observed table cell counts within sampling error. A simpler model that fits the data well (e.g., equal proportions of females and males completed high-school chemistry) is usually preferred to one more complex (e.g., males more often elect chemistry than females). Loglinear and logit models can have any number of independent variables; the interrelationships among those and with a dependent variable can quickly become elaborate. Statistical tests estimate how closely the modeled and observed data coincide.

Loglinear and logit models are tested for statistical significance with a likelihood ratio chi-square statistic, sometimes designated G² or L², distinguishing it from the familiar Pearson chi-square (χ²). This multivariate test of statistical significance is one feature that turns loglinear analysis into a system, which is comparable with an N-way analysis of variance or multiple regression as opposed to physical control and inspecting separate partial cross tabulations. One advantage of the logarithmic L² statistic is that it is additive: The L² can be partitioned with portions of it allocated to different pieces of a particular model to compare simpler with more complicated models on the same cross-tabulation table.

Large L²s imply sizable deviations between the modeled and observed data, which means the loglinear model does not fit the observed cell counts. The analyst then adds parameters (e.g., a sex difference on high-school chemistry) to the loglinear equation to make the modeled and observed cell frequencies more closely resemble each other. The most complex model, the fully saturated model, generates expected frequencies that exactly match the observed cell frequencies (irrespective of the number of variables analyzed). The saturated model always fits perfectly with an L² = 0.

The analyst can test whether a specific parameter or effect (e.g., a sex difference on high-school chemistry) must be retained so the model fits or whether it can be dropped.
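The odds, conditional odds, and L² for the chemistry table can be verified from the counts in Table 2. The computation below uses the standard likelihood ratio chi-square formula, L² = 2 Σ observed × ln(observed/expected), summed over cells; note that the text's 1.37, 0.223, and −0.094 come from the rounded odds 1.25 and 0.91, so the unrounded values differ slightly.

```python
import math

# Observed counts from Table 2
yes_m, yes_f, no_m, no_f = 483, 257, 386, 283
n = yes_m + yes_f + no_m + no_f            # 1409

# First-order conditional odds (yes:no) and their logits
odds_m = yes_m / no_m                      # about 1.25
odds_f = yes_f / no_f                      # about 0.91
second_order = odds_m / odds_f             # about 1.38 unrounded
logit_m = math.log(odds_m)                 # about 0.224
logit_f = math.log(odds_f)                 # about -0.096

# L2 for the independence ("no sex difference") model
yes_total, no_total = yes_m + yes_f, no_m + no_f
m_total, f_total = yes_m + no_m, yes_f + no_f
expected = {
    ("yes", "m"): yes_total * m_total / n, ("yes", "f"): yes_total * f_total / n,
    ("no", "m"): no_total * m_total / n,   ("no", "f"): no_total * f_total / n,
}
observed = {("yes", "m"): yes_m, ("yes", "f"): yes_f,
            ("no", "m"): no_m,   ("no", "f"): no_f}
L2 = 2 * sum(observed[c] * math.log(observed[c] / expected[c])
             for c in observed)
```

L² here is roughly 8.5 on 1 df, well beyond the .05 critical value of 3.84, so for these data the no-sex-difference model would be rejected and a sex-by-chemistry parameter retained.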
The parameter of interest is dropped from the equation; model cell counts are reestimated and the model is retested. If the resulting L² is large, the respecified model is a poor fit and the effect is returned to the loglinear equation. If the model with fewer effects fits, the analyst next examines which additional parameters can be dropped. In addition to the L², programs such as SPSS, an IBM product (formerly called PASW Statistics), report z scores for each specified parameter to indicate which parameters are probably necessary for the model.

Most models based on observed data are hierarchical; that is, more complex terms contain all lower order terms. For example, in the sex-chemistry four-cell table, a model containing a sex by chemistry association would also match the sex distribution, the chemistry distribution (matching the modeled univariate distribution on the chemistry course to the observed split in the variable), and the case base n to the observed sample size. Nonhierarchical models can result from some experimental designs (equal cases for each treatment group) or disproportionate sampling designs. Final models are described through two alternative terminologies. The saturated hierarchical model for Tables 1 and 2 could be designated as (A*B) or as {AB}. Either way, this hierarchical model would include parameters for n, the A variable, the B variable, and the AB association. For hierarchical models, lower order terms are assumed included in the more complex terms. For nonhierarchical models, the analyst must separately specify all required lower order terms.

Degrees of Freedom

L² statistics are evaluated for statistical significance with respect to their associated degrees of freedom (df). The df in loglinear models depend on the number of variables, the number of categories in each variable, and the effects the model specifies. The total df depends on the total number of cells in the table. In the saturated 2 × 2 (four-cell) table depicted in Tables 1 and 2, each variable has two categories. The case base (n) counts as 1 df.

Any time we "fix" a parameter, that is, specify that the expected and observed cell counts or variable totals must match for that variable or association, we lose df. A fully saturated model specifying a perfect match for all cells has zero df. The model fits but might be more complex than we would like.

Extensions and Uses of Loglinear Models

Logit and logistic regression models are derived from combinations of cells from an underlying GCF model. When the equations for cell counts are converted to odds ratios, terms describing the distributions of and associations among the independent variables cancel and drop from the equation, leaving only the split on the dependent variable and the effects of independent variables on the dependent variable. Because any variable, including a dependent variable, in a GCF model can have several categories, the dependent variable in logistic regression can also have several categories. This is multinomial logistic regression and it extends the more familiar binomial logistic regression.

Of the possible loglinear, logit, and logistic regression models, the GCF model allows the most flexibility, despite its more cumbersome equations. Associations among all variables, including independent variables, can easily be assessed. The analyst can test path-like causal models and check for indirect causal effects (mediators) and statistical interactions (moderators) more readily than in extensions of the GCF model, such as logit models.

Although the terminology and underlying premises of the loglinear model might be unfamiliar to many analysts, it provides useful ways of analyzing nominal dependent variables that could not be done otherwise. Understanding loglinear models also helps to describe correctly the logarithmic (logit) or multiplicative and exponentiated (odds ratios) extensions in logistic regression, giving analysts a systemic set of tools to understand the relationships among non-numeric variables.

Susan Carol Losh
the variables sex and the chemistry course each
have 21 df, and the association between sex and See also Likelihood Ratio Statistic; Logistic Regression;
the chemistry course has (21)*(21) or 1 df. Nonparametric Statistics; Odds Ratio
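The df bookkeeping described above can be sketched in a few lines of code. This is an illustrative helper of my own (not from the entry): each modeled effect contributes the product of (categories − 1) over its variables, one more parameter is used for the case base n, and the residual df is the cell count minus the parameter count.

```python
from math import prod

def loglinear_df(levels, terms):
    """Residual df for a hierarchical loglinear model.

    levels: number of categories per variable, e.g. [2, 2] for a 2 x 2 table.
    terms:  modeled effects as tuples of variable indices, e.g.
            [(0,), (1,)] for the independence model and
            [(0,), (1,), (0, 1)] for the saturated model.
    """
    cells = prod(levels)
    # 1 parameter for the case base n, plus prod(levels[i] - 1) per effect.
    params = 1 + sum(prod(levels[i] - 1 for i in term) for term in terms)
    return cells - params

# Saturated 2 x 2 model: n (1 df) + two one-variable terms (2 - 1 df each)
# + the association ((2 - 1) * (2 - 1) df) exhaust all 4 cells: zero df left.
saturated_df = loglinear_df([2, 2], [(0,), (1,), (0, 1)])   # 0
independence_df = loglinear_df([2, 2], [(0,), (1,)])        # 1
```

For a two-way table the independence model recovers the familiar (r − 1)(c − 1) df of the chi-square test of association.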
Longitudinal Design
former participants in a study of child development or of army inductees, who were assessed in childhood or young adulthood. Intensive measurement designs, such as daily diary studies and burst measurement designs, are based on multiple assessments within and across days, permitting the analysis of short-term variation and change.

A relevant design dimension associated with longitudinal studies is breadth of measurement. Given the high financial, time, and effort costs, it is not unusual for a longitudinal study to be multidisciplinary, or at least to attempt to address a significant range of topics within a discipline. While some studies have dealt with the cost issue by maintaining a very narrow focus, these easily represent the minority.

Levels of Analysis in Longitudinal Research

For understanding change processes, longitudinal studies provide many advantages relative to cross-sectional studies. Longitudinal data permit the direct estimation of parameters at multiple levels of analysis, each of which is complementary to understanding population and individual change with age. Whereas cross-sectional analyses permit between-person analysis of individuals varying in age, longitudinal follow-up permits direct evaluation of both between-person differences and within-person change.

Information available in cross-sectional and longitudinal designs can be summarized in terms of seven main levels of analysis and inferential scope (shown in italics in the next section). These levels can be ordered, broadly, in terms of their focus, ranging from the population to the individual. The time sampling generally decreases across levels of analysis, from decades for analysis of historical birth cohort effects to days, minutes, or seconds for assessment of highly variable within-person processes.

These levels of analysis are based on a combination of multiple-cohort, between-person, and within-person designs and analysis approaches, and all are represented by recent examples in developmental research. Between-cohort differences, which is the broadest level, can be examined to evaluate whether different historical contexts (e.g., indicated by birth cohort) have lasting effects on level and on rate of change in functioning in later life. Population average trends describe aggregate population change. Trends can be based on between-person differences in age-heterogeneous studies (although confounded with differences related to birth cohort and sample selection associated with attrition and mortality) or on direct estimates of within-person change in studies with longitudinal follow-up, in which case they can be made conditional on survival. Between-person age differences can be analyzed in terms of shared age-related variance in variance decomposition and factor models. This approach to understanding aging, however, confounds individual differences in age-related change with average age differences (i.e., between-person age trends), cohort influences, and mortality selection.

Longitudinal models permit the identification of individual differences in rates of change over time, which avoids making assumptions of ergodicity, that is, that age differences between individuals and age changes within individuals are equivalent. In these models, time can be structured in many alternative ways. It can be defined as time since an individual entered the study, time since birth (i.e., chronological age), or time until or since occurrence of a shared event such as retirement or diagnosis of disease. Elaboration of the longitudinal model permits estimation of association among within-person rates of change in different outcomes, in other words, using multivariate associations among intercepts and change functions to describe the interdependence of change functions. In shorter term longitudinal designs, researchers have emphasized within-person variation as an outcome and have examined whether individuals who display greater variability relative to others exhibit this variation generally across different tasks. Within-person correlations (i.e., coupling or dynamic factor analysis) are based on the analysis of residuals (after separating intraindividual means and trends) and provide information regarding the correlation of within-time variation in functioning across variables. Each level of analysis provides complementary information regarding population and individual change, and the inferences and interpretations possible from any single level of analysis have distinct and delimited ramifications for understanding developmental and aging-related change.
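The alternative ways of structuring time mentioned in this section can be made concrete with a small sketch (the variable names and the example years are illustrative, not from the entry):

```python
def time_metrics(birth_year, entry_year, event_year, assessment_year):
    """Three alternative 'time' codings for a single assessment occasion."""
    return {
        "time_in_study": assessment_year - entry_year,      # since study entry
        "chronological_age": assessment_year - birth_year,  # since birth
        "time_to_event": assessment_year - event_year,      # negative = before event
    }

# A participant born in 1940 who entered the study in 2000, was assessed
# in 2003, and retired in 2005 (a shared event of interest):
m = time_metrics(birth_year=1940, entry_year=2000,
                 event_year=2005, assessment_year=2003)
# time_in_study = 3, chronological_age = 63, time_to_event = -2
```

The same occasion thus yields different time axes, and the choice of axis changes what "rate of change" in a longitudinal model refers to.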
Considerations for the Design of Longitudinal Studies

The levels of analysis described previously correspond roughly to different temporal and historical (i.e., birth cohort) sampling frames and range from very long to potentially very short intervals of assessment. The interpretation, comparison, and generalizability of parameters derived from different temporal samplings must be carefully considered and require different types of designs and measurements. The temporal characteristics of change and variation must be taken into account, as different sampling intervals will generally lead to different results requiring different interpretations for both within- and between-person processes. For example, correlations between change and variability over time across outcomes will likely be different for short temporal intervals (minutes, hours, days, or weeks) in contrast to correlations among rates of change across years, the typical intervals of many longitudinal studies on aging.

Measurement interval is also critical for the prediction of outcome variables and for establishing evidence on leading versus lagging indicators. Causal mechanisms need time for their influences to be exerted, and the size of the effect will vary with the time interval between the causal influence and the outcome. Thus, if one statistically controls for a covariate measured at a time before it exerts its causal influence, the resultant model parameters might still be biased by the covariate. Time-varying covariates must be measured within the time frame in which they are exerting their influence to provide adequate representations of the causal, time-dependent processes. However, deciding on what an appropriate time frame might be is not an easy task, and might not be informed by previous longitudinal studies, given that the data collection intervals from many studies are determined by logistical and financial factors, rather than theoretical expectations about the timing of developmental processes.

Population Sampling, Attrition, and Mortality

In observational studies, representative sampling is important, as random assignment to conditions is not possible. However, attrition and mortality selection processes complicate both the definition of an aging population and the sampling procedures relevant to obtaining a representative sample in studies of later life. Attrition in longitudinal studies of aging is often nonrandom, or selective, in that it is likely to result from mortality or declining physical and mental functioning of the participants during the period of observation. This presents an important inferential problem, as the remaining sample becomes less and less representative of the population from which it originated. Generalization from the sample of continuing participants to the initial population might become difficult to justify. However, a major advantage of longitudinal studies is that they contain information necessary to examine the impact of attrition and mortality selection on the observed data. This information, which is inaccessible in cross-sectional data, is essential for valid inferences and improved understanding of developmental and aging processes.

Heterogeneity in terms of chronological age and population mortality poses analytical challenges for both cross-sectional and longitudinal data and is a particular challenge to studies that begin with age-heterogeneous samples. Age-homogeneous studies, where single or narrow age birth cohorts are initially sampled, provide an initially well-defined population that can be followed over time, permitting conditional estimates based on subsequent survival. However, initial sampling of individuals at different ages (i.e., age-heterogeneous samples), particularly in studies of adults and aging, confounds population selection processes related to mortality. The results from longitudinal studies, beginning as age-heterogeneous samples, can be properly evaluated and interpreted when the population parameters are estimated conditional on initial between-person age differences, as well as on mortality and attrition processes that permit inference to defined populations.

Incomplete data can take many forms, such as item or scale nonresponse, participant attrition, and mortality within the population of interest (i.e., lack of initial inclusion or follow-up because of death). Statistical analysis of longitudinal studies is aimed at providing inferences regarding the level and rate of change in functioning, group differences, variability, and construct relations within
a population, and incomplete data complicate this process. To make appropriate population inferences about development and change, it is important not only to consider thoroughly the processes leading to incomplete data (e.g., health, fatigue, cognitive functioning), but also to obtain measurements of these selection and attrition processes to the greatest extent possible and include them in the statistical analysis based on either maximum likelihood estimation or multiple imputation procedures. Longitudinal studies might be hindered by missing data or not being proximal to critical events that represent or influence the process of interest. As a consequence, some researchers have included additional assessments triggered by a particular event or response.

Effects of Repeated Testing

Retest (i.e., practice, exposure, learning, or reactivity) effects have been reported in several longitudinal studies, particularly in studies on aging and cognition where the expected effects are in the opposite direction. Estimates of longitudinal change might be exaggerated or attenuated depending on whether the developmental function is increasing or decreasing with age. Complicating matters is the potential for improvement to occur differentially, related to ability level, age, or task difficulty, as well as to related influences such as warm-up effects, anxiety, and test-specific learning.

Intensive measurement designs, such as those involving measurement bursts with widely spaced sets of intensive measurements, are required to distinguish short-term learning gains from long-term aging-related changes. The typical longitudinal design used to estimate developmental or aging functions usually involves widely spaced intervals between testing occasions. Design characteristics that are particularly sensitive to the assessment of time-related processes, such as retest or learning effects, have been termed temporal layering and involve the use of different assessment schedules within longitudinal design (i.e., daily, weekly, monthly, semiannually, or annually). For example, one such alternative, the measurement burst design, where assessment bursts are repeated over longer intervals, is a compromise between single-case time series and conventional longitudinal designs, and it permits the examination of within-person variation, covariation, and change (e.g., because of learning) within measurement bursts and evaluation of change in maximal performance over time across measurement bursts.

Selecting Measurement Instruments

The design of future longitudinal studies on aging can be usefully informed by the analysis and measurement protocol of existing studies. Such studies, completed or ongoing, provide evidence for informing decisions regarding optimal or essential test batteries of health, cognition, personality, and other measures. Incorporating features of measurement used in previous studies, when possible, would permit quantitative anchoring and essential opportunities for cross-cohort and cross-country comparison.

Comparable measures are essential for cross-study comparison, replication, and evaluation of generalizability of research findings. The similarity of a measure can vary at many levels, and within a single nation large operational differences can be found. When considering cross-cultural or cross-national data sets, these differences can be magnified: Regardless of whether the same measure has been used, differences are inevitably introduced because of language, administration, and item relevance. A balance must be found between optimal similarity of administration, similarity of meaning, and significance of meaning, avoiding unreasonable loss of information or lack of depth. These challenges must clearly be addressed in a collaborative endeavor, but in fact they are also critical to general development of the field, for without some means for comparing research products, our findings lack evidence for reproducibility and generalizability.

Challenges and Strengths

Longitudinal studies are necessary for explanatory theories of development and aging. The evidence obtained thus far from long-term longitudinal and intensive short-term longitudinal studies indicates remarkable within-person variation in many types of processes, even those once considered highly stable (e.g., personality). From both theoretical and empirical perspectives, between-person differences are a complex function of initial individual
differences and intraindividual change. The identification and understanding of the sources of between-person differences and of developmental and aging-related changes requires the direct observation of within-person change available in longitudinal studies.

There are many challenges for the design and analysis of strict within-person studies and large-sample longitudinal studies, and these will differ according to purpose. The challenges of strict within-person studies include limits on inferences given the smaller range of contexts and characteristics available within any single individual. Of course, the study of relatively stable individual characteristics and genetic differences requires between-person comparison approaches. In general, combinations of within-person and between-person population and temporal sampling designs are necessary for comprehensive understanding of within-person processes of aging because people differ in their responsiveness to influences of all types, and the breadth of contextual influences associated with developmental and aging outcomes is unavailable in any single individual. The strength of longitudinal designs is that they permit the simultaneous examination of within-person processes in the context of between-person variability, between-person differences in change, and between-person moderation of within-person processes.

Scott M. Hofer and Andrea M. Piccinin

See also Cross-Sectional Design; Population; Sequential Design; Within-Subjects Design

Further Readings

Alwin, D. F., Hofer, S. M., & McCammon, R. (2006). Modeling the effects of time: Integrating demographic and developmental perspectives. In R. H. Binstock & L. K. George (Eds.), Handbook of aging and the social sciences (6th ed., pp. 20–38). San Diego, CA: Academic Press.
Baltes, P. B., & Nesselroade, J. R. (1979). History and rationale of longitudinal research. In J. R. Nesselroade & P. B. Baltes (Eds.), Longitudinal research in the study of behavior and development. New York: Academic Press.
Hofer, S. M., Flaherty, B. P., & Hoffman, L. (2006). Cross-sectional analysis of time-dependent data: Problems of mean-induced association in age-heterogeneous samples and an alternative method based on sequential narrow age-cohorts. Multivariate Behavioral Research, 41, 165–187.
Hofer, S. M., & Hoffman, L. (2007). Statistical analysis with incomplete data: A developmental perspective. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling ecological and contextual effects in longitudinal studies of human development (pp. 13–32). Mahwah, NJ: Lawrence Erlbaum.
Hofer, S. M., & Piccinin, A. M. (2009). Integrative data analysis through coordination of measurement and analysis protocol across independent longitudinal studies. Psychological Methods, 14, 150–164.
Hofer, S. M., & Sliwinski, M. J. (2006). Design and analysis of longitudinal studies of aging. In J. E. Birren & K. W. Schaie (Eds.), Handbook of the psychology of aging (6th ed., pp. 15–37). San Diego, CA: Academic Press.
Nesselroade, J. R. (2001). Intraindividual variability in development within and between individuals. European Psychologist, 6, 187–193.
Piccinin, A. M., & Hofer, S. M. (2008). Integrative analysis of longitudinal studies on aging: Collaborative research networks, meta-analysis, and optimizing future studies. In S. M. Hofer & D. F. Alwin (Eds.), Handbook on cognitive aging: Interdisciplinary perspectives (pp. 446–476). Thousand Oaks, CA: Sage.
Schaie, K. W., & Hofer, S. M. (2001). Longitudinal studies of aging. In J. E. Birren & K. W. Schaie (Eds.), Handbook of the psychology of aging (pp. 53–77). San Diego, CA: Academic Press.
M
MAIN EFFECTS

Main effects can be defined as the average differences among the levels of one independent variable (or factor), collapsing across the levels of one or more other independent variables. In other words, investigators identify main effects, or how one independent variable influences the dependent variable, by ignoring or constraining the other independent variables in a model. For instance, let us say there is a difference between two levels of independent variable A and differences between three levels of independent variable B. Consequently, researchers can study the presence of both factors separately, as in single-factor experiments. Thus, main effects can be determined in either single-factor experiments or factorial design experiments. In addition, main effects can be interpreted meaningfully only if the interaction effect is absent. This entry focuses on main effects in factorial design, including analysis of the marginal means.

Main Effects in Factorial Design

Factorial design is applicable whenever researchers wish to examine the influence of a particular factor among two or more factors in their study. This design is a method for controlling various factors of interest in just one experiment rather than repeating the same experiment for each of the factors or independent variables in the study. If there is no significant interaction between the factors and many factors are involved in the study, testing the main effects with a factorial design likely confers efficiency.

In factorial design, each factor may have more than one level. Hence, the significance of the main effect, which is the difference in the marginal means of one factor over the levels of other factors, can be examined. For instance, suppose an education researcher is interested in knowing how gender affects the ability of first-year college students to solve algebra problems. The first variable is gender, and the second variable is the level of difficulty of the algebra problems. The second variable has two levels of difficulty: difficult (proof of algebra theorems) and easy (solution of simple multiple-choice questions). In this example, the researcher uses and examines a 2 × 2 factorial design. The number "2" represents the number of levels that each factor has. If there are more than two factors, then the factorial design would be adjusted; for instance, the factorial design may look like 3 × 2 × 2 for three factors with 3 levels versus 2 levels and another 2-level factor. Therefore, a total of three main effects would have to be considered in the study.

In the previous example of a 2 × 2 factorial design, however, both variables are thought to influence the ability of first-year college students to solve algebra problems. Hence, two main effects can be examined: (1) gender effects, while the level-of-difficulty effects are controlled, and (2) level-of-difficulty effects, while gender effects are
controlled. The hypothesis also can be stated in terms of whether first-year male and female college students differ in their ability to solve the more difficult algebra problems. The hypothesis can be answered by examining the simple main effects of gender or the simple main effects of the second variable (level of difficulty).

Marginal Means

An easy technique for checking the main effect of a factor is to examine the marginal means, or the average difference at each level that makes up the factorial design. The differences between levels in a factor could preliminarily affect the dependent variable. The differences in the marginal means also tell researchers how much, on average, one level of the factor differs from the others in affecting the dependent variable. For instance, Table 1 shows the two-level main effect of gender and the four-level main effect of college year on IQ test points. The marginal means from this 2 × 4 factorial design show that there might be a main effect of gender, with an average difference of 5 points, on IQ test scores. Also, there might be a main effect of college year, with differences of 5 to 22.5 points in IQ test scores across college years. To determine whether these point differences are greater than what would be expected from chance, the significance of these main effects needs to be tested.

The test of main effects significance for each factor is the test of between-subject effects provided by the analysis of variance (ANOVA) table found in many statistical software packages, such as SPSS (an IBM company, formerly called PASW Statistics), SAS, and MINITAB. The F ratio for the two factors, which is empirically computed from the amount of variance in the dependent variable contributed by these two factors, is the ratio of the relative variance of that particular factor to the random error variance. The larger the F ratio (i.e., the larger the relative variance), the more likely that the factor significantly affects the dependent variable. To determine whether the F ratio is large enough to show that the main effects are significant, the researcher can compare the F ratio with the critical F by using the critical values table provided in many statistics textbooks. The researcher can also compare the p value in the ANOVA table with the chosen significance level, say .05. If p < .05, then the effect for that factor on the dependent variable is significant. The marginal means can then be interpreted from these results, that is, which group (e.g., male vs. female, freshman vs. senior, or sophomore vs. senior) is significantly higher or lower than the other groups on that factor. It is important to report the F and p values, followed by the interpretation of the differences in the marginal means of a factor, especially for the significant main effects on the dependent variable.

The analysis of the main effects of a factor on the dependent variable while other factors are controlled is used when a researcher is interested in looking at the pattern of differences between the levels of individual independent variables. The significant main effects give the researcher information about how much one level of a factor could be higher or lower than the other levels. The significant main effect, however, is less meaningful when the interaction effect is significant, that is, when there is a significant interaction effect between factors A and B. In that case, the researcher should test the simple main effects instead of the main effects on the dependent variable.

Zairul Nor Deana Binti Md Desa

See also Analysis of Variance (ANOVA); Factorial Design; Interaction; Simple Main Effects
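The marginal-means check described in this entry can be illustrated with a short sketch. The cell means below are hypothetical (the entry's Table 1 is not reproduced here); the helper simply averages each row and each column of a factor-by-factor table of cell means.

```python
def marginal_means(cell_means):
    """Row and column marginal means of a two-factor table of cell means."""
    rows = [sum(row) / len(row) for row in cell_means]          # factor 1 levels
    cols = [sum(col) / len(col) for col in zip(*cell_means)]    # factor 2 levels
    return rows, cols

# Hypothetical 2 x 2 cell means (gender x problem difficulty):
cells = [[80.0, 70.0],   # male:   easy, difficult
         [75.0, 65.0]]   # female: easy, difficult
gender_means, difficulty_means = marginal_means(cells)
# gender_means = [75.0, 70.0]: a 5-point main effect of gender
# difficulty_means = [77.5, 67.5]: a 10-point main effect of difficulty
```

Whether such differences in the marginal means exceed chance would then be judged by the F tests described above.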
Mann–Whitney U Test
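The U calculations and the exact null probability worked through in this entry can be reproduced with a short script (the helper names are my own; the ranks are those of the coaching example used in the entry):

```python
from itertools import combinations
from math import comb, sqrt

def mann_whitney_u(ranks_a, ranks_b):
    """U for each group from its rank sum (no ties): U = R - n(n + 1)/2."""
    n1, n2 = len(ranks_a), len(ranks_b)
    u1 = sum(ranks_a) - n1 * (n1 + 1) // 2
    u2 = sum(ranks_b) - n2 * (n2 + 1) // 2
    return u1, u2

# Ranks from the example: Alba's students at 1, 2, 3, 5, 6, 9;
# Bolt's at 4, 7, 8, 10, 11.  Gives U1 = 5 and U2 = 25 (summing to n1*n2 = 30).
u_alba, u_bolt = mann_whitney_u([1, 2, 3, 5, 6, 9], [4, 7, 8, 10, 11])

# Exact two-tailed null probability of a U of 5 or less, by enumerating
# all 462 = 11!/(6!5!) assignments of the 11 ranks to a group of 6.
all_ranks = set(range(1, 12))
ways = sum(
    min(mann_whitney_u(sorted(a), sorted(all_ranks - set(a)))) <= 5
    for a in combinations(all_ranks, 6)
)
p_two_tailed = ways / comb(11, 6)  # 38 / 462, about .082

# Large-sample normal approximation (appropriate when n1 and n2 exceed 20;
# applied here only to illustrate the mean and SD formulas from the entry).
n1, n2 = 6, 5
z = (u_alba - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
```

The enumeration confirms the counts given in the entry: 38 of the 462 equally likely rank arrangements produce a U of 5 or less, for a two-tailed probability of about .082.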
students’ rank plus their coach, is shown in the following table (where A indicates Coach Alba and B indicates Coach Bolt):

Rank   1  2  3  4  5  6  7  8  9  10  11
Coach  A  A  A  B  A  A  B  B  A  B   B

Notice that the U score of 5 indicates that most of Coach Alba’s students are near the bottom (there are not many of Bolt’s students worse than them) and the much larger U value of 25 indicates that most of Coach Bolt’s students are near the top of the ranks.

However, consider what the U scores would be if every one of Alba’s students had made up the bottom six ranks. In this case, none of Alba’s students would have been above any of Bolt’s students in the ranks, and the U value would have been 0. The U value for Bolt’s students would have been 30.

Now consider the alternative situation, when the students from the two coaches are evenly spread across the ranks. Here the U values will be very similar. For example, in the following table, the two U values are actually the same:

Rank   1  2  3  4  5  6  7  8  9  10  11   U
Coach  A  B  A  B  A  B  A  B  A  B   A
Alba   0     1     2     3     4      5    15
Bolt      1     2     3     4     5        15

The U statistic can also be computed from the sum of the ranks for a group. For Alba’s group, with n1 = 6 students and a rank sum of R1 = 26,

U1 = R1 − n1(n1 + 1)/2 = 26 − 21 = 5.

This formula gives us the same value of 5 that was calculated by a different method earlier. But the formula provides a simple method of calculation, without having to laboriously inspect the ranks, as above. (Notice also that the mean rank for Alba’s group is R1/n1 = 26/6 = 4.33, below 6, the middle rank for 11 results. This also provides an indication that Alba’s students improve less than Bolt’s.)

The U statistic can be calculated for Coach Bolt. If Bolt’s students had been in the bottom five places, their ranks would have added up to 15 (1 + 2 + 3 + 4 + 5). In actual fact, the sum of the ranks of Bolt’s students is 4 + 7 + 8 + 10 + 11 = 40. So U2 equals the sum of actual ranks minus the sum of the bottom n2 ranks or, expressed as a formula:

U2 = R2 − n2(n2 + 1)/2 = 40 − 15 = 25.

This is a relatively large value, so Bolt’s students are generally near the top of the ranks. (Notice also that the mean rank for Bolt’s group is R2/n2 = 40/5 = 8, above 6, the middle for 11 ranks.) And the two values of U are U1 = 5 and U2 = 25, the same as produced by the different method earlier.

The sum of the two U values will always be n1n2, which in this case is 30. While the two U values are quite different from each other, indicating a separation of the samples into the lower and upper ranks, in statistical tests a result is significant only if the probability is less than or equal to the significance level (usually, p = .05).

The Probability of U

In a two-tailed prediction, researchers always test the smaller value of U because the probability is based on this value. In the example, there are 11 scores, 6 in one sample and 5 in the other. What is the probability of a U value of 5 or smaller when the null hypothesis is true? There are 462 possible permutations for the rank order of 6 scores in one sample and 5 in the other (11!/(5!6!)). Only two permutations produce a U value of zero: all one sample in the bottom ranks or all the other sample in the bottom ranks. It is not difficult to work out (using a mathematical formula for combinations) that there are also only two possible ways of getting a U of 1, four ways of getting a U of 2, six ways of getting a U of 3, 10 ways for a U of 4, and 14 ways for a U of 5. In sum, there are 38 possible ways of producing a U value of 5 or less by chance alone. Dividing 38 by 462 gives a probability of .082 of this result under the null hypothesis. Thus, for a two-tailed prediction, with sample sizes of 5 and 6, a U value of 5 is not significant at the p = .05 level of significance.

It is interesting to note that if we had made a one-tailed prediction, that is, specifically predicted in advance that Alba’s students would improve less than Bolt’s, then there would be only 19 ways of getting 5 or less, and the probability by chance would be .041. With this one-tailed prediction, a U value of 5 would now produce a significant difference at the significance level of p = .05.

However, one does not normally need to work out the probability for the calculated U values. It can be looked up in statistical tables. Alternatively, a software package will give the probability of a U value, which can be compared to the chosen significance level.

The Mann–Whitney U test is a useful test of small samples (Mann and Whitney, 1947, gave tables of the probability of U for samples of n1 and n2 up to 8), but with large sample sizes (n1 and n2 both greater than 20), the U distribution tends to a normal distribution with a mean of

n1n2/2

and standard deviation of

√[n1n2(n1 + n2 + 1)/12].

With large samples, rather than the U test, a z score can be calculated and the probability looked up in the standard normal distribution tables.

The accuracy of the Mann–Whitney U test is reduced the more tied ranks there are, and so where there are a large number of tied ranks, a correction to the test is necessary. Indeed, in this case, it is often worth considering whether a more precise measure of the dependent variable is called for.

Perry R. Hinton

See also Dependent Variable; Independent Variable; Nonparametric Statistics; Null Hypothesis; One-Tailed Test; Sample; Significance Level, Concept of; Significance Level, Interpretation and Construction; Two-Tailed Test; Wilcoxon Rank Sum Test; z Score

Further Readings

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Daniel, W. W. (1990). Applied nonparametric statistics (2nd ed.). Boston: PWS-KENT.
Gibbons, J. D., & Chakraborti, S. (2003). Nonparametric statistical inference (4th ed.). New York: Marcel Dekker.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50–60.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioural sciences. New York: McGraw-Hill.

MARGIN OF ERROR

In the popular media, the margin of error is the most frequently quoted measure of statistical accuracy for a sample estimate of a population parameter. Based on the conventional definition of the measure, the difference between the estimate and the targeted parameter should be bounded by the margin of error 95% of the time. Thus, only 1 in 20 surveys or studies should lead to a result in which the actual estimation error exceeds the margin of error.

Technically, the margin of error is defined as the radius or the half-width of a symmetric confidence interval. To formalize this definition, suppose that the targeted population parameter is denoted by θ. Let θ̂ represent an estimator of θ based on the sample data. Let SD(θ̂) denote the standard deviation of θ̂ (if known) or an estimator of the standard deviation (if unknown). SD(θ̂) is often referred to as the standard error.

Suppose that the sampling distribution of the standardized statistic

(θ̂ − θ) / SD(θ̂)

is symmetric about zero. Let Q(0.975) denote the 97.5th percentile of this distribution. (Note that the 2.5th percentile of the distribution would then be given by −Q(0.975).) A symmetric 95% confidence interval for θ is defined as θ̂ ± Q(0.975)SD(θ̂). The half-width or radius of such an interval, ME = Q(0.975)SD(θ̂), defines the conventional margin of error, which is implicitly based on a 95% confidence level.

A more general definition of the margin of error is based on an arbitrary confidence level. Suppose 100(1 − α)% represents the confidence level corresponding to a selected value of α between 0 and 1. Let Q(1 − α/2) denote the 100(1 − α/2)th percentile of the sampling distribution of (θ̂ − θ)/SD(θ̂). A symmetric 100(1 − α)% confidence interval is given by

θ̂ ± Q(1 − α/2)SD(θ̂),

leading to a margin of error of

ME(α) = Q(1 − α/2)SD(θ̂).

The size of the margin of error is based on three factors: (1) the size of the sample, (2) the variability of the data being sampled from the population, and (3) the confidence level (assuming that the conventional 95% level is not employed). The sample size and the population variability are both reflected in the standard error of the estimator, SD(θ̂), which decreases as the sample size increases and grows in accordance with the dispersion of the population data. The confidence level is represented by the percentile of the sampling distribution, Q(1 − α/2). This percentile becomes larger as α is decreased and the corresponding confidence level 100(1 − α)% is increased.

A common problem in research design is sample size determination. In estimating a parameter θ, an investigator often wishes to determine the sample size n required to ensure that the margin of error does not exceed some predetermined bound B; that is, to find the n that will ensure ME(α) ≤ B. Solving this problem requires specifying the confidence level as well as quantifying the population variability. The latter is often accomplished by relying on data from pilot or preliminary studies, or from prior studies that investigate similar phenomena. In some instances (such as when the parameter of interest is a proportion), an upper bound can be placed on the population variability. The use of such a bound results in a conservative sample size determination; that is, the resulting n is at least as large as (and possibly larger than) the sample size actually required to achieve the desired objective.

Two of the most basic problems in statistical inference consist of estimating a population mean and estimating a population proportion.
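As a sketch of the general definition (assuming a normal sampling distribution, so that Q(1 − α/2) is the standard normal percentile; the standard-error value is illustrative):

```python
from statistics import NormalDist

def margin_of_error(se, alpha=0.05):
    """ME(alpha) = Q(1 - alpha/2) * SD(theta-hat), with Q taken from
    the standard normal distribution (a common large-sample choice)."""
    q = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., about 1.96 for alpha = .05
    return q * se

# A standard error of 0.5 at the conventional 95% level:
me = margin_of_error(0.5)  # about 0.98
```

Lowering α (raising the confidence level) enlarges the percentile and therefore the margin of error, exactly as described above.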
Both are considered under random sampling with replacement. The margins of error for these problems are presented in the next sections.

Margin of Error for Means

Assume that a random sample of size n is drawn from a quantitative population with mean μ and standard deviation σ. Let x̄ denote the sample mean. The standard error of x̄ is then given by

σ/√n.

Assume that either (a) the population data may be viewed as normally distributed or (b) the sample size is ‘‘large’’ (typically, 30 or greater). The sampling distribution of the standardized statistic

(x̄ − μ) / (σ/√n)

then corresponds to the standard normal distribution, either exactly (under normality of the population) or approximately (in a large sample setting, by virtue of the central limit theorem). Let Z(1 − α/2) denote the 100(1 − α/2)th percentile of this distribution. The margin of error for x̄ is then given by

ME(α) = Z(1 − α/2)(σ/√n).

With the conventional 95% confidence level, α = 0.05 and Z(0.975) = 1.96 ≈ 2, leading to ME = 2(σ/√n).

In research design, the sample size needed to ensure that the margin of error does not exceed some bound B is determined by finding the smallest n that will ensure

n ≥ [Z(1 − α/2)σ/B]².

In instances in which the population standard deviation σ cannot be estimated based on data collected from earlier studies, a conservative approximation of σ can be made by taking one fourth the plausible range of the variable of interest (i.e., (1/4)[maximum − minimum]).

The preceding definition for the margin of error assumes that the standard deviation σ is known, an assumption that is unrealistic in practice. If σ is unknown, it must be estimated by the sample standard deviation s. In this case, the margin of error is based on the sampling distribution of the statistic

(x̄ − μ) / (s/√n).

This sampling distribution corresponds to the Student’s t distribution, either exactly (under normality of the population) or approximately (in a large sample setting). If T_df(1 − α/2) denotes the 100(1 − α/2)th percentile of the t distribution with df = n − 1 degrees of freedom, then the margin of error for x̄ is given by

ME(α) = T_df(1 − α/2)(s/√n).

Margin of Error for Proportions

Assume that a random sample of size n is drawn from a qualitative population where π denotes a proportion based on a characteristic of interest. Let p denote the sample proportion. The standard error of p is then given by

√(π(1 − π)/n).

Here, √(π(1 − π)) represents the standard deviation of the binary (0/1) population data, in which each object is dichotomized according to whether it exhibits the characteristic in question. The sample version of the standard error is

√(p(1 − p)/n).

Assume that the sample size is ‘‘large’’ (typically, such that nπ and n(1 − π) are both at least 10). The sampling distribution of the standardized statistic is then

(p − π) / √(π(1 − π)/n).
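A minimal sketch of the formulas for means (the large-sample case, so the normal percentile stands in for the t percentile; σ, n, and B below are illustrative values, not taken from the entry):

```python
import math
from statistics import NormalDist

def mean_margin_of_error(sigma, n, alpha=0.05):
    """ME(alpha) = Z(1 - alpha/2) * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / math.sqrt(n)

def sample_size_for_bound(sigma, bound, alpha=0.05):
    """Smallest n ensuring ME(alpha) <= bound:
    n >= (Z(1 - alpha/2) * sigma / bound)**2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / bound) ** 2)

# With sigma = 10 and n = 100, the 95% margin of error is about 1.96;
# bounding it by B = 1 requires n of at least 385.
me = mean_margin_of_error(10, 100)
n_needed = sample_size_for_bound(10, 1)
```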
This statistic can then be approximated by the standard normal distribution. The margin of error for p is then given by

ME(α) = Z(1 − α/2)√(p(1 − p)/n).

With the conventional 95% confidence level, ME = 2√(p(1 − p)/n).

In research design for sample size determination, for applications in which no data exist from previous studies for estimating the proportion of interest, the computation is often based on bounding the population standard deviation √(π(1 − π)). Using calculus, it can be shown that this quantity achieves a maximum value of 1/2 when the proportion π is 1/2. The maximum margin of error is then defined as

ME_MAX(α) = Z(1 − α/2)/(2√n).

The sample size needed to ensure that this margin of error does not exceed some bound B is determined by finding the smallest n that will ensure

n ≥ [Z(1 − α/2)/(2B)]².

Often in public opinion polls and other surveys, a number of population proportions are estimated, and a single margin of error is quoted for the entire set of estimates. This is generally accomplished with the preceding maximum margin of error ME_MAX(α), which is guaranteed to be at least as large as the margin of error for any of the individual estimates in the set. When the conventional 95% confidence level is employed, the maximum margin of error has a particularly simple form:

ME_MAX = 1/√n.

National opinion polls are often based on a sample size of roughly 1,100 participants, leading to the margin of error ME_MAX ≈ 0.03, or 3 percentage points.

Finite Population Correction

The preceding formulas for margins of error are based on the assumption that the sample is drawn from the population at random with replacement. If the sample is drawn at random without replacement, and the sampling fraction is relatively high (e.g., 5% or more), the formulas should be adjusted by a finite population correction (fpc). If N denotes the size of the population, this correction is given as

fpc = √((N − n)/(N − 1)).

Employing this correction has the effect of reducing the margin of error. As n approaches N, the fpc becomes smaller. When N = n, the entire population is sampled. In this case, the fpc and the margin of error are zero because the sample estimate is equal to the population parameter and there is no estimation error. In practice, the fpc is generally ignored because the size of the sample is usually small relative to the size of the population.

Generalization to Two Parameters

The preceding definitions for the margin of error are easily generalized to settings in which two targeted population parameters are of interest, say θ1 and θ2. Two parameters are often compared by estimating their difference, say θ1 − θ2. Let θ̂1 and θ̂2 represent estimators of θ1 and θ2 based on the sample data. Let SD(θ̂1 − θ̂2) denote the standard error of the difference in the estimators θ̂1 − θ̂2. Confidence intervals for θ1 − θ2 may be based on the standardized statistic

[(θ̂1 − θ̂2) − (θ1 − θ2)] / SD(θ̂1 − θ̂2).

As before, suppose that the sampling distribution of this statistic is symmetric about zero, and let Q(0.975) denote the 97.5th percentile of this distribution. A symmetric 95% confidence interval for θ1 − θ2 is defined as

(θ̂1 − θ̂2) ± Q(0.975)SD(θ̂1 − θ̂2).

The half-width or radius of such an interval, ME = Q(0.975)SD(θ̂1 − θ̂2), defines the conventional margin of error for the difference θ̂1 − θ̂2, which is based on a 95% confidence level.
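The polling figures quoted above can be checked numerically; this sketch computes the maximum margin of error (with the Z(0.975) ≈ 2 shortcut built in) and the finite population correction, using an illustrative population size:

```python
import math

def me_max(n):
    """Maximum 95% margin of error for a proportion, using the
    Z(0.975) ~ 2 shortcut: ME_MAX = 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def fpc(N, n):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

poll = me_max(1100)            # about 0.03, i.e., 3 percentage points
full_census = fpc(5000, 5000)  # 0.0: no estimation error when n = N
```

The larger the sampling fraction n/N, the smaller the correction factor, which is why the fpc matters only when a substantial share of the population is sampled.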
The more general definition based on an arbitrary confidence level 100(1 − α)% may be obtained by replacing the 97.5th percentile Q(0.975) with the 100(1 − α/2)th percentile Q(1 − α/2), leading to

ME(α) = Q(1 − α/2)SD(θ̂1 − θ̂2).

Joseph E. Cavanaugh and Eric D. Foster

See also Confidence Intervals; Estimation; Sample Size Planning; Standard Deviation; Standard Error of Estimate; Standard Error of the Mean; Student’s t Test; z Distribution

Further Readings

Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press.
Utts, J. M. (2005). Seeing through statistics (3rd ed.). Belmont, CA: Thomson Brooks/Cole.
Utts, J. M., & Heckard, R. F. (2007). Mind on statistics (3rd ed.). Belmont, CA: Thomson Brooks/Cole.

MARKOV CHAINS

The topic of Markov chains is a well-developed topic in probability. There are many fine expositions of Markov chains (e.g., Bremaud, 2008; Feller, 1968; Hoel, Port, & Stone, 1972; Kemeny & Snell, 1960). Those expositions and others have informed this concise entry on Markov chains, which is not intended to exhaust the topic of Markov chains. The topic is just too capacious for that. This entry provides an exposition of a judicious sampling of the major ideas, concepts, and methods regarding the topic.

Andrei Andreevich Markov

Andrei Andreevich Markov (1856–1922) formulated the seminal concept in the field of probability later known as the Markov chain. Markov was an eminent Russian mathematician who served as a professor in the Academy of Sciences at the University of St. Petersburg. One of Markov’s teachers was Pafnuty Chebyshev, a noted mathematician who formulated the famous inequality termed the Chebyshev inequality, which is extensively used in probability and statistics. Markov was the first person to provide a clear proof of the central limit theorem, a pivotal theorem in probability and statistics that indicates that the sum of a large number of independent random variables is asymptotically distributed as a normal distribution.

Markov was a well-trained mathematician who after 1900 emphasized inquiry in probability. After studying sequences of independent chance events, he became interested in sequences of mutually dependent events. This inquiry led to the creation of Markov chains.

Sequences of Chance Events

Markov chains are sequences of chance events. A series of flips of a fair coin is a typical sequence of chance events. Each coin flip has two possible outcomes: Either a head (H) appears or a tail (T) appears. With a fair coin, a head will appear with a probability (p) of 1/2 and a tail will appear with a probability of 1/2.

Successive coin flips are independent of each other in the sense that the probability of a head or a tail on the first flip does not affect the probability of a head or a tail on the second flip. In the case of the sequence HT, p(HT) = p(H) × p(T) = 1/2 × 1/2 = 1/4. Many sequences of chance events are composed of independent chance events such as coin flips or dice throws. However, some sequences of chance events are not composed of independent chance events. Some sequences of chance events are composed of events whose occurrences are influenced by prior chance events. Markov chains are such sequences of chance events.

As an example of a sequence of chance events that involves interdependence of events, let us consider a sequence of events E1, E2, E3, E4 such that the probability of any of the events after E1 is a function of the prior event. Instead of interpreting p(E1, E2, E3, E4) as the product of probabilities of independent events, p(E1, E2, E3, E4) is the product of an initial event probability and the conditional probabilities of successive events. From this perspective,

p(E1, E2, E3, E4) = p(E1) × p(E2|E1) × p(E3|E2) × p(E4|E3).

Such a sequence of events is a Markov chain.

Let us consider a sequence of chance events E1, E2, E3, …, Ej, …, En. If p(Ej|Ej−1) = p(Ej|Ej−1,
Ej−2, …, E1), then the sequence of chance events is a Markov chain. For a more formal definition, if X1, X2, …, Xn are random variables and if p(Xn = kn|Xn−1 = kn−1) = p(Xn = kn|Xn−1 = kn−1, …, X2 = k2, X1 = k1), then X1, X2, …, Xn form a Markov chain.

Conditional probabilities interrelating events are important in defining a Markov chain. Common in expositions of Markov chains, conditional probabilities interrelating events are termed transition probabilities interrelating states, events are termed states, and a set of states is often termed a system or a state space. The states in a Markov chain are either finite or countably infinite; this entry will feature systems or state spaces whose states are finite in number.

A matrix of transition probabilities is used to represent the interstate transitions possible for a Markov chain. As an example, consider the following system of states, S1, S2, and S3, with the following matrix of transition probabilities, P1:

          S1   S2   S3
     S1   .1   .8   .1
P1 = S2   .5   .3   .2
     S3   .1   .7   .2

Using the matrix P1, one can see that the transition probability of entering S2 given that one is in S1 is .8. That same probability is represented as p12.

Features of Markov Chains and Their States

There are various types of states in a Markov chain. Some types of states in a Markov chain relate to the degree to which states recur over time. A recurrent state is one that will return to itself before an infinite number of steps with probability 1. An absorbing state i is a recurrent state for which pii = 1 and pij = 0 for i ≠ j. In other words, if it is not possible to leave a given state, then that state is an absorbing state. Second, a state is transient if it is not recurrent. In other words, if the probability that a state will occur again before an infinite number of steps is less than 1, then that state is transient.

To illustrate the attributes of absorbing and transient states, let us consider the following example. A type of job certification involves a test with three resulting states. With state S1, one fails the test with a failing score and then maintains that failure status. With state S2, one passes the test with a low pass score that is inadequate for certification. One then takes the test again. Either one attains state S2 again with a probability of .5 or one passes the test with a high pass score and reaches state S3 with a probability of .5. With state S3, one passes the test with a high pass score that warrants job certification.

          S1   S2   S3
     S1    1    0    0
P2 = S2    0   .5   .5
     S3    0    0    1

P2 indicates the transition probabilities among the three states of the job certification process. From an examination of P2, state S1 is an absorbing state because p11 = 1 and p1j = 0 for 1 ≠ j, and state S3 is an absorbing state because p33 = 1 and p3j = 0 for 3 ≠ j. However, state S2 is a transient state because there is a nonzero probability that the state will never be reached again.

To illustrate that state S2 may never be reached again, let us examine what happens as the successive steps in the Markov chain occur. P2² indicates the transition probabilities after two steps in the process:

           S1   S2   S3
      S1    1    0    0
P2² = S2    0  .25  .75
      S3    0    0    1

After two steps, p22 has decreased to .25 and p23 has increased to .75. With each further step, p22 is halved again, so that after four steps

           S1   S2   S3
      S1    1    0    0
P2⁴ = S2    0  .06  .94
      S3    0    0    1

(entries rounded to two decimals), and p22 approaches 0 as the number of steps grows.
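The step-by-step behavior of P2 can be checked by repeated matrix multiplication; this sketch uses plain Python lists (no outside library assumed) to raise P2 to a power:

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, steps):
    """Transition probabilities after the given number of steps."""
    result = p
    for _ in range(steps - 1):
        result = mat_mul(result, p)
    return result

P2 = [[1, 0, 0],
      [0, .5, .5],
      [0, 0, 1]]

two_steps = mat_pow(P2, 2)   # middle row: [0, 0.25, 0.75]
four_steps = mat_pow(P2, 4)  # middle row: [0, 0.0625, 0.9375]
```

The probability of remaining in the transient state S2 shrinks geometrically, while the absorbing rows stay fixed.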
P5² indicates the transition probabilities after two steps:

      1   0   0
P5² = 0   1   0
      0   0   1

For the random walk, the stationary probabilities (π61, π62, π63) satisfy

                              .5  .5   0
(π61, π62, π63) = (π61, π62, π63) ·  .5   0  .5
                               0  .5  .5

= (.5π61 + .5π62, .5π61 + .5π63, .5π62 + .5π63).

This results in three equations:

1. π61 = .5π61 + .5π62
2. π62 = .5π61 + .5π63
3. π63 = .5π62 + .5π63

Along with those three equations is the additional equation

4. π61 + π62 + π63 = 1.

An arithmetic manipulation of these four equations results in numerical solutions for the three unknowns: π61 = π62 = π63 = 1/3 ≈ .33. These stationary probabilities indicate that Harry would be at any of the three locations with equal likelihood after many steps in the random walk.

Conclusion

If there is a sequence of random events such that a future event is dependent only on the present event and not on past events, then the sequence is likely a Markov chain, and the work of Markov and others may be used to extract useful information from an analysis of the sequence. The topic of the Markov chain has become one of the most captivating, generative, and useful topics in probability and statistics.

William M. Bart and Thomas Bart

See also Matrix Algebra; Probability, Laws of

Further Readings

Bremaud, P. (2008). Markov chains. New York: Springer.
Feller, W. (1968). An introduction to probability theory and its applications: Volume 1 (3rd ed.). New York: Wiley.
Hoel, P., Port, S., & Stone, C. (1972). Introduction to stochastic processes. Boston: Houghton Mifflin.
Kemeny, J., & Snell, J. (1960). Finite Markov chains. Princeton, NJ: D. Van Nostrand.

MATCHING

The term matching refers to the procedure of finding for a sample unit other units in the sample that are closest in terms of observable characteristics. The units selected are usually referred to as matches, and after repeating this procedure for all units (or a subgroup of them), the resulting subsample of units is called the matched sample. This idea is typically implemented across subgroups of a given sample, that is, for each unit in one subgroup, matches are found among units of another subgroup. A matching procedure requires defining a notion of distance, selecting the number of matches to be found, and deciding whether units will be used multiple times as a potential match. In applications, matching is commonly used as a preliminary step in the construction of a matched sample, that is, a sample of observations that are similar in terms of observed characteristics, and then some statistical procedure is computed with this subsample. Typically, the term matching estimator refers to the case when the statistical procedure of interest is a point estimator, such as the sample mean. The idea of matching is usually employed in the context of observational studies, in which it is assumed that selection into treatment, if present, is based on observable characteristics. More generally, under appropriate assumptions, matching may be used as a way of reducing variability in estimation, combining databases from different sources, dealing with missing data, and designing sampling strategies, among other possibilities. Finally, in the econometrics literature, the term matching is sometimes used more broadly to refer to a class of estimators that exploit the idea of selection on observables in the context of program evaluation. This entry focuses on the implementation of and statistical inference procedures for matching.

Description and Implementation

A natural way of describing matching formally is in the context of the classical potential outcomes model. To describe this model, suppose that a random sample of size n is available from a large population, which is represented by the collection of random variables (Yi, Ti, Xi), i = 1, 2, …, n, where Ti ∈ {0, 1},

Yi = Y0i if Ti = 0
     Y1i if Ti = 1,

and Xi represents a (possibly high-dimensional) vector of observed characteristics. This model aims to capture the idea that while the set of characteristics Xi is observed for all units, only one of the two random variables (Y0i, Y1i) is observed for each unit, depending on the value of Ti. The underlying random variables Y0i and Y1i are usually referred to as potential outcomes because they represent the two potential states for each unit.
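As an illustration of the notation (a hypothetical toy dataset, not taken from the entry), each unit carries both potential outcomes but reveals only one, and a naive group comparison can differ sharply from the true average effect when selection into treatment is nonrandom:

```python
# Hypothetical toy data: (Y0, Y1, T). The true unit-level effect Y1 - Y0
# is 1 everywhere, but units with high outcomes select into treatment.
units = [
    (0, 1, 0), (1, 2, 0), (2, 3, 0),    # untreated: we observe Y = Y0
    (8, 9, 1), (9, 10, 1), (10, 11, 1), # treated:   we observe Y = Y1
]

observed = [(y1 if t == 1 else y0, t) for (y0, y1, t) in units]

treated_mean = sum(y for y, t in observed if t == 1) / 3  # 10.0
control_mean = sum(y for y, t in observed if t == 0) / 3  # 1.0
naive_effect = treated_mean - control_mean                # 9.0
true_effect = sum(y1 - y0 for (y0, y1, _) in units) / 6   # 1.0
```

The gap between 9.0 and 1.0 is the selection bias that the assumptions discussed next are meant to rule out.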
For example, this model is routinely used in the program evaluation literature, where Ti represents treatment status and Y0i and Y1i represent outcomes without and with treatment, respectively. In most applications the goal is to establish statistical inference for some characteristic of the distribution of the potential outcomes such as the mean or quantiles. However, using the available sample directly to establish inference may lead to important biases in the estimation whenever units have selected into one of the two possible groups (Ti = 0 or Ti = 1). As a consequence, researchers often assume that the selection process, if present, is based on observable characteristics. This idea is formalized by the so-called conditional independence assumption: conditionally on Xi, the random variables (Y0i, Y1i) are independent of Ti. In other words, under this assumption, units having the same observable characteristics Xi are assigned to each of the two groups (Ti = 0 or Ti = 1) independently of their potential gains, captured by (Y0i, Y1i). Thus, this assumption imposes random treatment assignment conditional on Xi. This model also assumes some form of overlap or common support: For some c > 0, c ≤ P(Ti = 1|Xi) ≤ 1 − c. In words, this additional assumption ensures that there will be observations in both groups having a common value of observed characteristics if the sample size is large enough. The function p(Xi) = P(Ti = 1|Xi) is known as the propensity score and plays an important role in the literature. Finally, it is important to note that for many applications of interest, the model described above employs stronger assumptions than needed. For simplicity, however, the following discussion does not address these distinctions.

This setup naturally motivates matching: observations sharing common (or very similar) values of the observable characteristics Xi are assumed to be free of any selection biases, rendering the statistical inference that uses these observations valid. Of course, matching is not the only way of conducting correct inference in this model. Several parametric, semiparametric, and nonparametric techniques are available, depending on the object of interest and the assumptions imposed. Nonetheless, matching is an attractive procedure because it does not require employing smoothing techniques and appears to be less sensitive to some choices of user-defined tuning parameters.

To describe a matching procedure in detail, consider the special case of matching that uses the Euclidean distance to obtain M ≥ 1 matches with replacement for the two groups of observations defined by Ti = 0 and Ti = 1, using as a reservoir of potential matches for each unit i the group opposite to the group this unit belongs to. Then, for unit i the mth match, m = 1, 2, …, M, is given by the observation having index jm(i) such that

Tjm(i) ≠ Ti  and  Σ_{j=1}^{n} 1{Tj ≠ Ti} 1{‖Xj − Xi‖ ≤ ‖Xjm(i) − Xi‖} = m.

(The function 1{·} is the indicator function and ‖·‖ represents the Euclidean norm.) In words, for the ith unit, the mth match corresponds to the mth nearest neighbor among those observations belonging to the opposite group of unit i, as measured by the Euclidean distance between their observable characteristics. For example, if m = 1, then j1(i) corresponds to the unit’s index in the opposite group of unit i with the property that ‖Xj1(i) − Xi‖ ≤ ‖Xj − Xi‖ for all j such that Tj ≠ Ti; that is, Xj1(i) is the observation closest to Xi among all the observations in the appropriate group. Similarly, Xj2(i), …, XjM(i) are the second closest, third closest, and so forth, observations to Xi, among those observations in the appropriate subsample. Notice that to simplify the discussion, this definition assumes existence and uniqueness of an observation with index jm(i). (It is possible to modify the matching procedure to account for these problems.)

In general, the always observed random vector Xi may include both discrete and continuous random variables. When the distribution of (a subvector of) Xi is discrete, the matching procedure may be done exactly in large samples, leading to so-called exact matching. However, for those components of Xi that are continuously distributed, matching cannot be done exactly, and therefore in any given sample there will be a discrepancy in terms of observable characteristics, sometimes called the matching discrepancy. This discrepancy generates a bias that may affect inference even asymptotically.

The M matches for unit i are given by the observations with indexes JM(i) = {j1(i), …, jM(i)}.
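A minimal sketch of this nearest-neighbor rule (plain Python, hypothetical covariate data; M = 1, matching with replacement across groups):

```python
def euclidean(x, y):
    """Euclidean distance between two covariate vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def nearest_match(i, X, T):
    """Index j1(i): the closest unit in the group opposite to unit i."""
    pool = [j for j in range(len(X)) if T[j] != T[i]]
    return min(pool, key=lambda j: euclidean(X[j], X[i]))

# Hypothetical 2-dimensional covariates and treatment indicators.
X = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.1), (5.0, 5.0)]
T = [0, 0, 1, 1]

match_for_unit_2 = nearest_match(2, X, T)  # unit 0 is closest to unit 2
```

Extending the sketch to M > 1 matches amounts to sorting the pool by distance and keeping the first M indexes; ties would need the extra care mentioned above.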
That is, the matches are the pairs (Yj1(i), Xj1(i)), …, (YjM(i), XjM(i)). This procedure is repeated for the appropriate subsample of units to obtain the final matched sample. Once the matched sample is available, the statistical procedure of interest may be computed. To this end, the first step is to ‘‘recover’’ those counterfactual variables not observed for each unit, which in the context of matching is done by imputation. For example, first define

Ŷ0i = Yi                       if Ti = 0
      (1/M) Σ_{j ∈ JM(i)} Yj   if Ti = 1

and

Ŷ1i = (1/M) Σ_{j ∈ JM(i)} Yj   if Ti = 0
      Yi                       if Ti = 1.

One available implementation is so-called genetic matching, which uses evolutionary genetic algorithms to construct the matched sample and appears to work well with moderate sample sizes. This implementation allows for a generalized notion of distance (a reweighted Euclidean norm that includes the Mahalanobis metric as a particular case) and an arbitrary number of matches with and without replacement.

There exist several generalizations of the basic matching procedure described above, a particularly important one being the so-called optimal full matching. This procedure generalizes the idea of pair or M matching by constructing multiple submatched samples that may include more than one observation from each group. This procedure encompasses the
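The imputation step defined above can be sketched as follows (hypothetical data; M = 1, with each unit's match index in the opposite group taken as given):

```python
# Hypothetical observed data: outcomes, treatment indicators, and each
# unit's single nearest-neighbor match in the opposite group (M = 1).
Y = [1.0, 2.0, 5.0, 6.0]
T = [0, 0, 1, 1]
match = {0: 2, 1: 3, 2: 0, 3: 1}  # unit -> matched unit (assumed given)

# Impute the unobserved potential outcome from the matched unit.
Y0 = [Y[i] if T[i] == 0 else Y[match[i]] for i in range(4)]
Y1 = [Y[match[i]] if T[i] == 0 else Y[i] for i in range(4)]

# A matching estimator of the average treatment effect.
ate = sum(y1 - y0 for y0, y1 in zip(Y0, Y1)) / 4
```

With M > 1, the single matched value Y[match[i]] would be replaced by the average of the M matched outcomes, exactly as in the definitions of Ŷ0i and Ŷ1i above.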
Person      Balls   Cars   Coins   Novels
Toto            2      5      10       20
Marius          1      2       3        4
Olivette        6      1       3       10

For either convenience or clarity, the number of rows and columns can also be indicated as subscripts below the matrix name:

A = A_{I×J} = [a_{i,j}].    (3)

x̄ = x / ‖x‖.    (7)

Operations for Matrices

Transposition

If we exchange the roles of the rows and the columns of a matrix, we transpose it. This operation is called the transposition, and the new matrix is called a transposed matrix. The A matrix transposed is denoted Aᵀ. For example:

if A = A_{3×4} = [ 2  5 10 20
                   1  2  3  4
                   6  1  3 10 ],  then

Aᵀ = Aᵀ_{4×3} = [  2  1  6
                   5  2  1
                  10  3  3
                  20  4 10 ].    (8)

Addition of Matrices

When two matrices have the same dimensions, we compute their sum by adding the corresponding elements. For example, with

A = [ 2  5 10 20          B = [ 3  4  5  6
      1  2  3  4    and         2  4  6  8
      6  1  3 10 ]              1  2  3  5 ],    (9)

we find

A + B = [ 2+3  5+4  10+5  20+6        [ 5  9 15 26
          1+2  2+4   3+6   4+8    =     3  6  9 12
          6+1  1+2   3+3  10+5 ]        7  3  6 15 ].    (10)

In general,

A + B = [a_{i,j} + b_{i,j}] = [ a_{1,1}+b_{1,1}  ⋯  a_{1,J}+b_{1,J}
                                      ⋮         ⋱         ⋮
                                a_{I,1}+b_{I,1}  ⋯  a_{I,J}+b_{I,J} ].    (11)

Matrix addition behaves very much like usual addition. Specifically, matrix addition is commutative (i.e., A + B = B + A) and associative (i.e., A + [B + C] = [A + B] + C).

Multiplication of a Matrix by a Scalar

To differentiate matrices from the usual numbers, we call the latter scalar numbers or simply scalars. To multiply a matrix by a scalar, multiply each element of the matrix by this scalar. For example:

10 × B = 10 × [ 3 4 5 6        [ 30 40 50 60
                2 4 6 8    =     20 40 60 80
                1 2 3 5 ]        10 20 30 50 ].    (12)

Multiplication: Product or Products?

There are several ways of generalizing the concept of product to matrices. We will look at the most frequently used of these matrix products. Each of these products will behave like the product between scalars when the matrices have dimensions 1 × 1.

Hadamard Product

When generalizing product to matrices, the first approach is to multiply the corresponding elements of the two matrices that we want to multiply.
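The operations above can be sketched with plain Python lists (no matrix library assumed), using the same A and B as in the examples:

```python
def transpose(a):
    """Exchange the roles of rows and columns."""
    return [list(col) for col in zip(*a)]

def add(a, b):
    """Element-wise sum of two same-size matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(s, a):
    """Multiply every element of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

A = [[2, 5, 10, 20], [1, 2, 3, 4], [6, 1, 3, 10]]
B = [[3, 4, 5, 6], [2, 4, 6, 8], [1, 2, 3, 5]]

At = transpose(A)    # first row: [2, 1, 6]
S = add(A, B)        # first row: [5, 9, 15, 26]
tenB = scale(10, B)  # first row: [30, 40, 50, 60]
```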
764 Matrix Algebra
multiply. This is called the Hadamard product, for matrices with the same dimensions. Formally,
denoted by . The Hadamard product exists only it is defined as shown below, in matrix 13:
A B ¼ ½ai;j × bi;j
2 3
a1;1 × b1;1 a1;2 × b1;2 a1;j × b1;j a1;J × b1;J
6 7
6 a2;1 × b2;1 a2;2 × b2;2 a2;j × b2;j a2;J × b2;J 7
6 7
6 .. .. .. .. .. .. 7
6 . . . . . . 7 ð13Þ
6 7
¼6 7:
6 ai;1 × bi;1 ai;2 × bi;2 ai;j × bi;j ai;J × bi;J 7
6 7
6 .. .. .. .. .. .. 7
6 . . . . . . 7
4 5
aI;1 × bI;1 aI;2 × bI;2 aI;j × bI;j aI;J × bI;J
we get
2 3 2 3
2 × 3 5 × 4 10 × 5 20 × 6 6 20 50 120
4
A B ¼ 1×2 2×4 3×6 5
4×8 ¼ 2 84 18 32 5: ð15Þ
6 × 1 1 × 2 3 × 3 10 × 5 6 2 9 50
the context is clear). To compute c_{2,1} we add three terms: (1) the product of the first element of the second row of A (i.e., 4) with the first element of the first column of B (i.e., 1); (2) the product of the second element of the second row of A (i.e., 5) with the second element of the first column of B (i.e., 3); and (3) the product of the third element of the second row of A (i.e., 6) with the third element of the first column of B (i.e., 5). Formally, the term c_{2,1} is obtained as

c_{2,1} = \sum_{j=1}^{J=3} a_{2,j} × b_{j,1} = (a_{2,1} × b_{1,1}) + (a_{2,2} × b_{2,1}) + (a_{2,3} × b_{3,1}) = (4 × 1) + (5 × 3) + (6 × 5) = 49.   (20)

Matrix C is obtained as

C = [c_{i,k}] = \left[ \sum_{j=1}^{J=3} a_{i,j} × b_{j,k} \right] = \begin{bmatrix} 1×1 + 2×3 + 3×5 & 1×2 + 2×4 + 3×6 \\ 4×1 + 5×3 + 6×5 & 4×2 + 5×4 + 6×6 \end{bmatrix} = \begin{bmatrix} 22 & 28 \\ 49 & 64 \end{bmatrix}.   (21)

Properties of the Product

Like the product between scalars, the product between matrices is associative and distributive relative to addition. Specifically, for any set of three conformable matrices A, B, and C,

(AB)C = A(BC) = ABC   (associativity)   (22)

A(B + C) = AB + AC   (distributivity).   (23)

The matrix products AB and BA do not always exist, but when they do, these products are not, in general, commutative:

AB ≠ BA.   (24)

For example, with

A = \begin{bmatrix} 2 & -1 \\ 2 & -1 \end{bmatrix} and B = \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix},   (25)

we get

AB = \begin{bmatrix} 2 & -1 \\ 2 & -1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.   (26)

But

BA = \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 2 & -1 \\ 2 & -1 \end{bmatrix} = \begin{bmatrix} 4 & -2 \\ 8 & -4 \end{bmatrix}.   (27)

Incidentally, we can combine transposition and product and get the following equation:

(AB)^T = B^T A^T.   (28)

Exotic Product: Kronecker

Another product is the Kronecker product, also called the direct, tensor, or Zehfuss product. It is denoted ⊗ and is defined for all matrices. Specifically, with two matrices A = [a_{i,j}] (with dimensions I by J) and B (with dimensions K by L), the Kronecker product gives a matrix C (with dimensions (I × K) by (J × L)) defined as

A ⊗ B = \begin{bmatrix}
a_{1,1}B & \cdots & a_{1,j}B & \cdots & a_{1,J}B \\
a_{2,1}B & \cdots & a_{2,j}B & \cdots & a_{2,J}B \\
\vdots & & \vdots & & \vdots \\
a_{i,1}B & \cdots & a_{i,j}B & \cdots & a_{i,J}B \\
\vdots & & \vdots & & \vdots \\
a_{I,1}B & \cdots & a_{I,j}B & \cdots & a_{I,J}B
\end{bmatrix}.   (29)

For example, with

A = [1 \; 2 \; 3] and B = \begin{bmatrix} 6 & 7 \\ 8 & 9 \end{bmatrix},   (30)
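The standard product and its noncommutativity can be checked with a short sketch in plain Python (function name `matmul` is mine; the matrices are the entry's examples):

```python
def matmul(X, Y):
    # Standard matrix product: c[i][k] = sum over j of x[i][j] * y[j][k]
    # (Equation 20); zip(*Y) iterates over the columns of Y.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]
print(matmul(A, B))  # [[22, 28], [49, 64]]  (Equation 21)

# Noncommutativity (Equations 25-27): AB is the zero matrix, BA is not.
A2 = [[2, -1], [2, -1]]
B2 = [[1, 1], [2, 2]]
print(matmul(A2, B2))  # [[0, 0], [0, 0]]
print(matmul(B2, A2))  # [[4, -2], [8, -4]]
```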
A ⊗ B = \begin{bmatrix} 1×6 & 1×7 & 2×6 & 2×7 & 3×6 & 3×7 \\ 1×8 & 1×9 & 2×8 & 2×9 & 3×8 & 3×9 \end{bmatrix} = \begin{bmatrix} 6 & 7 & 12 & 14 & 18 & 21 \\ 8 & 9 & 16 & 18 & 24 & 27 \end{bmatrix}.   (31)

The Kronecker product is used to write design matrices. It is an essential tool for the derivation of expected values and sampling distributions.

A matrix is symmetric when it is equal to its transpose:

A = A^T.   (36)

A common mistake is to assume that the standard product of two symmetric matrices is commutative. But this is not true, as shown by the following example. With

A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix} and B = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 1 & 3 \\ 2 & 3 & 1 \end{bmatrix}.   (37)

For example, the previous matrix can be rewritten as:

A = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 30 \end{bmatrix} = diag{[10, 20, 30]}.   (42)

The operator diag can also be used to isolate the diagonal of any square matrix. For example, with

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}   (43)

we get

diag{A} = diag\left\{ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \right\} = \begin{bmatrix} 1 \\ 5 \\ 9 \end{bmatrix}.   (44)

Multiplication by a diagonal matrix rescales the rows or columns of a matrix. For example,

AC = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} × \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 6 \end{bmatrix} = \begin{bmatrix} 2 & 8 & 18 \\ 8 & 20 & 36 \end{bmatrix}   (48)

and also

BAC = \begin{bmatrix} 2 & 0 \\ 0 & 5 \end{bmatrix} × \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} × \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 6 \end{bmatrix} = \begin{bmatrix} 4 & 16 & 36 \\ 40 & 100 & 180 \end{bmatrix}.   (49)
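The block structure of the Kronecker product (Equation 29) and the diag operator can be sketched in plain Python (the names `kron` and `diag_of` are mine):

```python
def kron(X, Y):
    # Kronecker product: every element x[i][j] is expanded into the
    # block x[i][j] * Y (Equation 29).
    I, J, K, L = len(X), len(X[0]), len(Y), len(Y[0])
    return [[X[i][j] * Y[k][l] for j in range(J) for l in range(L)]
            for i in range(I) for k in range(K)]

print(kron([[1, 2, 3]], [[6, 7], [8, 9]]))
# [[6, 7, 12, 14, 18, 21], [8, 9, 16, 18, 24, 27]]  (Equation 31)

def diag_of(X):
    # Isolate the diagonal of a square matrix (Equations 43-44).
    return [X[i][i] for i in range(len(X))]

print(diag_of([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [1, 5, 9]
```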
\begin{bmatrix} 1×1 & 2×1 & 3×1 \\ 4×1 & 5×1 & 6×1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}.   (54)

The matrices can also be used to compute sums of rows or columns:

[1 \; 2 \; 3] × \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = (1 × 1) + (2 × 1) + (3 × 1) = 1 + 2 + 3 = 6,   (55)

or also

[1 \; 1] × \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} = [5 \; 7 \; 9].   (56)

Matrix Full of Zeros

A matrix whose elements are all equal to 0 is the null or zero matrix. It is denoted by 0 or, when we need to specify its dimensions, by 0_{I×J}. Null matrices are neutral elements for addition:

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + 0_{2×2} = \begin{bmatrix} 1+0 & 2+0 \\ 3+0 & 4+0 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.   (57)

They are also null elements for the Hadamard product:

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} ⊙ 0_{2×2} = \begin{bmatrix} 1×0 & 2×0 \\ 3×0 & 4×0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} = 0_{2×2}.   (58)

Triangular Matrix

A matrix is lower triangular when a_{i,j} = 0 for i < j. A matrix is upper triangular when a_{i,j} = 0 for i > j. For example,

A = \begin{bmatrix} 10 & 0 & 0 \\ 2 & 20 & 0 \\ 3 & 5 & 30 \end{bmatrix} is lower triangular,   (60)

and

B = \begin{bmatrix} 12 & 2 & 3 \\ 0 & 20 & 5 \\ 0 & 0 & 30 \end{bmatrix} is upper triangular.   (61)

Cross-Product Matrix

A cross-product matrix is obtained by multiplication of a matrix by its transpose. Therefore a cross-product matrix is square and symmetric. For example, the matrix

A = \begin{bmatrix} 1 & 1 \\ 2 & 4 \\ 3 & 4 \end{bmatrix}   (62)

premultiplied by its transpose

A^T = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 4 \end{bmatrix}   (63)

gives the cross-product matrix

A^T A = \begin{bmatrix} 1×1 + 2×2 + 3×3 & 1×1 + 2×4 + 3×4 \\ 1×1 + 4×2 + 4×3 & 1×1 + 4×4 + 4×4 \end{bmatrix} = \begin{bmatrix} 14 & 21 \\ 21 & 33 \end{bmatrix}.   (64)
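The cross-product computation of Equations 62 to 64 can be reproduced with a few lines of plain Python (helper names are mine):

```python
def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    # Standard matrix product; zip(*Y) walks the columns of Y.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 1], [2, 4], [3, 4]]   # Equation 62
S = matmul(transpose(A), A)    # A^T A, Equations 63-64
print(S)                       # [[14, 21], [21, 33]]
assert S == transpose(S)       # a cross-product matrix is symmetric
```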
[Figure: panels (a) and (b) showing the eigenvectors u_1 and u_2 of a matrix A and their images Au_1 and Au_2]
Eigenvector and Eigenvalue Matrices

Traditionally, we store the eigenvectors of A as the columns of a matrix denoted U. Eigenvalues are stored in a diagonal matrix (denoted Λ). Therefore, Equation 79 becomes

AU = UΛ.   (90)

For example, with A (from Equation 81), we have

\begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} × \begin{bmatrix} 3 & -1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 3 & -1 \\ 2 & 1 \end{bmatrix} × \begin{bmatrix} 4 & 0 \\ 0 & -1 \end{bmatrix}.   (91)

we obtain

A = UΛU^{-1} = \begin{bmatrix} 3 & -1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} .2 & .2 \\ -.4 & .6 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}.   (93)

Digression: An Infinity of Eigenvectors for One Eigenvalue

It is only through a slight abuse of language that we talk about the eigenvector associated with one eigenvalue. Any scalar multiple of an eigenvector is an eigenvector, so for each eigenvalue, there are
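The 2 × 2 example can be verified numerically. The sketch below checks AU = UΛ (Equation 90) and rebuilds A = UΛU⁻¹ (Equation 93) using the closed-form inverse of a 2 × 2 matrix; with NumPy the same check would use `numpy.linalg.eig`:

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[2, 3], [2, 1]]
U = [[3, -1], [2, 1]]   # eigenvectors stored as columns
L = [[4, 0], [0, -1]]   # eigenvalues on the diagonal

assert matmul(A, U) == matmul(U, L)   # Equation 90: AU = U Lambda

# Closed-form inverse of the 2x2 matrix U, then A = U Lambda U^{-1}.
det = U[0][0] * U[1][1] - U[0][1] * U[1][0]
Uinv = [[U[1][1] / det, -U[0][1] / det],
        [-U[1][0] / det, U[0][0] / det]]
R = matmul(matmul(U, L), Uinv)
print(R)  # approximately [[2, 3], [2, 1]], up to rounding
```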
Positive (Semi)Definite Matrices

Some matrices, such as \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, do not have eigenvalues. Fortunately, the matrices used often in statistics belong to a category called positive semidefinite. The eigendecomposition of these matrices always exists and has a particularly convenient form. A matrix is positive semidefinite when it can be obtained as the product of a matrix by its transpose. This implies that a positive semidefinite matrix is always symmetric. So, formally, the matrix A is positive semidefinite if it can be obtained as

A = XX^T   (98)

for a certain matrix X. Positive semidefinite matrices include correlation, covariance, and cross-product matrices.

The eigenvalues of a positive semidefinite matrix are always positive or null. Its eigenvectors are composed of real values and are pairwise orthogonal when their eigenvalues are different. This implies the following equality:

U^{-1} = U^T.   (99)

We can, therefore, express the positive semidefinite matrix A as

with

\begin{bmatrix} \sqrt{1/2} & \sqrt{1/2} \\ \sqrt{1/2} & -\sqrt{1/2} \end{bmatrix} \begin{bmatrix} \sqrt{1/2} & \sqrt{1/2} \\ \sqrt{1/2} & -\sqrt{1/2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.   (103)

Diagonalization

When a matrix is positive semidefinite, we can rewrite Equation 100 as

A = UΛU^T,  Λ = U^T AU.   (104)

This shows that we can transform A into a diagonal matrix. Therefore the eigendecomposition of a positive semidefinite matrix is often called its diagonalization.

Another Definition for Positive Semidefinite Matrices

A matrix A is positive semidefinite if for any nonzero vector x, we have

x^T Ax ≥ 0  ∀x.   (105)

When all the eigenvalues of a matrix are positive, the matrix is positive definite. In that case, Equation 105 becomes

x^T Ax > 0  ∀x.   (106)
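The two definitions agree: any matrix built as XXᵀ (Equation 98) satisfies xᵀAx ≥ 0 (Equation 105), since xᵀXXᵀx = ‖Xᵀx‖². A quick random check in plain Python (the test values and seed are mine, not the entry's):

```python
import random

random.seed(1)
n = 3
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
# A = X X^T is positive semidefinite by construction (Equation 98).
A = [[sum(X[i][k] * X[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]

for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(n)]
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    quad = sum(xi * ai for xi, ai in zip(x, Ax))  # x^T A x
    assert quad >= -1e-12   # Equation 105, up to rounding error
print("all quadratic forms nonnegative")
```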
with Λ being the matrix of the eigenvalues of A. For the previous example, we have

Λ = diag{16.1168, -1.1168, 0}.   (110)

We can verify that

trace{A} = \sum_{\ell} λ_{\ell} = 16.1168 + (-1.1168) + 0 = 15.   (111)

The eigendecomposition is essential in optimization. For example, principal components analysis is a technique used to analyze an I × J matrix X in which the rows are observations and the columns are variables. Principal components analysis finds orthogonal row factor scores that "explain" as much of the variance of X as possible. They are obtained as

F = XP,   (115)

the multiplication with a diagonal matrix of Lagrangian multipliers denoted Λ, in order to give the following expression:

Λ(P^T P - I).   (118)

This amounts to defining the following equation:

L = F^T F - Λ(P^T P - I).

The eigendecomposition decomposes a matrix into two simple matrices, and the SVD decomposes a rectangular matrix into three simple matrices: two orthogonal matrices and one diagonal matrix. The SVD uses the eigendecomposition of a positive semidefinite matrix to derive a similar decomposition for rectangular matrices.
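The trace identity in Equation 111 is easy to check numerically. The sketch below assumes the example matrix is the 3 × 3 matrix of Equation 43, whose trace is 15, and takes the rounded eigenvalues from Equation 110:

```python
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
trace = sum(A[i][i] for i in range(3))
eigenvalues = [16.1168, -1.1168, 0.0]   # Equation 110 (rounded)
print(trace)  # 15
# Equation 111: the trace equals the sum of the eigenvalues.
assert abs(trace - sum(eigenvalues)) < 1e-9
```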
can be expressed concisely only in terms of matrix algebra:

W = \frac{|C'SC|}{\left[ \frac{tr(C'SC)}{k-1} \right]^{k-1}},

where S is the k × k sample covariance matrix. One can rely on either an approximate or exact sampling distribution to determine the probability value of an obtained W value. Because of the cumbersome computations required to determine exact p values and the precision of the chi-square approximation, even statistical software packages (e.g., SPSS, an IBM company, formerly called PASW Statistics) typically rely on the latter. The chi-square approximation is based on the statistic -(n - 1)d \ln W with degrees of freedom (df) = k(k - 1)/2 - 1, where

d = 1 - \frac{2(k - 1)^2 + (k - 1) + 2}{6(k - 1)(n - 1)}.

For critical values for the exact distribution, see Nagarsenker and Pillai (1973).

Limitations and Critiques

The Mauchly test is not robust to nonnormality: Small departures from multivariate normality in the population distribution can lead to artificially low or high Type I error (i.e., false positive) rates. In particular, heavy-tailed (leptokurtic) distributions can—under typical sample sizes and significance thresholds—triple or quadruple the number of Type I errors beyond their expected rate. Researchers who conduct a Mauchly test should therefore examine their data for evidence of nonnormality and, if necessary, consider applying normalizing transformations before reconducting the Mauchly test.

Compared with other tests of sphericity, the Mauchly test is not the most statistically powerful. In particular, the local invariant test (see Cornell, Young, Seaman, & Kirk, 1992) produces fewer Type II errors (i.e., false negatives) than the Mauchly test does. This power difference between the two tests is trivially small for large samples and small k/n ratios but noteworthy for small sample sizes and large k/n ratios. For this reason, some statisticians advocate the use of the local invariant test over the Mauchly test.

Finally, some statisticians have called into question the utility of conducting any preliminary test of sphericity such as the Mauchly test. For repeated measures data sets in the social sciences, they argue, sphericity is almost always violated to some degree, and thus researchers should universally correct for this violation (by adjusting df with the Greenhouse–Geisser and the Huynh–Feldt estimates). Furthermore, like any significance test, the Mauchly test is limited in its utility by sample size: For large samples, small violations of sphericity often produce significant Mauchly test results, and for small samples, the Mauchly test often does not have the power to detect large violations of sphericity. Critics of sphericity testing also note that adoption of the df correction tests only when the Mauchly test reveals significant nonsphericity—as opposed to always adopting such df correction tests—does not produce fewer Type I or II errors under typical testing conditions (as shown by simulation research).

Aside from using alternative tests of sphericity or forgoing such tests in favor of adjusted df tests, researchers who collect data on repeated measures should also consider employing statistical models that do not assume sphericity. Of these alternative models, the most common is multivariate ANOVA (MANOVA). Power analyses have shown that the univariate ANOVA approach possesses greater power than the MANOVA approach when sample size is small (n < k + 10) or the sphericity violation is not large (ε > .7) but that the opposite is true when sample sizes are large and the sphericity violation is large.

Samuel T. Moulton

See also Analysis of Variance (ANOVA); Bartlett's Test; Greenhouse–Geisser Correction; Homogeneity of Variance; Homoscedasticity; Repeated Measures Design; Sphericity

Further Readings

Cornell, J. E., Young, D. M., Seaman, S. L., & Kirk, R. E. (1992). Power comparisons of eight tests for sphericity in repeated measures designs. Journal of Educational Statistics, 17, 233–249.
Huynh, H., & Mandeville, G. K. (1979). Validity conditions in repeated measures designs. Psychological Bulletin, 86, 964–973.
Keselman, H. J., Rogan, J. C., Mendoza, J. L., & Breen, L. J. (1980). Testing the validity conditions of repeated measures F tests. Psychological Bulletin, 87, 479–481.
Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. Annals of Mathematical Statistics, 11, 204–209.
Nagarsenker, B. N., & Pillai, K. C. S. (1973). The distribution of the sphericity test criterion. Journal of Multivariate Analysis, 3, 226–235.

MBESS

MBESS is an R package that was developed primarily to implement important but nonstandard methods for the behavioral, educational, and social sciences. The generality and applicability of many of the functions contained in MBESS have allowed the package to be used in a variety of other disciplines. Both MBESS and R are open source and freely available from The R Project's Comprehensive R Archive Network. The MBESS Web page contains the reference manual, source code files, and binary files. MBESS (and R) is available for Apple Macintosh, Microsoft Windows, and Unix/Linux operating systems.

The major categories of functions contained in MBESS are (a) estimation of effect sizes (standardized and unstandardized), (b) confidence interval formation based on central and noncentral distributions (t, F, and χ²), (c) sample size planning from the accuracy in parameter estimation and power analytic perspectives, and (d) miscellaneous functions that allow the user to easily interact with R for analyzing and graphing data. Most MBESS functions require only summary statistics. MBESS thus allows researchers to compute effect sizes and confidence intervals based on summary statistics, which facilitates using previously reported information (e.g., for calculating effect sizes to be included in meta-analyses) or if one is primarily using a program other than R to analyze data but still would like to use the functionality of MBESS.

MBESS, like R, is based on a programming environment instead of a point-and-click interface for the analysis of data. Because of the necessity to write code in order for R to implement functions (such as the functions contained within the MBESS package), a resulting benefit is "reproducible research," in the sense that a record exists of the exact analyses performed, with all options and subsamples denoted. Having a record, by way of a script file, of the exact analyses that were performed is beneficial so that the data analyst can (a) respond to inquiries regarding the exact analyses, algorithms, and options; (b) modify code for similar analyses on the same or future data; and (c) provide code and data so that others can replicate the published results. Many novel statistical techniques are implemented in R, and in many ways R has become necessary for cutting-edge developments in statistics and measurement. In fact, R has even been referred to as the lingua franca of statistics.

MBESS, developed by Ken Kelley, was first released publicly in May 2006 and has since incorporated functions contributed by others. MBESS will continue to be developed for the foreseeable future and will remain open source and freely available. Although only minimum experience with R is required in order to use many of the functions contained within the MBESS package, in order to use MBESS to its maximum potential, experience with R is desirable.

Ken Kelley

See also Confidence Intervals; Effect Size, Measures of; R; Sample Size Planning

Further Readings

de Leeuw, J. (2005). On abandoning XLISP-STAT. Journal of Statistical Software, 13(7), 1–5.
Kelley, K. (2006–2008). MBESS [computer software and manual]. Accessed February 16, 2010, from http://cran.r-project.org/web/packages/MBESS/index.html
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation. Journal of Statistical Software, 20(8), 1–24.
Kelley, K. (2007). Methods for the behavioral, educational, and social science: An R package. Behavior Research Methods, 39(4), 979–984.
Kelley, K., Lai, K., & Wu, P.-J. (2008). Using R for data analysis: A best practice for research. In J. Osbourne (Ed.), Best practices in quantitative methods (pp. 535–572). Thousand Oaks, CA: Sage.
R Development Core Team. (2008). The R project for statistical computing. Retrieved February 16, 2010, from http://www.R-project.org/
Venables, W. N., Smith, D. M., & The R Development Core Team. (2008). An introduction to R: Notes on R: A programming environment for data analysis and graphics. Retrieved February 16, 2010, from http://cran.r-project.org/doc/manuals/R-intro.pdf

Websites

Comprehensive R Archive Network: http://cran.r-project.org
MBESS: http://cran.r-project.org/web/packages/MBESS/index.html
The R Project: http://www.r-project.org

MCNEMAR'S TEST

McNemar's test, also known as a test of correlated proportions, is a nonparametric test used with dichotomous nominal or ordinal data to determine whether two sample proportions based on the same individuals are equal. McNemar's test is used in many fields, including the behavioral and biomedical sciences. In short, it is a test of symmetry between two related samples based on the chi-square distribution with 1 degree of freedom (df).

McNemar's test is unique in that it is the only test that can be used when one or both conditions being studied are measured using the nominal scale. It is often used in before–after studies, in which the same individuals are measured at two times, a pretest–posttest, for example. McNemar's test is also often used in matched-pairs studies, in which similar people are exposed to two different conditions, such as a case–control study. This entry details the McNemar's test formula, provides an example to illustrate the test, and examines its application in research.

Formula

McNemar's test, in its original form, was designed only for dichotomous variables (i.e., yes–no, right–wrong, effect–no effect) and therefore gives rise to proportions. McNemar's test is a test of the equality of these proportions to one another given the fact that they are based in part on the same individual and therefore correlated. More specifically, McNemar's test assesses the significance of any observed change while accounting for the dependent nature of the sample. To do so, a fourfold table of frequencies must be set up to represent the first and second sets of responses from the same or matched individuals. This table is also known as a 2 × 2 contingency table and is illustrated in Table 1.

In this table, Cells A and D represent the discordant pairs, or individuals whose response changed from the first to the second time. If an individual changes from + to -, he or she is included in Cell A. Conversely, if the individual changes from - to +, he or she is tallied in Cell D. Cells B and C represent individuals who did not change responses over time, or pairs that are in agreement. McNemar's test determines whether the proportion of individuals who changed in one direction (+ to -) is significantly different from that of individuals who changed in the other direction (- to +).

When one is using McNemar's test, it is unnecessary to calculate actual proportions. The difference between the proportions algebraically and conceptually reduces to the difference between the frequencies given in A and D. McNemar's test then assumes that A and D belong to a binomial distribution defined by

n = A + D,  p = .5,  and q = .5.

Based on this, the expectation under the null hypothesis would be that ½(A + D) cases would change in one direction and ½(A + D) cases would change in the other direction. Therefore, H0: A = D. The χ² formula,

χ² = \sum \frac{(O_i - E_i)^2}{E_i},

where O_i = observed number of cases in the ith category and E_i = expected number of cases in the ith category under H0, converts into

χ² = \frac{(A - (A + D)/2)^2}{(A + D)/2} + \frac{(D - (A + D)/2)^2}{(A + D)/2}

and then factors into

χ² = \frac{(A - D)^2}{A + D}  with df = 1.
Table 1  Fourfold Table for Use in Testing Significance of Change

                 After
               -      +
Before   +     A      B
         -     C      D

This is McNemar's test formula. The sample distribution is distributed approximately as chi-square with 1 df.

Correction for Continuity

The approximation of the sample distribution by the chi-square distribution can present problems, especially if the expected frequencies are small. This is because the chi-square is a continuous distribution whereas the sample distribution is discrete. The correction for continuity, developed by Frank Yates, is a method for removing this source of error. It requires the subtraction of 1 from the absolute value of the difference between A and D prior to squaring. The subsequent formula, including the correction for continuity, becomes

χ² = \frac{(|A - D| - 1)^2}{A + D}  with df = 1.

Small Expected Frequencies

When the expected frequency is very small (½(A + D) < 5), the binomial test should be used instead.

Example

Suppose a researcher was interested in the effect of negative political campaign messages on voting behavior. To investigate, the researcher uses a before–after design in which 65 subjects are polled twice on whether they would vote for a certain politician: before and after viewing a negative campaign ad discrediting that politician. The researcher hypothesizes that the negative campaign message will reduce the number of individuals who will vote for the candidate targeted by the negative ad. The data are recorded in the form shown in Table 2. The hypothesis test follows; the data are entirely artificial.

Table 2  Form of Table to Show Subjects' Change in Voting Decision in Response to Negative Campaign Ad

                         After Campaign Ad
Before Campaign Ad     No Vote    Yes Vote
  Yes vote                yn         yy
  No vote                 nn         ny

Statistical Test

McNemar's test is chosen to determine whether there was a significant change in voter behavior. McNemar's test is appropriate because the study uses two related samples, the data are measured on a nominal scale, and the researcher is using a before–after design. McNemar's test formula as it applies to Table 2 is shown below.

χ² = \frac{(yn - ny)^2}{yn + ny}  with df = 1.

With the correction for continuity included, the formula becomes

χ² = \frac{(|yn - ny| - 1)^2}{yn + ny}.

Hypotheses

H0: For those subjects who change, the probability that any individual will change his or her vote from yes to no after being shown the campaign ad (that is, P_{yn}) is equal to the probability that the individual will change his or her vote from no to yes (that is, P_{ny}), which is equal to ½. More specifically,

H0: P_{yn} = P_{ny} = ½.

H1: For those subjects who change, the probability that any individual will change his or her vote from yes to no after being shown the negative campaign ad will be significantly greater than the probability that the individual will change his or her vote from no to yes. In other words,

H1: P_{yn} > P_{ny}.
McNemar’s Test 781
Sampling Distribution

The sampling distribution of χ² as computed by McNemar's test is very closely approximated by the chi-square distribution with df = 1. In this example, H1 predicts the direction of the difference and therefore requires a one-tailed rejection region. This region consists of all the χ² values that are so large they only have a 1% likelihood of occurring if the null hypothesis is true. For a one-tailed test, the critical value with p < .01 is 7.87944.

Calculation and Decision

The artificial results of the study are shown in Table 3. The table shows that 30 subjects changed their vote from yes to no (yn) after seeing the negative campaign ad, and 7 subjects changed their vote from no to yes (ny). The other two cells, yy = 11 and nn = 17, represent those individuals who did not change their vote after seeing the ad.

Table 3  Subjects' Voting Decision Before and After Seeing Negative Campaign Ad

                         After Campaign Ad
Before Campaign Ad     No Vote    Yes Vote
  Yes vote                30          11
  No vote                 17           7

For these data,

χ² = \frac{(yn - ny)^2}{yn + ny} = \frac{(30 - 7)^2}{30 + 7} = \frac{(23)^2}{37} = 14.30.

Including the correction for continuity:

χ² = \frac{(|yn - ny| - 1)^2}{yn + ny} = \frac{(|30 - 7| - 1)^2}{30 + 7} = \frac{(22)^2}{37} = 13.08.

The critical χ² value for a one-tailed test at α = .01 is 7.87944. Both 14.30 and 13.08 are greater than 7.87944; therefore, the null hypothesis is rejected. These results support the researcher's hypothesis.

Application

McNemar's test is valuable to the behavioral and biomedical sciences because it gives researchers a way to test for significant effects in dependent samples using nominal measurement. It does so by reducing the difference between proportions to the difference between discordant pairs and then applying the binomial distribution. It has proven useful in the study of everything from epidemiology to voting behavior, and it has been modified to fit more specific situations, such as misclassified data, improved sample size estimations, multivariate samples, and clustered matched-pair data.

M. Ashley Morrison

See also Chi-Square Test; Dichotomous Variable; Distribution; Nominal Scale; Nonparametric Statistics; One-Tailed Test; Ordinal Scale

Further Readings

Bowker, A. H. (1948). A test for symmetry in contingency tables. Journal of the American Statistical Association, 43, 572–574.
Eliasziw, M., & Donner, A. (1991). Application of the McNemar test to non-independent matched pair data. Statistics in Medicine, 10, 1981–1991.
Hays, W. L. (1994). Statistics (5th ed.). Orlando, FL: Harcourt Brace.
Klingenberg, B., & Agresti, A. (2006). Multivariate extensions of McNemar's test. Biometrics, 62, 921–928.
Levin, J. R., & Serlin, R. C. (2000). Changing students' perspectives of McNemar's test of change. Journal of Statistics Education [online], 8(2). Retrieved February 16, 2010, from www.amstat.org/publications/jse/secure/v8n2/levin.cfm
Lyles, R. H., Williamson, J. M., Lin, H. M., & Heilig, C. M. (2005). Extending McNemar's test: Estimation and inference when paired binary outcome data are misclassified. Biometrics, 61, 287–294.
McNemar, Q. (1947). Note on sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153–157.
Satten, G. A., & Kupper, L. L. (1990). Sample size determination for matched-pair case-control studies where the goal is interval estimation of the odds ratio. Journal of Clinical Epidemiology, 43, 55–59.
Siegel, S. (1956). Nonparametric statistics. New York: McGraw-Hill.
Yates, F. (1934). Contingency tables involving small numbers and the χ2 test. Supplement to Journal of the Royal Statistical Society, 1, 217–235.

MEAN

The mean is a parameter that measures the central location of the distribution of a random variable and is an important statistic that is widely reported in scientific literature. Although the arithmetic mean is the most commonly used statistic in describing the central location of the sample data, other variations of it, such as the truncated mean, the interquartile mean, and the geometric mean, may be better suited in a given circumstance. The characteristics of the data dictate which one of them should be used. Regardless of which mean is used, the sample mean remains a random variable. It varies with each sample that is taken from the same population. This entry discusses the use of the mean in probability and statistics, differentiates between the arithmetic mean and its variations, and examines how to determine its appropriateness to the data.

Use in Probability and Statistics

In probability, the mean is a parameter that measures the central location of the distribution of a random variable. For a real-valued random variable, the mean, or more appropriately the population mean, is the expected value of the random variable. That is to say, if one observes the random variable numerous times, the observed values of the random variable would converge in probability to the mean. For a discrete random variable with a probability function p(y), the expected value exists if

\sum_{y} |y| p(y) < ∞,   (1)

where y is the values assigned by the random variable. For a continuous random variable with a probability density function f(y), the expected value exists if

\int_{-∞}^{∞} |y| f(y) dy < ∞.   (2)

Comparing Equation 1 with Equation 2, one notices immediately that the f(y)dy in Equation 2 mirrors the p(y) in Equation 1, and the integration in Equation 2 is analogous to the summation in Equation 1.

The above definitions help to understand conceptually the expected value, or the population mean. However, they are seldom used in research to derive the population mean. This is because in most circumstances, either the size of the population (discrete random variables) or the true probability density function (continuous random variables) is unknown, or the size of the population is so large that it becomes impractical to observe the entire population. The population mean is thus an unknown quantity.

In statistics, a sample is often taken to estimate the population mean. Results derived from data are thus called statistics (in contrast to what are called parameters in populations). If the distribution of a random variable is known, a probability model may be fitted to the sample data. The population mean is then estimated from the model parameters. For instance, if a sample can be fitted with a normal probability distribution model with parameters μ and σ, the population mean is simply estimated by the parameter μ (and σ² as the variance). If the sample can be fitted with a Gamma distribution with parameters α and β, the population mean is estimated by the product of α and β (i.e., αβ), with αβ² as the variance. For an exponential random variable with parameter β, the population mean is simply the β, with β² as the variance. For a chi-square (χ²) random variable with v degrees of freedom, the population mean is v, with 2v being the variance.

Arithmetic Mean

When the sample data are not fitted with a known probability model, the population mean is often inferred from the sample mean, a common practice in applied research. The most widely used sample mean for estimating the population mean is the arithmetic mean, which is calculated as the
divided by the number of observations in the either in interval or in ratio scale. For ordinal
sample. data, the arithmetic mean is not always the most
Formally, for a sample of n observations, appropriate measure of the central location; the
x1, x2, ..., xn on a random variable X, the arithmetic mean (x̄) of the sample is defined as

\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad (3)

where the notation \sum_{i=1}^{n} is a succinct representation of the summation of all values from the first to the last observation of the sample. For example, a sample consisting of five observations with values of 4, 5, 2, 6, and 3 has a mean 4 [= (4 + 5 + 2 + 6 + 3)/5] according to the above definition. A key property of the mean as defined above is that the sum of deviations from it is zero.

If data are grouped, the sample mean can no longer be constructed from each individual measurement. Instead, it is defined using the midvalue of each group interval (x_j) and the corresponding frequency of the group (f_j):

\bar{x} = \frac{1}{n}\sum_{j=1}^{m} f_j x_j, \qquad (4)

where m is the number of groups, and n is the total number of observations in the sample. In Equation 4, f_j x_j is the total value for the jth group. A summation of the values of all groups is then the grand total of the sample, which is equivalent to the value obtained through summation, as defined in Equation 3. For instance, a sample of (n =) 20 observations is divided into three groups. The intervals for the three groups are 5 to 9 (x_1 = 7), 10 to 14 (x_2 = 12), and 15 to 19 (x_3 = 17), respectively. The corresponding frequency for each group is (f_1 =) 6, (f_2 =) 5, and (f_3 =) 9. The sample mean according to Equation 4 is then

\bar{x} = \frac{7 \cdot 6 + 12 \cdot 5 + 17 \cdot 9}{20} = 12.75.

Notice that in Equation 3, we summed up the values of all individual observations before arriving at the sample mean. The summation process is an arithmetic operation on the data. This requires that the data be continuous; that is, they must be measured on an interval or ratio scale. The mean is therefore not as broadly applicable as the median is, because the median does not require the summation operation.

Notice further that in Equation 3, each observation is given an equal weight. Consequently, the arithmetic mean is highly susceptible to extreme values. Extreme low values would underestimate the mean, while extreme high values would inflate the mean. One must keep this property of the sample arithmetic mean in mind when using it to describe research results.

Because the arithmetic mean is susceptible to variability in the sample data, it is often insufficient to report only the sample mean without also showing the sample standard deviation. Whereas the mean describes the central location of the data, the standard deviation provides information about the variability of the data. Two sets of data with the same sample mean but drastically different standard deviations inform the reader that either they come from two different populations or they suffer from variability in quality control in the data collection process. Therefore, by reporting both statistics, one informs the reader of not only the quality of the data but also the appropriateness of using these statistics to describe the data, as well as the appropriate choice of statistical methods to analyze these data subsequently.

Appropriateness

Whether the mean is an appropriate or inappropriate statistic to describe the data is best illustrated by examples of some highly skewed sample data, such as data on the salaries of a corporation, on the house prices in a region, on the total family income in a nation, and so forth. These types of socioeconomic data are often distorted by a few high-income earners or a few high-end properties. The mean is thus an inappropriate statistic to describe the central location of the data, and the median would be a better statistic for the purpose. On the other hand, if one is interested in describing the height or the test score of students in a school, the sample mean would be a good description of the central tendency of the population, as these types of data often follow a unimodal symmetric distribution.
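As an illustration (not part of the original entry), Equations 3 and 4 can be computed directly; the numbers below reproduce the entry's worked examples (a mean of 4 for the five-observation sample, and 12.75 for the grouped sample):

```python
# Minimal sketch of Equations 3 and 4 from this entry.

def sample_mean(xs):
    """Equation 3: sum all observations, then divide by n."""
    return sum(xs) / len(xs)

def grouped_mean(midvalues, freqs):
    """Equation 4: weight each group midvalue x_j by its frequency f_j."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midvalues)) / n

print(sample_mean([4, 5, 2, 6, 3]))          # 4.0
print(grouped_mean([7, 12, 17], [6, 5, 9]))  # 12.75
```

The zero-sum-of-deviations property mentioned above follows directly: subtracting the mean from each value and summing gives 0.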
\bar{x} = \frac{2}{n}\sum_{i=(n/4)+1}^{3n/4} x_i, \qquad (6)

Analogous to the summation representation in Equation 3, the notation \prod_{i=1}^{n} is a succinct

Mean Comparisons
of the relationship between group membership and the dependent variable (in the case of correlational effect sizes) or the percentage of variance accounted for in the dependent variable by group membership. Most of these effect sizes are insensitive to the coding of the group membership variable. The three main types of effect sizes are the point-biserial correlation, the eta (or eta-squared) coefficient, and multiple R (or R²).

Point-Biserial Correlation

Of the effect sizes mentioned here, the point-biserial correlation is the only correlational effect size whose sign is dependent on the coding of the categorical variable. It is also the only measure presented here that requires that there be only two groups in the categorical variable. The equation to compute the point-biserial correlation is

r_{pb} = \frac{M_1 - M_2}{SD_{Tot}} \sqrt{\frac{n_1 n_2}{N(N-1)}}, \qquad (3)

where SD_Tot is the total standard deviation across all groups; n_1 and n_2 are the sample sizes for Groups 1 and 2, respectively; and N is the total sample size across the two groups. Though it is a standard Pearson correlation, it does not range from zero to ±1.00; the maximum absolute value is about 0.78. The point-biserial correlation is also sensitive to the proportion of people in each group; if the proportion of people in each group differs substantially from 50%, the maximum value drops even further away from 1.00.

Eta

Although the eta coefficient can be interpreted as a correlation, it is not a form of a Pearson correlation. While the correlation is a measure of the linear relationship between variables, the eta actually measures any relationship between the categorical independent variable and the continuous dependent variable. Eta-squared is the square of the eta coefficient and is the ratio of the between-group variance to the total variance. The eta (and eta-squared) can be computed with any number of independent variables and any number of categories in each of those categorical variables. The eta tells the magnitude of the relationship between the categorical variable(s) and the dependent variable but does not describe it; post hoc examinations must be undertaken to understand the nature of the relationship. The eta-squared coefficient tells the proportion of variance that can be accounted for by group membership.

Multiple R

The multiple R or R² is the effect size derived from multiple regression techniques. Like a correlation or the eta coefficient, the multiple R tells the magnitude of the relationship between the set of categorical independent variables and the continuous dependent variable. Also similarly, the R² is the proportion of variance in the dependent variable accounted for by the set of categorical independent variables. The magnitude for the multiple R (and R²) will be equal to the eta (and eta²) for the full ANOVA model; however, substantial post hoc tests are unnecessary in a multiple regression framework, because careful interpretation of the regression weights can describe the nature of the mean differences.

Additional Issues

As research in the social sciences increases at an exponential rate, cumulating research findings across studies becomes increasingly important. In this context, knowing whether means are statistically different becomes less important, and documenting the magnitude of the difference between means becomes more important. As such, the reporting of effect sizes is imperative to allow proper accumulation across studies. Unfortunately, current data accumulation (i.e., meta-analytic) methods require that a single continuous dependent variable be compared on a single dichotomous independent variable. Fortunately, although multiple estimates of these effect sizes exist, they can readily be converted to one another. In addition, many of the statistical tests can be converted to an appropriate effect size measure.

Converting between a point-biserial correlation and a standardized mean difference is relatively easy if one of them is already available. For example, the formula for the conversion of a point-biserial correlation to a standardized mean difference is
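A minimal sketch (not from the entry) of Equation 3 above; the two-group scores here are invented purely for illustration:

```python
import math

def point_biserial(group1, group2):
    """Equation 3: r_pb = (M1 - M2)/SD_Tot * sqrt(n1*n2 / (N*(N-1)))."""
    n1, n2 = len(group1), len(group2)
    N = n1 + n2
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    pooled = list(group1) + list(group2)
    grand_mean = sum(pooled) / N
    # SD_Tot: standard deviation across all observations (N - 1 denominator)
    sd_tot = math.sqrt(sum((x - grand_mean) ** 2 for x in pooled) / (N - 1))
    return (m1 - m2) / sd_tot * math.sqrt(n1 * n2 / (N * (N - 1)))

# Hypothetical scores for two groups of four participants each
r_pb = point_biserial([4, 5, 6, 7], [1, 2, 2, 3])
print(round(r_pb, 3))  # 0.882
```

With this divisor convention, the result equals the ordinary Pearson correlation between the scores and a 0/1 group-membership code, which is why the point-biserial is described above as a standard Pearson correlation.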
Median
The population median, like the population mean, is generally unknown. It must be inferred from the sample median, just like the use of the sample mean for inferring the population mean. In circumstances in which a sample can be fitted with a known probability model, the population median may be obtained directly from the model parameters. For instance, for a random variable that follows an exponential distribution with a scale parameter β (a scale parameter is the one that stretches or shrinks a distribution), the median is β ln 2 (where ln means natural logarithm, which has base e = 2.718281828...). If it follows a normal distribution with a location parameter μ and a scale parameter σ, the median is μ. For a random variable following a Weibull distribution with a location parameter μ, a scale parameter α, and a shape parameter γ (a shape parameter is the one that changes the shape of a distribution), the median is μ + α(ln 2)^{1/γ}. However, not all distributions have a median in closed form. Their population median cannot be obtained directly from a probability model but has to be estimated from the sample median.

Definition and Calculation

The sample median can be defined similarly, irrespective of the underlying probability distribution of a random variable. For a sample of n observations, x_1, x_2, ..., x_n, taken from a random variable X, rank these observations in ascending order from the smallest to the largest in value; the sample median, m, is defined as

m = \begin{cases} x_{k+1} & \text{if } n = 2k + 1 \\ (x_k + x_{k+1})/2 & \text{if } n = 2k \end{cases} \qquad (1)

That is, the sample median is the value of the middle observation of the ordered statistics if the number of observations is odd, or the average of the value of the two central observations if the number of observations is even. This is the most widely used definition of the sample median.

According to Equation 1, the sample median is obtained from order statistics. No arithmetical summation is involved, in contrast to the operation of obtaining the sample mean. The sample median can therefore be used on data in ordinal, interval, and ratio scale, whereas the sample mean is best used on data in interval and ratio scale because it requires first the summation of all values in a sample.

The sample median as defined in Equation 1 is difficult to use when a population consists of all integers and a sample is taken with an even number of observations. Because the median should also be an integer, two medians could result. One may be called the lower median and the other, the upper median. To avoid calling two medians of a single sample, an alternative is simply to call the upper median the sample median, ignoring the lower one.

If the sample data are grouped into classes, the value of the sample median cannot be obtained according to Equation 1, as the individual values of the sample are no longer available. Under such a circumstance, the median is calculated for the particular class that contains the median. Two different approaches may be taken to achieve the same result. One approach starts with the frequency and cumulative frequency (see Ott and Mendenhall, 1994):

m = L + \frac{w}{f_m}\left(\frac{n}{2} - cf_b\right), \qquad (2a)

where m = the median, L = lower limit of the class that contains the median, n = total number of observations in the sample, cf_b = cumulative frequency for all classes before the class that contains the median, f_m = frequency for the class that contains the median, and w = interval width of the classes.

The other approach starts with the percentage and cumulative percentage:

m = L + \frac{w}{P_m}(50 - cP_b), \qquad (2b)

where 50 = the 50th percentile, cP_b = cumulative percentage for all classes before the class that contains the median, and P_m = percentage of the class that contains the median. Both L and w are defined as in Equation 2a. A more detailed description of this approach can be found in Arguing With Numbers by Paul Gingrich.

This second approach is essentially a special case of the approach used to interpolate the distance to a given percentile in grouped data. To do so, one needs only to replace the 50 in Equation 2b with a percentile of interest.
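A minimal sketch (not from the entry) of Equation 1 and of the grouped-data interpolation in Equations 2a and 2b; the values L = 61, w = 20, n = 50, cf_b = 13, f_m = 20, cP_b = 26, and P_m = 40 are taken from the worked example in this entry:

```python
def sample_median(xs):
    """Equation 1: middle order statistic (n odd), or the average of the
    two central order statistics (n even)."""
    s = sorted(xs)
    n = len(s)
    k = n // 2
    if n % 2 == 1:                # n = 2k + 1
        return s[k]               # the (k + 1)th ordered value
    return (s[k - 1] + s[k]) / 2  # n = 2k

def grouped_median_freq(L, w, n, cf_b, f_m):
    """Equation 2a: interpolate inside the class that contains the median."""
    return L + (w / f_m) * (n / 2 - cf_b)

def grouped_median_pct(L, w, cP_b, P_m):
    """Equation 2b: the same interpolation via cumulative percentages."""
    return L + (w / P_m) * (50 - cP_b)

print(sample_median([4, 5, 2, 6, 3]))           # 4
print(grouped_median_freq(61, 20, 50, 13, 20))  # 73.0
print(grouped_median_pct(61, 20, 26, 40))       # 73.0
```

As Equations 2a and 2b promise, both interpolation routes land on the same value.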
The percentile within an interval of interest can then be interpolated from the lower percentile bound of the interval width.

To show the usage of Equations 2a and 2b, consider the scores of 50 participants in a hypothetical contest, which are assigned into five classes with the class interval width = 20 (see Table 1). A glance at the cumulative percentage in the rightmost column of the table suggests that the median falls in the 61-to-80 class because it contains the 50th percentile of the sample population. According to Equation 2a, therefore, L = 61, n = 50, cf_b = 13, f_m = 20, and w = 20. The interpolated value for the median then is

m = 61 + \frac{20 \times (50/2 - 13)}{20} = 73.

Using Equation 2b, we have cP_b = 26, P_m = 40, and both L = 61 and w = 20, as before. The interpolated value for the median of the 50 scores, then, is

m = 61 + \frac{20 \times (50 - 26)}{40} = 73.

Table 1   Grouped Data for 50 Hypothetical Test Scores

Class     Number of      Cumulative Number      %     Cumulative %
          Observations   of Observations
0–20            2               2               4           4
21–40           5               7              10          14
41–60           6              13              12          26
61–80          20              33              40          66
81–100         17              50              34         100
Total          50                             100

Equations 2a and 2b are equally applicable to ordinal data. For instance, in a survey on the quality of customer services, the answers to the customer satisfaction question may be scored as dissatisfactory, fairly satisfactory, satisfactory, and strongly satisfactory. Assign a value of 1, 2, 3, or 4 (or any other ordered integers) to represent each of these classes from dissatisfactory to strongly satisfactory and summarize the number of responses corresponding to each of these classes in a table similar to the one above. One can then apply either Equation 2a or Equation 2b to find the class that contains the median of the responses.

Determining Which Central Tendency Measure to Use

As pointed out at the beginning, the mean, the median, and the mode are all location parameters that measure the central tendency of a sample. Which one of them should be used for reporting a scientific study? The answer to this question depends on the characteristics of the data, or more specifically, on the skewness (i.e., the asymmetry of distribution) of the data. A distribution is said to be skewed if one of its two tails is longer than the other. In the statistics literature, the mean (μ), median (m), and mode (M) inequality is well known for both continuous and discrete unimodal distributions. The three statistics occur either in the order M ≤ m ≤ μ or in the reverse order M ≥ m ≥ μ, depending on whether the random variable is positively or negatively skewed.

For random variables that follow a symmetric distribution such as the normal distribution, the sample mean, median, and mode are equal and can all be used to describe the sample central tendency. Despite this, the median, as well as the mode, of a normal distribution is not used as frequently as the mean. This is because the variability (V) associated with the sample mean is much smaller than the V associated with the sample median (V[m] = [1.2533]² V[x̄]). If a random variable follows a skewed (i.e., nonsymmetrical) distribution, the sample mean, median, and mode are not equal. The median differs substantially from both the arithmetic mean and the mode and is a better measure of the central tendency of a random sample because the median is the minimizer of the mean absolute deviation in a sample. Take the sample set {2, 2, 3, 3, 3, 4, 15} as an example. The median is 3 (so is the mode), which is a far better measure of the centrality of the data set than the arithmetic mean of 4.57. The latter is largely influenced by the last extreme value, 15, and does not adequately describe the central tendency of the data set. From this simple illustration, it can be concluded that the sample median should be favored
over the arithmetic sample mean in describing the centrality whenever the distribution of a random variable is skewed. Examples of such skewed data can be found frequently in economic, sociological, education, and health studies. A few examples are the salary of employees of a large corporation, the net income of households in a city, the house price in a country, and the survival time of cancer patients. A few high-income earners, or a few high-end properties, or a few longer survivors could skew their respective sample disproportionally. Use of the sample mean or mode to represent the data centrality would be inappropriate.

The median is sometimes called the average. This term may be confused with the mean by some people who are not familiar with a specific subject in which this interchangeable usage is frequent. In scientific reporting, this interchangeable use is better avoided.

Advantages

Compared with the sample mean, the sample median has two clear advantages in measuring the central tendency of a sample. The first advantage is that the median can be used for all data measured in ordinal, interval, and ratio scale because it does not involve the mathematical operation of summation, whereas the mean is best used for data measured in interval and ratio scale. The second advantage is that the median gives a measure of central tendency that is more robust than the mean if outlier values are present in the data set, because it is not affected by whether the distribution of a random variable is skewed. In fact, the median, not the mean, is a preferred parameter in describing the central tendency of such random variables when their distribution is skewed. Therefore, whether to use the sample median as a central tendency measure depends on the data type. The median is used if a random variable is measured in ordinal scale or if a random variable produces extreme values in a set. In contrast, the mean is a better measure of the sample central tendency if a random variable is continuous and is measured in interval or ratio scale, and if data arising from the random variable contain no extreme value.

When one is using the sample median, it helps to remember its four important characteristics, as pointed out by Lyman Ott and William Mendenhall:

1. The median is the central value of a data set, with half of the set above it and half below it.

2. The median is between the largest and the smallest value of the set.

3. The median is free of the influence of extreme values of the set.

4. Only one median exists for the set (except in the difficult case in which an even number of observations is taken from a population consisting of only integers).

Shihe Fan

See also Central Tendency, Measures of; Mean; Mode

Further Readings

Abdous, B., & Theodorescu, R. (1998). Mean, median, mode IV. Statistica Neerlandica, 52, 356–359.
Gingrich, P. (1995). Arguing with numbers: Statistics for the social sciences. Halifax, Nova Scotia, Canada: Fernwood.
Groeneveld, R. A., & Meeden, G. (1977). The mode, median, and mean inequality. American Statistician, 31, 120–121.
Joag-Dev, K. (1989). MAD property of a median: A simple proof. American Statistician, 43, 26–27.
MacGillivray, H. L. (1981). The mean, median, mode inequality and skewness for a class of densities. Australian Journal of Statistics, 23, 247–250.
Ott, L., & Mendenhall, W. (1994). Understanding statistics (6th ed.). Pacific Grove, CA: Duxbury.
Wackerly, D. D., Mendenhall, W., III, & Scheaffer, R. L. (2002). Mathematical statistics with applications (6th ed.). Pacific Grove, CA: Duxbury.

META-ANALYSIS

Meta-analysis is a statistical method that integrates the results of several independent studies considered to be "combinable." It has become one of the major tools to integrate research findings in social and medical sciences in general and in education and psychology in particular. Although the history of meta-analytic procedures goes all the way back
to the early 1900s and the work of Karl Pearson and others, who devised statistical tools to compare studies from different samples, Gene V. Glass coined the term in 1976. Glass, Barry McGaw, and Mary Lee Smith described the essential characteristics of meta-analysis as follows:

1. It is undeniably quantitative, that is, it uses numbers and statistical methods for organizing and extracting information.

2. It does not prejudge research findings in terms of research quality (i.e., no a priori arbitrary and nonempirical criteria of research quality are imposed to exclude a large number of studies).

3. It seeks general conclusions from many separate investigations that address related or identical hypotheses.

Meta-analysis involves developing concise criteria for inclusion (i.e., sampling), searching the literature for relevant studies (i.e., recruitment), coding study variables (i.e., data entry), calculating standardized effect sizes for individual studies, and generating an overall effect size across studies (i.e., data analysis). Unlike primary studies, in which each case in a sample is a unit of analysis, the unit of analysis for meta-analysis is the individual study. The effect sizes calculated from the data in an individual study are analogous to the dependent variable, and the substantive and methodological characteristics affecting the study results are defined as independent variables. Any standardized index that can be used to understand different statistical findings across studies in a common metric can be used as an "effect size." The effect size metric represents both the magnitude and direction of the relation of interest across different primary studies in a standardized metric. A variety of alternatives are available for use with variables that are either continuous or discrete, such as the accumulation of correlations (effect size r), standardized differences between mean scores (effect size d), p values, or z scores (effect size ES). The dependent variable in meta-analysis is computed by transforming findings of each reviewed study into a common metric that relies on either r or d as the combined statistic.

Meta-analysis is not limited to descriptive reviews of research results but can also examine how and why such findings occur. With the use of multivariate statistical applications, meta-analysis can address multiple hypotheses. It may examine the relation between several variables and account for consistencies as well as inconsistencies within a sample of study findings. Because of demand for robust research findings and with the advance of statistical procedures, meta-analysis has become one of the major tools for integrating research findings in social and medical science as well as the field of education, where it originated. A recent search of the ERIC database identified more than 618 articles published between 1980 and 2000 that use meta-analysis in their title, as opposed to only 36 written before 1980. In the field of psychology, the gap was 12 versus 1,623, and in the field of medical studies, the difference is even more striking: 7 versus 3,571. Evidence in other fields shows the same trend toward meta-analysis's becoming one of the main tools for evidence-based research.

According to the publication manual of the American Psychological Association, a review article organizes, integrates, and critically evaluates already published material. Meta-analysis is only one way of reviewing or summarizing research literature. Narrative review is the more traditional way of reviewing research literature. There are several differences between traditional narrative reviews and meta-analysis. First, because there are very few systematic procedures, the narrative review is more susceptible to subjective bias and therefore more prone to error than are meta-analytic reviews. In the absence of formal guidelines, reviewers of a certain literature can disagree about many critical issues, such as which studies to include and how to support conclusions with a certain degree of quantitative evidence. In an adequately presented meta-analytical study, one should be able to replicate the review by following the procedure reported in the study.

Narrative review and meta-analysis are also different in terms of the scope of the studies that they can review. The narrative review can be inefficient for reviewing 50 or more studies. This is especially true when the reviewer wants to go beyond describing the findings and explain multiple relations among different variables. Unlike narrative reviews, meta-analysis can put together all available data to answer questions about overall study findings and how they can be accounted for by
various factors, such as sample and study characteristics. Meta-analysis can, therefore, lead to the identification of various theoretical and empirical factors that may permit a more accurate understanding of the issues being reviewed. Thus, although meta-analysis can provide a better assessment of literature because it is more objective, replicable, and systematic, it is important to note that a narrative description of each study is key to any good meta-analytic review, as it will help a meta-analyst determine which studies to include and what qualitative information about the studies can and should be coded and statistically related to quantitative outcomes in order to evaluate the complexity of topics being reviewed.

The remainder of this entry addresses the methodological issues associated with meta-analytic research and then describes the steps involved in conducting meta-analysis.

Methodological Issues

Like any other research strategy, meta-analysis is not a perfect solution in research review. Glass summarized the main issues with meta-analysis in four domains: quality, commensurability, selection bias, and nonindependence. The quality problem has been a very controversial issue in meta-analysis. At issue is whether the quality of studies should be included as a selection criterion. To avoid any bias in selection, one option is to include as many studies as possible, regardless of their quality. Others, however, question the practice of including studies of poor quality, as doing so limits the validity of the overall conclusions of the review. The commensurability problem refers to the most common criticism of meta-analysis: that it compares apples and oranges. In other words, meta-analysis is illogical because it mixes constructs from studies that are not the same.

The selection-bias problem refers to the inevitable scrutiny of the claim that the meta-analytic review is comprehensive and nonbiased in its reviewing process. Meta-analysis is not inherently immune from selection bias, as its findings will be biased if there are systematic differences across journal articles, book articles, and unpublished articles. Publication bias is a major threat to the validity of meta-analysis. The file drawer effect, which refers to the fact that published studies are less likely to show statistically nonsignificant results, may introduce a bias in the conclusions of meta-analyses. Not only can the decision of whether to include unpublished studies lead to bias, but decisions about how to obtain data and which studies to include can also contribute to selection bias. The unpublished studies that can be located may thus be an unrepresentative sample of unpublished studies. A review of meta-analytical studies published between 1988 and 1991 indicated that most researchers had searched for unpublished material, yet only 31% included unpublished studies in their review. Although most of these researchers supported the idea of including unpublished data in meta-analysis, only 47% of journal editors supported this practice.

The nonindependence problem in meta-analysis refers to the assumption that each study in the review is taken randomly from a common population; that is, that the individual studies are independent of one another. "Lumping" sets of independent studies can reduce the reliability of estimations of averages or regression equations. Although Glass and his associates argued that the nonindependence assumption is a matter of practicality, they admit that this problem is the one criticism that is not "off the mark and shallow" (Glass et al., p. 229).

Steps

Although meta-analytic reviews can take different forms depending on the field of study and the focus of the review, there are five general steps in conducting meta-analysis. The first step involves defining and clarifying the research question, which includes selecting inclusion criteria. Similar to selecting a sample for an empirical study, inclusion criteria for a meta-analysis have to be specified following a theoretical or empirical guideline. The inclusion criteria greatly affect the conclusions drawn from a meta-analytic review. Moreover, the inclusion criteria are one of the steps in a meta-analytic study where bias or subjectivity comes into play. Two critical issues should be addressed at this stage of meta-analysis: (a) Should unpublished studies be included? and (b) Should the quality of the studies be included as part of the inclusion criteria? There are no clear answers to these questions. Glass and colleagues, for example, argued against strict inclusion criteria based on
assessing study quality a priori, because a meta-analysis itself can empirically determine whether study quality is related to variance in reported study findings. While Glass and others argued for inclusion of all studies, including unpublished reports, in order to avoid publication bias toward null findings in the literature, it is possible to empirically assess research quality with a set of methodological variables as part of the meta-analytic data analysis. In other words, instead of eliminating a study based on the reviewer's judgment of its quality, one can empirically test the impact of study quality as a control or moderator variable.

The next step in meta-analysis is to identify studies to be included in the review. This step involves a careful literature search that involves computerized and manual approaches. Computerized search approaches include using discipline-specific databases such as PsycINFO in psychology, ERIC in education, MEDLINE in medical sciences, or Sociological Abstracts in sociology. Increasingly, searching the Internet with search engines such as Google (or Google Scholar) also helps identify relevant studies for meta-analytic review. All databases must be searched with the same set of keywords and search criteria in order to ensure reliability across the databases. It is also important to keep in mind that several vendors market the most popular databases, such as PsycINFO, and each vendor has a different set of defaults that determine the outcome of any search. It is, therefore, advisable for investigators to generate a single yet detailed logical search code and test it by using various vendors to see if their databases yield the same result.

Although computerized search engines save time and make it possible to identify relevant materials in large databases, they should be complemented with additional search strategies, including manual search. In fields in which there is no universally agreed-on keyword, for example, one can search key publications or citations of classic articles using the Social Science Citation Index, which keeps track of unique citations of each published article. If narrative reviews have been published recently, one can also check the cited articles in those reviews. Finally, once the final review pool is determined, one must also manually check the references in each of the articles to see if there are relevant studies that have not yet been included in the final pool. Each of these postelectronic search steps can also serve as a reliability check to see whether the original search code works well. In other words, if there are too many articles that were not part of the electronically searched pool, then it is possible that the search code was not a valid tool to identify relevant studies for the review. In those circumstances a modified search would be in order.

The third step in meta-analysis is the development of a coding schema. The goal of study coding is to develop a systematic procedure for recording the appropriate data elements from each study. William A. Stock identified six categories of study elements for systematic coding that address both substantive and methodological characteristics: report identification (study identifiers such as year of publication, authors), setting (the location or context of the study), subjects (participant characteristics), methodology (research design characteristics), treatment (procedures), and effect size (statistical data needed to calculate common effect size). One can modify these basic categories according to the specific focus of the review and with attention to the overall meta-analytic question and potential moderator factors. To further refine the coding scheme, a small subsample of the data (k = 10) must be piloted with two raters who did not take part in the creation of the coding schema.

The next step is to calculate effect sizes for each study by transforming individual study statistics into a common effect size metric. The goal of effect size transformation is to reflect with a common metric the relative magnitude of the relations reported in various independent studies. The three most commonly used effect size metrics in meta-analytic reviews are Cohen's d, correlation coefficient r, and the odds ratio. Cohen's d, or effect size d, is a metric that is used when the research involves mean differences or group contrasts. This is a method used in treatment studies or any design that calls for calculating standardized mean differences across groups in a variable that is continuous in nature. Correlation coefficient r can also serve as an effect size metric (or effect size r) when the focus of the review is identification of the direction and magnitude of the association between variables. Odds-ratio effect size is commonly used in
epidemiological reviews or in reviews that involve discontinuous variables (e.g., school dropout or diagnosis of a certain condition).

The calculation of an effect size index also requires a decision about the unit of analysis in a meta-analysis. There are two alternatives. The first alternative is to enter the effect size for each variable separately. For example, if a study reports one correlation on the basis of grade point average and another correlation on the basis of an achievement test score, there will be two different effect sizes for the study, one for grade point average and the other for achievement test score. Similarly, if correlations were reported for girls and boys separately, there will be two effect sizes, one for girls and one for boys. The second alternative is to use each study as the unit of analysis. This can be done by averaging effect sizes across the groups. For example, one could take the mean of the correlations for girls and boys and report a single effect size. Both of these approaches have their shortcomings. The former approach gives too much weight to those studies that have more outcome measures, but the latter approach obscures legitimate theoretical and empirical differences across dependent measures (i.e., gender differences may serve as a moderator in certain meta-analytic reviews).

Mark W. Lipsey and David B. Wilson suggest a third alternative that involves calculating an effect size for each independent sample when the focus of analysis is the sample characteristics (e.g., age, gender, race) but allowing for multiple effect sizes from a given study when the focus of the analysis is the study characteristics (e.g., multiple indicators of the same construct). In other words, the first alternative can be used to calculate an effect size for each distinct construct in a particular study; this alternative yields specific information for each particular construct being reviewed. The second alternative can be used to answer meta-analytic questions regarding sample characteristics, as well as to calculate the overall magnitude of the correlation.

The final step in meta-analysis involves testing the homogeneity of effect sizes across studies. The variation among study effect sizes can be analyzed using Hedges's Q test of homogeneity. If studies in meta-analysis provide a homogeneous estimate of a combined effect size across studies, then it is more likely that the various studies are testing the same hypothesis. However, if these estimates are heterogeneous, that is, between-study differences are due to unobserved random sources, an effort must be made to identify sample and study characteristics that explain the difference across the studies through the coding process. When combining the outcomes from different studies, one may also choose to use a fixed- or random-effects model. The fixed-effect model assumes that studies in the meta-analysis use identical methods, whereas the random-effects model assumes that studies are a random sample from the universe of all possible studies. The former model considers within-study variability as the only source of variation, while the latter model considers both within-study and between-study variation as sources of differences. Fixed- and random-effects models can yield very different results because fixed-effect models are likely to underestimate and random-effect models are likely to overestimate error variance when their assumptions are violated.

Thus, meta-analysis, like any other survey research undertaking, is an observational study of evidence. It has its own limitations and therefore should be undertaken rigorously by using well-defined criteria for selecting and coding individual studies, estimating effect size, aggregating significance levels, and integrating effects.

Selcuk R. Sirin

See also Cohen's d Statistic; Effect Size, Measures of; Fixed-Effects Models; Homogeneity of Variance; Inclusion Criteria; "Meta-Analysis of Psychotherapy Outcome Studies"; Mixed- and Random-Effects Models; Odds Ratio; Random-Effects Models

Further Readings

American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Cook, D. J., Guyatt, G. H., Ryan, G., Clifton, J., Buckingham, L., Willan, A., et al. (1993). Should unpublished data be included in meta-analyses? Current convictions and controversies. JAMA, 269, 2749–2753.
Cook, T. D., & Leviton, L. C. (1980). Reviewing the literature: A comparison of traditional methods with meta-analysis. Journal of Personality, 48, 449–472.
Cooper, H. M. (1989). Integrating research: A guide for literature reviews (2nd ed.). Newbury Park, CA: Sage.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury Park, CA: Sage.
Stock, W. A. (1994). Systematic coding for research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 125–138). New York: Russell Sage.

"META-ANALYSIS OF PSYCHOTHERAPY OUTCOME STUDIES"

The article "Meta-Analysis of Psychotherapy Outcome Studies," written by Mary Lee Smith and Gene Glass and published in American Psychologist in 1977, initiated the use of meta-analysis as a statistical tool capable of summarizing the results of numerous studies addressing a single topic. In meta-analysis, individual research studies are identified according to established criteria and treated as a population, with results from each study subjected to coding and entered into a database, where they are statistically analyzed. Smith and Glass pioneered the application of meta-analysis in research related to psychological treatment and education. Their work is considered a major contribution to the scientific literature on psychotherapy and has spurred hundreds of other meta-analytic studies since its publication.

Historical Context

Smith and Glass conducted their research both in response to the lingering criticisms of psychotherapy lodged by Hans Eysenck beginning in 1952 and in an effort to integrate the increasing volume of studies addressing the efficacy of psychological treatment. In a scathing review of psychotherapy, Eysenck had asserted that any benefits derived from treatment could be attributed to the spontaneous remission of psychological symptoms rather than to the therapy applied. His charge prompted numerous studies on the efficacy of treatment, often resulting in variable and conflicting findings.

Prior to the Smith and Glass article, behavioral researchers were forced to rely on a narrative synthesis of results or on an imprecise tallying method to compare outcome studies. Researchers from various theoretical perspectives highlighted studies that supported their work and dismissed or disregarded findings that countered their position. With the addition of meta-analysis to the repertoire of evaluation tools, however, researchers were able to objectively evaluate and refine their understanding of the effects of psychotherapy and other behavioral interventions. Smith and Glass determined that, on average, an individual who had participated in psychotherapy was better off than 75% of those who were not treated. Reanalyses of the Smith and Glass data, as well as more recent meta-analytic studies, have yielded similar results.

Effect Size

Reviewing 375 studies on the efficacy of psychotherapy, Smith and Glass calculated an index of effect size to determine the impact of treatment on patients who received psychotherapy versus those assigned to a control group. The effect size was equal to the difference between the means of the experimental and control groups divided by the standard deviation of the control group. A positive effect size communicated the efficacy of a psychological treatment in standard deviation units. Smith and Glass found an effect size of .68, indicating that after psychological treatment, individuals who had completed therapy were superior to controls by .68 standard deviations, an effect size that is generally classified as moderately large.

Other Findings

While best known for its contribution to research on the general efficacy of psychotherapy, the Smith and Glass study also examined the relative efficacy of specific approaches to therapy by classifying studies into 10 theoretical types and calculating an effect size for each. Results indicated that approximately 10% of the variance in the effects of
treatment could be attributed to the type of therapy employed, although the results were confounded by differences in the individual studies, including the number of variables, the duration of treatment, the severity of the presenting problem, and the means by which progress was evaluated. The authors attempted to address these problems by collapsing the 10 types of therapies into four classes: ego therapies, dynamic therapies, behavioral therapies, and humanistic therapies, and then further collapsing the types of therapy into two superclasses labeled behavioral and nonbehavioral therapies. They concluded that differences among the various types of therapy were negligible. They also asserted that therapists' degrees and credentials were unrelated to the efficacy of treatment, as was the length of therapy.

Criticisms

Publication of the Smith and Glass article prompted a flurry of responses from critics, including Eysenck, who argued that the studies included in the meta-analysis were too heterogeneous to be compared and that many were poorly designed. Some critics pointed out that an unspecified proportion of studies included in the analysis did not feature an untreated control group. Further, some studies did not have a placebo control group to rule out the effects of attention or expectation among patients. A later reanalysis of the data by Janet Landman and Robyn Dawes published in 1982 used more stringent criteria and featured separate analyses that used only studies that included placebo controls. Their analyses reached conclusions that paralleled those of Smith and Glass.

Influence

The Smith and Glass study not only altered the landscape of the psychotherapeutic efficacy battle; it also laid the groundwork for meta-analytic studies investigating a variety of psychological and educational interventions. Their work provided an objective means of determining the outcome of a given intervention, summarizing the results of large numbers of studies, and indicating not only whether a treatment makes a difference, but how much of a difference.

At the time of the Smith and Glass publication, the statistical theory of meta-analysis was not yet fully articulated. More recent studies using meta-analysis have addressed the technical problems found in earlier work. As a result, meta-analysis has become an increasingly influential technique in measuring treatment efficacy.

Sarah L. Hastings

See also Control Group; Effect Size, Measures of; Meta-Analysis

Further Readings

Chambliss, C. H. (2000). A review of relevant psychotherapy outcome research. In C. H. Chambliss (Ed.), Psychotherapy and managed care: Reconciling research and reality (pp. 197–214). Boston: Allyn and Bacon.
Eysenck, H. J. (1952). The effects of psychotherapy. Journal of Consulting Psychology, 16, 319–324.
Landman, J. T., & Dawes, R. M. (1982). Psychotherapy outcome: Smith and Glass' conclusions stand up under scrutiny. American Psychologist, 37(5), 504–516.
Lipsey, M., & Wilson, D. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48(12), 1181–1209.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.
Wampold, B. E. (2000). Outcomes of individual counseling and psychotherapy: Empirical evidence addressing two fundamental questions. In S. D. Brown & R. W. Lent (Eds.), Handbook of counseling psychology (3rd ed.). New York: Wiley.

METHODS SECTION

The purpose of a methods section of a research paper is to provide the information by which a study's validity is judged. It must contain enough information so that (a) the study could be repeated by others to evaluate whether the results are reproducible, and (b) others can judge whether the results and conclusions are valid. Therefore, the methods section should provide a clear and precise description of how a study was done and the rationale for the specific procedures chosen.
Historically, the methods section was referred to as the "materials and methods section" to emphasize the two areas that must be addressed. "Materials" referred to what was studied (e.g., humans, animals, tissue cultures), treatments applied, and instruments used. "Methods" referred to the selection of study subjects, data collection, and data analysis. In some fields of study, because "materials" does not apply, alternative headings such as "subjects and methods," "patients and methods," or simply "methods" have been used or recommended.

Below are the items that should be included in a methods section.

Subjects or Participants

If human or animal subjects were used in the study, who the subjects were and how they were relevant to the research question should be described. Any details that are relevant to the study should be included. For humans, these details include gender, age, ethnicity, socioeconomic status, and so forth, when appropriate. For animals, these details include gender, age, strain, weight, and so forth. The researcher should also describe how many subjects were used and how they were selected. The selection criteria and rationale for enrolling subjects into the study must be stated explicitly. For example, the researcher should define study and comparison subjects and the inclusion and exclusion criteria of subjects. If the subjects were human, the type of reward or motivation used to encourage them to participate should be stated. When working with human or animal subjects, there must be a declaration that an ethics or institutional review board has determined that the study protocol adheres to ethical principles. In studies involving animals, the preparations made prior to the beginning of the study must be specified (e.g., use of sedation and anesthesia).

Study Design

The design specifies the sequence of manipulations and measurement procedures that make up the study. Some common designs are experiments (e.g., randomized trials, quasi-experiments), observational studies (e.g., prospective or retrospective cohort, case-control, cross-sectional), qualitative methods (e.g., ethnography, focus groups), and others (e.g., secondary data analysis, literature review, meta-analysis, mathematical derivations, and opinion-editorial pieces). Here is a brief description of the designs. Randomized trials involve the random allocation by the investigator of subjects to different interventions (treatments or conditions). Quasi-experiments involve nonrandom allocation. Both cohort (groups based on exposures) and case-control (groups based on outcomes) studies are longitudinal studies in which exposures and outcomes are measured at different times. Cross-sectional studies measure exposures and outcomes at a single time. Ethnography uses fieldwork to provide a descriptive study of human societies. A focus group is a form of qualitative research in which people assembled in a group are asked about their attitude toward a product or concept. An example of secondary data is the abstraction of data from existing administrative databases. A meta-analysis combines the results of several studies that address a set of related research hypotheses.

Data Collection

The next step in the methods section is a description of the variables that were measured and how these measurements were made. In laboratory and experimental studies, the description of measurement instruments and reagents should include the manufacturer and model, calibration process, and how measurements were made. In epidemiologic and social studies, the development and pretest of questionnaires, training of interviewers, data extraction from databases, and conduct of focus groups should be described where appropriate. In some cases, the survey instrument (questionnaire) may be included as an appendix to the research paper.

Data Analysis

The last step in the methods section is to describe the way in which the data will be presented in the results section. For quantitative data, this step should specify whether and which
statistical tests will be used for making the inference. If statistical tests are used, this part of the methods section must specify the significance level and whether tests are one- or two-sided, or the type of confidence intervals. For qualitative data, a common analysis is observer impression. That is, expert or lay observers examine the data, form an impression, and report their impression in a structured, quantitative form.

The following are some tips for writing the methods section: (a) The writing should be direct and precise. Complex sentence structures and unimportant details should be avoided. (b) The rationale or assumptions on which the methods are based may not always be obvious to the audience and so should be explained clearly. This is particularly true when one is writing for a general audience, as opposed to a subspecialty group. The writer must always keep in mind who the audience is. (c) The methods section should be written in the past tense. (d) Subheadings, such as participants, design, and so forth, may help readers navigate the paper. (e) If the study design is complex, it may be helpful to include a diagram, table, or flowchart to explain the methods used. (f) Results should not be placed in the methods section. However, the researchers may include preliminary results from a pilot test they used to design the main study they are reporting.

The methods section is important because it provides the information the reader needs to judge the study's validity. It should provide a clear and precise description of how a study was conducted and the rationale for specific study methods and procedures.

Bernard Choi and Anita Pak

See also Discussion Section; Results Section; Validity of Research Conclusions

Further Readings

Branson, R. D. (2004). Anatomy of a research paper. Respiratory Care, 49, 1222–1228.
Hulley, S. B., Newman, T. B., & Cummings, S. R. (1988). The anatomy and physiology of research. In S. B. Hulley & S. R. Cummings (Eds.), Designing clinical research (pp. 1–11). Baltimore: Williams & Wilkins.
Kallet, R. H. (2004). How to write the methods section of a research paper. Respiratory Care, 49, 1229–1232.
Van Damme, H., Michel, L., Ceelen, W., & Malaise, J. (2007). Twelve steps to writing an effective "materials and methods" section. Acta Chirurgica Belgica, 107, 102.

METHOD VARIANCE

Method is what is used in the process of measuring something, and it is a property of the measuring instrument. The term method effects refers to the systematic biases caused by the measuring instrument. Method variance refers to the amount of variance attributable to the methods that are used. In psychological measures, method variance is often defined in relationship to trait variance. Trait variance is the variability in responses due to the underlying attribute that one is measuring. In contrast, method variance is defined as the variability in responses due to characteristics of the measuring instrument. After sketching a short history of method variance, this entry discusses features of measures and method variance analyses and describes approaches for reducing method effects.

A Short History

No measuring instrument is free from error. This is particularly germane in social science research, which relies heavily on self-report instruments. Donald Thomas Campbell was the first to mention the problem of method variance. In 1959, Campbell and Donald W. Fiske described the fallibility inherent in all measures and recommended the use of multiple methods to reduce error. Because no single method can be the gold standard for measurement, they proposed that multiple methods be used to triangulate on the underlying "true" value. The concept was later extended to unobtrusive measures.

Method variance has not been well defined in the literature. The assumption has been that the reader knows what is meant by method variance. It is often described in a roundabout way, in relationship to trait variance. Campbell and Fiske pointed out that there is no fixed demarcation between trait and method. Depending on the goals
of a particular research project, a characteristic may be considered either a method or a trait. Researchers have reported the methods that they use as different tests, questionnaires with different types of answers, self-report and peer ratings, clinician reports, or institutional records, to name a few.

In 1950 Campbell differentiated between structured and nonstructured measures, along with those whose intent was disguised, versus measures that were obvious to the test taker. Later Campbell and others described the characteristics associated with unobtrusive methods, such as physical traces and archival records. More recently, Lee Sechrest and colleagues extended this characterization to observable methods.

Others have approached the problem of method from an "itemetric" level, in paper-and-pencil questionnaires. A. Angleitner, O. P. John, and F. Löhr proposed a series of item-level characteristics, including overt reactions, covert reactions, bodily symptoms, wishes and interests, attributes of traits, attitudes and beliefs, biographical facts, others' reactions, and bizarre items.

Obvious Methods

There appear to be obvious, or manifest, features of measurement, and these include stimulus formats, response formats, response categories, raters, direct rating versus summative scale, whether the stimulus or response is rated, and finally, opaque versus transparent measures. These method characteristics are usually mentioned in articles to describe the methods used. For example, an abstract may describe a measure as "a 30-item true–false test with three subscales," "a structured interview used to collect school characteristics," or "patient functioning assessed by clinicians using a 5-point scale." […] response formats used in social science research are written paper-and-pencil tests.

Response Categories

The response categories include the ways an item may be answered. Examples of response categories include multiple-choice items, matching, Likert-type scales, true–false answers, responses to open-ended questions, and visual analogue scales. Close-ended questions are used most frequently, probably because of their ease of administration and scoring. Open-ended questions are used less frequently in social science research. Often the responses to these questions are very short, or the question is left blank. Open-ended questions require extra effort to code. Graphical responses such as visual analogue scales are infrequently used.

Raters

Raters are a salient method characteristic. Self-report instruments comprise the majority of measures. In addition to the self as rater, other raters include, for example, teachers, parents, and peers. Other raters may be used in settings with easy access to them. For example, studies conducted in schools often include teacher ratings and may collect peer and parent ratings. Investigations in medical settings may include ratings by clinicians and nurses.

The observability of the trait in question probably determines the accuracy of the ratings by others. An easily observable trait such as extroversion will probably generate valid ratings. However, characteristics that cannot be seen, particularly those that the respondent chooses to hide, will be harder to rate. Racism is a good example of a characteristic that may not be amenable to ratings.
Single items may be sufficient if a trait is obvious and/or the respondent does not care about the results.

Rating the Stimulus Versus Rating the Response

Rating the prestige of colleges or occupations is an example of rating the stimulus; self-report questionnaires for extroversion or conscientiousness are examples of rating the response. The choice depends on the goals of the study.

Opaque Versus Transparent Measures

This method characteristic refers to whether the purpose of a test is easily discerned by the respondent. The Stanford-Binet is obviously a test of intelligence, and the Myers-Briggs Type Indicator inventory measures extroversion. These are transparent tests. If the respondent cannot easily guess the purpose of a test, it is opaque.

Types of Analyses Used for Method Variance

If a single method is used, it is not possible to estimate method effects. Multiple methods are required in an investigation in order to study method effects. When multiple methods are collected, they must be combined in some way to estimate the underlying trait. Composite scores or latent factor models are used to estimate the trait. If the measures in a study have used different sources of error, the resulting trait estimate will contain less method bias.

Estimating the effect of methods is more complicated. Neal Schmitt and Daniel Stutts have provided an excellent summary of the types of analyses that may be used to study method variance. Currently, the most popular method of analysis for multitrait–multimethod matrices is confirmatory factor analysis. However, there are a variety of problems inherent in this method, and generalizability theory analysis shows promise for multitrait–multimethod data.

Does Method Variance Pose a Real Problem?

The extent of variance attributable to methods has not been well studied, although several interesting articles have focused on it. Joseph A. Cote and M. Ronald Buckley examined 70 published studies and reported that trait accounted for more than 40% of the variance and method accounted for approximately 25%. D. Harold Doty and William H. Glick obtained similar results.

Reducing Effects of Methods

A variety of approaches can be used to lessen the effects of methods in research studies. Awareness of the problem is an important first step. The second is to avoid measurement techniques laden with method variance. Third, incorporate multiple measures that use maximally different methods, with different sources of error variance. Finally, the multiple measures can be combined into a trait estimate during analysis. Each course of action reduces the effects of methods in research studies.

Melinda Fritchoff Davis

See also Bias; Confirmatory Factor Analysis; Construct Validity; Generalizability Theory; Multitrait–Multimethod Matrix; Rating; Triangulation; True Score; Validity of Measurement

Further Readings

Campbell, D. T. (1950). The indirect assessment of social attitudes. Psychological Bulletin, 47, 15–38.
Cote, J. A., & Buckley, M. R. (1987). Estimating trait, method, and error variance: Generalizing across 70 construct validation studies. Journal of Marketing Research, 24, 315–318.
Doty, D. H., & Glick, W. H. (1998). Common methods bias: Does common methods variance really bias results? Organizational Research Methods, 1(4), 374–406.
Schmitt, N., & Stutts, D. M. (1986). Methodology review: Analysis of multitrait-multimethod matrices. Applied Psychological Measurement, 10, 1–22.
Sechrest, L. (1975). Another look at unobtrusive measures: An alternative to what? In W. Sinaiko & L. Broedling (Eds.), Perspectives on attitude assessment: Surveys and their alternatives (pp. 103–116). Washington, DC: Smithsonian Institution.
Sechrest, L., Davis, M. F., Stickle, T., & McKnight, P. (2000). Understanding "method" variance. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 63–88). Thousand Oaks, CA: Sage.
Webb, E. T., Campbell, D. T., Schwartz, R. D., Sechrest, L., & Grove, J. B. (1981). Nonreactive measures in
Missing Data, Imputation of
precious. If the missings do not occur at random, which is the most common situation, then deleting can create significant bias. For some situations, it is possible to repair the bias through weighting, as in poststratification for surveys. If the data set is small or otherwise precious, then deleting can severely reduce the statistical power or value of the data analysis.

Imputation can repair the missing data by creating one or more versions of how the data set should appear. By leveraging external knowledge, good technique, or both, it is possible to reduce bias due to missing values. Some techniques offer a quick improvement over deletion. Software is making these techniques faster and sharper; however, the techniques should be conducted by those with appropriate training.

Categorizing Missingness

Missingness can be categorized in two ways: the physical structure of the missings and the underlying nature of the missingness. First, the structure of the missings can be due to item or unit missingness, the merging of structurally different data sets, or barriers attributable to the data collection tools. Item missingness refers to the situation in which a single value is missing for a particular observation, and unit missingness refers to the situation in which all the values for an observation are missing. Figure 1 provides an illustration of missingness.

Second, missings can be categorized by the underlying nature of the missingness. These three categories are (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), summarized in Table 1 and discussed below.

Categorizing missings into one of these three groups provides better judgment as to the most appropriate imputation technique and the ramifications of employing that technique. MCAR is the least common, yet the easiest to address. MAR can be thought of as missing partially at random; the point is that there is some pattern that can be leveraged. There are statistical tests for inferring MCAR and MAR. There are many imputation techniques geared toward MAR. The potential of these techniques depends on the degree to which other variables are related to the missings. MNAR is also known as informative missing, or nonignorable missingness. It is the most difficult to address. The most promising approach is to use external data to identify and repair this problem.
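The practical stakes of the MCAR/MAR distinction can be shown with a short simulation. This sketch is not from the entry, and all of its data-generating numbers are arbitrary assumptions: complete-case deletion leaves the mean essentially unbiased when values are missing completely at random, but biases it when the chance of missingness depends on an observed variable.

```python
import random
from statistics import mean

random.seed(1)
n = 50_000

# Two related variables: x is always observed, y is sometimes missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]

# MCAR: every y value has the same 30% chance of being missing.
y_mcar = [yi if random.random() > 0.3 else None for yi in y]

# MAR: missingness depends on the observed x (large x -> often missing).
y_mar = [yi if random.random() > (0.6 if xi > 0 else 0.1) else None
         for xi, yi in zip(x, y)]

def complete_case_mean(values):
    return mean(v for v in values if v is not None)

print(f"true mean of y:                {mean(y): .3f}")
print(f"complete-case mean under MCAR: {complete_case_mean(y_mcar): .3f}")
print(f"complete-case mean under MAR:  {complete_case_mean(y_mar): .3f}")
```

Under MAR the observed cases over-represent low-x observations, so the complete-case mean is pulled well below the true mean, which is why deletion is described above as creating significant bias when missings do not occur completely at random.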
the other categories. The missings are assigned the category with the closest mean.

Hot deck and cold deck are techniques for imputing real data into the missings, with or without replacement. For hot deck, the donor data are the same data set, and for cold deck, the donor data are another data set. Hot deck avoids extrapolating outside the range space of the data set, and it better preserves the natural distribution than does imputation of a mean. Both tend to be better for MAR.

Regression (Least Squares)

Regression-based imputation predicts the missings on the basis of ordinary least-squares or weighted least-squares modeling of the nonmissing data. This assumes that relationships among the nonmissing data extrapolate to the missing-value space. This technique assumes that the data are MAR and not MCAR. It creates bias depending on the degree to which the model is overfit. As always, validation techniques such as bootstrapping or data splitting will curb the amount of overfitting.

Regression-based imputation underestimates the variance. Statisticians have studied the addition of random errors to the imputed values as a technique to correct this underestimation. The random errors can come from a designated distribution or from the observed data.

Regression-based imputation does not preserve the natural distribution or respect the associations between variables. Also, it repeats imputed values when the independent variables are identical.

The approximate Bayesian bootstrap uses logistic regression to predict missing and nonmissing values for the dependent variable, y, based on the observed x values. The observations are then grouped on the basis of the probability of the value missing. Candidate imputation values are randomly selected, with replacement, from the same group.

MLE–Expectation Maximization Algorithm

The expectation maximization algorithm is an iterative, two-step approach for finding an MLE for imputation. The initial step consists of deriving an expectation based on latent variables. This is followed by a maximization step, computing the MLE. The technique assumes an underlying distribution, such as the normal, mixed normal, or Student's t.

The MLE method assumes that missing values are MAR (as opposed to MCAR) and shares with regression the problem of overfitting. MLE is considered to be stronger than regression and to make fewer assumptions.

Multiple Imputation

Multiple imputation leverages another imputation technique to impute and reimpute the missings. This technique creates multiple versions of the data set; analyzes each one; and then combines the results, usually by averaging. The advantages are that this process is easier than MLE, robust to departures from underlying assumptions, and provides better estimates of variance than regression does.

Suggestions for Applications

A project's final results should include reasons for deleting or imputing. It should justify any selected imputation technique and enumerate the corresponding potential biases. As a check, it is advisable to compare the results obtained with imputations and those obtained without them. This comparison will reveal the effect due to imputation. Finally, there is an opportunity to clarify whether the missings provide an additional hurdle or valuable information.

See also Bias; Data Cleaning; Outlier; Residuals

Further Readings

Efron, B. (1994). Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89(426), 463–475.
Little, R. J. A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420), 1227–1237.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
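The regression-with-random-errors idea and the multiple-imputation recipe described in this entry can be sketched together in a few lines. The data, the missingness pattern, and the analysis (estimating the mean of y) are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated complete data and a MAR pattern: y is more likely to be
# missing when the observed covariate x is large.
n = 2_000
x = rng.normal(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)
y_obs = y.copy()
y_obs[rng.random(n) < np.clip(0.5 + 0.3 * x, 0, 1)] = np.nan

def impute_once(x, y_obs, rng):
    """Regression imputation plus a randomly drawn observed residual,
    which counteracts the variance underestimation noted in the entry."""
    ok = ~np.isnan(y_obs)
    slope, intercept = np.polyfit(x[ok], y_obs[ok], 1)
    resid = y_obs[ok] - (intercept + slope * x[ok])
    filled = y_obs.copy()
    filled[~ok] = intercept + slope * x[~ok] + rng.choice(resid, size=(~ok).sum())
    return filled

# Multiple imputation: create m completed data sets, analyze each one,
# and pool the results by averaging.
m = 20
pooled_mean = np.mean([impute_once(x, y_obs, rng).mean() for _ in range(m)])

complete_case_mean = np.nanmean(y_obs)   # biased low under this MAR pattern
print(pooled_mean, complete_case_mean, y.mean())
```

In practice the spread between the m analyses is also used (via Rubin's pooling rules) to widen the standard errors, which is what gives multiple imputation its better variance estimates.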
808 Mixed- and Random-Effects Models
In order to make the concepts more concrete, let y_ij denote the jth response obtained on the ith level of the factor, where j = 1, 2, . . . , n, and i = 1, 2, . . . , a. Here a denotes the number of levels of the factor, and n denotes the number of responses obtained on each level. For our example, a = 5, n = 6, and y_ij is the jth determination of the calcium content from the ith batch of raw material. The data analysis can be done assuming the following structure for the y_ijs, referred to as a model:

y_ij = μ + τ_i + e_ij,  (1)

where μ is a common mean, the quantity τ_i represents the effect due to the ith level of the factor (effect due to the ith batch), and e_ij represents experimental error. The e_ijs are assumed to be random, following a normal distribution with mean zero and variance σ². In the fixed-effects case, the τ_is are fixed unknown parameters, and the problem of interest is to test whether the τ_is are equal. The model for the y_ij is now referred to as a fixed-effects model. In this case, the restriction Σ_{i=1}^{a} τ_i = 0 can be assumed, without loss of generality. In the random-effects case, the τ_is are assumed to be random variables following a normal distribution with mean zero and variance σ²_τ. The model for the y_ij is now referred to as a random-effects model. Note that σ²_τ is a population variance; that is, it represents the variability among the population of levels of the factor. Now the problem of interest is to test the hypothesis that σ²_τ is zero. If this hypothesis is accepted, then the conclusion is that the different levels of the factor do not exhibit significant variability among them. In the context of the example, if the batches are randomly selected, and if the hypothesis σ²_τ = 0 is not rejected, then the data support the conclusion that there is no significant variability among the different batches in the population.

Mixed- and Random-Effects Models for Multifactor Experiments

In the context of the same example, suppose the six calcium content measurements on each batch are made by six different operators. While carrying out the measuring process, there could be differences among the operators. In other words, in addition to the effect due to the batches, there exist effects due to the operators as well, accounting for the differences among them. A possible model that could capture both the batch effects and the operator effects is

y_ij = μ + τ_i + β_j + e_ij,  (2)

i = 1, 2, . . . , a; j = 1, 2, . . . , b, where y_ij is the calcium content measurement obtained from the ith batch by the jth operator; β_j is the effect due to the jth operator; and μ, the τ_is, and the e_ijs are as defined before. In the context of the example, a = 5 and b = 6. Now there are two input variables, that is, batches and operators, that are expected to influence the response (i.e., the calcium content measurement). This is an example of a two-factor experiment. Note that if the batches as well as the operators are randomly selected, then the τ_is, as well as the β_js, become random variables; the above model is then called a random-effects model. However, if only the batches are randomly selected (so that the τ_is are random), but the measurements are taken by a given group of operators (so that the β_js are fixed unknown parameters), then we have a mixed-effects model. That is, the model involves fixed effects corresponding to the given levels of one factor and random effects corresponding to a second factor, whose levels are randomly selected. When the β_js are fixed, the restriction Σ_{j=1}^{b} β_j = 0 may be assumed. For a random-effects model, independent normal distributions are typically assumed for the τ_is, for the β_js, and the e_ijs, similar to that for Model 1. When the effects due to a factor are random, the hypothesis of interest is whether the corresponding variance is zero. In the fixed-effects case, we test whether the effects are the same for the different levels of the factor.

Note that Model 2 makes a rather strong assumption, namely, that the combined effect due to the two factors, batch and operator, can be written as the sum of an effect due to the batch and an effect due to the operator. In other words, there is no interaction between the two factors. In practice, such an assumption may not always hold when responses are obtained based on the combined effects of two or more factors. If interaction is present, the model should include the combined effect due to the two factors. However, now multiple measurements are necessary to carry out the data
analysis. Thus, suppose each operator makes three determinations of the calcium content on each batch. Denote by y_ijk the kth calcium content measurement on the ith batch by the jth operator. When there is interaction, the assumed model is

y_ijk = μ + τ_i + β_j + γ_ij + e_ijk,  (3)

where i = 1, 2, . . . , a; j = 1, 2, . . . , b; and k = 1, 2, . . . , n (say). Now τ_i represents an average effect due to the ith batch. In other words, consider the combined effect due to the ith batch and the jth operator, and average it over j = 1, 2, . . . , b. We refer to τ_i as the main effect due to the ith batch. Similarly, β_j is the main effect due to the jth operator. The quantity γ_ij is the interaction between the ith batch and the jth operator. If a set of given batches and operators is available for the experiment (i.e., there is no random selection), then the τ_is, β_js, and γ_ijs are all fixed, and Equation 3 is then a fixed-effects model. On the other hand, if the batches are randomly selected, whereas the operators are given, then the τ_is and γ_ijs are random, but the β_js are fixed. Thus Model 3 now becomes a mixed-effects model. However, if the batches and operators are both randomly selected, then all the effects in Equation 3 are random, resulting in a random-effects model. Equation 3 is referred to as a two-way classification model with interaction. If the batches are randomly selected, whereas the operators are given, then a formal derivation of Model 3 will result in the conditions Σ_{j=1}^{b} β_j = 0 and Σ_{j=1}^{b} γ_ij = 0 for every i. That is, the random variables γ_ijs satisfy a restriction. In view of this, the γ_ijs corresponding to a fixed i will not be independent among themselves. The normality assumptions are typically made on all the random quantities.

In the context of Model 1, note that two observations from the same batch are correlated. In fact σ²_τ is also the covariance between two observations from the same batch. Other examples of correlated data where mixed- and random-effects models are appropriate include longitudinal data, clustered data, and repeated measures data. By including random effects in the model, it is possible for researchers to account for multiple sources of variation. This is indeed the purpose of using mixed- and random-effects models for analyzing data in the physical and engineering sciences, medical and biological sciences, social sciences, and so forth.

Data Analysis Based on Mixed- and Random-Effects Models

When the same number of observations is obtained on the various level combinations of the factors, the data are said to be balanced. For Model 3, balanced data correspond to the situation in which exactly n calcium content measurements (say, n = 6) are obtained by each operator from each batch. Thus the observations are y_ijk; k = 1, 2, . . . , n; j = 1, 2, . . . , b; and i = 1, 2, . . . , a. On the other hand, unbalanced data correspond to the situation in which the number of calcium content determinations obtained by the different operators is not the same for all the batches. For example, suppose there are five operators, and the first four make six measurements each on the calcium content from each batch, whereas the fifth operator could make only three observations from each batch because of time constraints; we then have unbalanced data. It could also be the case that an operator, say the first operator, makes six calcium content determinations from the first batch but only five each from the remaining batches. For unbalanced data, if n_ij denotes the number of calcium content determinations made by the jth operator on the ith batch, the observations are y_ijk; k = 1, 2, . . . , n_ij; j = 1, 2, . . . , b; and i = 1, 2, . . . , a. The analysis of unbalanced data is considerably more complicated under mixed- and random-effects models, even under normality assumptions. The case of balanced data is somewhat simpler.

Analysis of Balanced Data

Consider the simple Model 1 with balanced data, along with the normality assumptions for the distribution of the τ_is and the e_ijs with variances σ²_τ and σ²_e, respectively. The purpose of the data analysis can be to estimate the variances σ²_τ and σ²_e, to test the null hypothesis that σ²_τ = 0, and to compute a confidence interval for σ²_τ and sometimes for the ratios σ²_τ/σ²_e and σ²_τ/(σ²_τ + σ²_e). Note that the ratio σ²_τ/σ²_e provides information on the relative magnitude of σ²_τ compared with that of σ²_e. If the variability in the data is mostly due to the variability among the different levels of the factor, σ²_τ is expected to be large compared with σ²_e, and the hypothesis σ²_τ = 0 is expected to be rejected. Also note that since the variance of the observations, that
is, the variance of the y_ijs in Model 1, is simply the sum σ²_τ + σ²_e, the ratio σ²_τ/(σ²_τ + σ²_e) is the fraction of the total variance that is due to the variability among the different levels of the factor. Thus the individual variances as well as the above ratios have practical meaning and significance.

Now consider Model 3 with random effects and with the normality assumptions τ_i ~ N(0, σ²_τ), β_j ~ N(0, σ²_β), γ_ij ~ N(0, σ²_γ), and e_ijk ~ N(0, σ²_e), where all the random variables are assumed to be independently distributed. Now the problems of interest include the estimation of the different variances and testing the hypothesis that the random-effects variances are zero. For example, if the hypothesis σ²_γ = 0 cannot be rejected, we conclude that there is no significant interaction. If Model 3 is a mixed-effects model, then the normality assumptions are made on the effects that are random. Note, however, that the γ_ijs, although random, will no longer be independent in the mixed-effects case, in view of the restriction Σ_{j=1}^{b} γ_ij = 0 for every i.

The usual analysis of variance (ANOVA) decomposition can be used to arrive at statistical procedures to address all the above problems. To define the various ANOVA sums of squares for Model 3 in the context of our example on calcium content determination from different batches using different operators, let

ȳ_ij· = (1/n) Σ_{k=1}^{n} y_ijk,  ȳ_i·· = (1/bn) Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk,
ȳ_·j· = (1/an) Σ_{i=1}^{a} Σ_{k=1}^{n} y_ijk,  ȳ_··· = (1/abn) Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk.

If SS_τ, SS_β, SS_γ, and SS_e denote the ANOVA sums of squares due to the batches, operators, interaction, and error, respectively, these are given by

SS_τ = bn Σ_{i=1}^{a} (ȳ_i·· − ȳ_···)²,  SS_β = an Σ_{j=1}^{b} (ȳ_·j· − ȳ_···)²,
SS_γ = n Σ_{i=1}^{a} Σ_{j=1}^{b} (ȳ_ij· − ȳ_···)² − SS_τ − SS_β,
SS_e = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (y_ijk − ȳ_ij·)².

The following table shows the ANOVA table and the expected mean squares in the mixed-effects case and in the random-effects case; these are available in a number of books dealing with mixed- and random-effects models, in particular in Montgomery's book on experimental designs. In the mixed-effects case, the β_js are fixed, but the τ_is and γ_ijs are random. In other words, the batches are randomly selected, but the operators consist of a fixed group. In the random-effects case, the β_js, the τ_is, and the γ_ijs are all random. In the table, the notations MS_τ and so forth are used to denote mean squares.

Note that the expected values can be quite different depending on whether we are in the mixed-effects setup or the random-effects setup. Also, when an expected value is a linear combination of only the variances, then the sum of squares, divided by the expected value, has a chi-square distribution. For example, in Table 1, in the mixed-effects case, SS_τ/(σ²_e + bnσ²_τ) follows a chi-square distribution with a − 1 degrees of freedom. However, if an expected value also involves the fixed-effects parameters, then the chi-square distribution holds under the appropriate hypothesis concerning the fixed effects. Thus, in Table 1, in the mixed-effects case, SS_β/(σ²_e + nσ²_γ) follows a chi-square distribution with b − 1 degrees of freedom, under the hypothesis β_1 = β_2 = · · · = β_b = 0. Furthermore, for testing the various hypotheses, the denominator of the F ratio is not always the mean square due to error. If one compares the expected values in Table 1, one can see that for testing σ²_τ = 0 in the mixed-effects case, the F ratio is MS_τ/MS_e. However, for testing β_1 = β_2 = · · · = β_b = 0 in the mixed-effects case, the F ratio is MS_β/MS_γ. In view of this, it is necessary to know the expected values before we can decide on the appropriate F ratio for testing a hypothesis under mixed- and random-effects models. Fortunately, procedures are available for the easy calculation of the expected values when the data are balanced. The expected values in Table 1 immediately provide us with F ratios for testing all the different hypotheses in the mixed-effects case, as well as in the random-effects case. This is so because, under the hypothesis, we can identify exactly two sums of squares having the same expected value. However, this is
812 Mixed Methods Design
Table 1  ANOVA and Expected Mean Squares Under Model 3 With Mixed Effects and With Random Effects, When the Data Are Balanced

Source of      Sum of         Degrees of       Mean          Expected Mean Square                         Expected Mean Square
Variability    Squares (SS)   Freedom          Square (MS)   (Mixed-Effects Case)                         (Random-Effects Case)
Batches        SS_τ           a − 1            MS_τ          σ²_e + bnσ²_τ                                σ²_e + nσ²_γ + bnσ²_τ
Operators      SS_β           b − 1            MS_β          σ²_e + nσ²_γ + an Σ_{j=1}^{b} β_j²/(b − 1)   σ²_e + nσ²_γ + anσ²_β
Interaction    SS_γ           (a − 1)(b − 1)   MS_γ          σ²_e + nσ²_γ                                 σ²_e + nσ²_γ
Error          SS_e           ab(n − 1)        MS_e          σ²_e                                         σ²_e
not always the case. Sometimes it becomes necessary to use a test statistic that is a ratio of the sum of appropriate mean squares, both in the numerator and in the denominator. The test is then carried out using an approximate F distribution. This procedure is known as the Satterthwaite approximation.

Analysis of Unbalanced Data

The nice formulas and procedures available for mixed- and random-effects models with balanced data are not available in the case of unbalanced data. While some exact procedures can be derived in the case of a single-factor experiment, that is, Model 1, such is not the case when we have a multifactor experiment. One option is to analyze the data with likelihood-based procedures. That is, one can estimate the parameters by maximizing the likelihood and then test the relevant hypotheses with likelihood ratio tests. The computations have to be carried out by available software.

As for estimating the random-effects variances, a point to note is that estimates can be obtained on the basis of either the likelihood or the restricted likelihood. Restricted likelihood is free of the fixed-effects parameters. The resulting estimates of the variances are referred to as restricted maximum likelihood (REML) estimates. REML estimates are preferred to maximum likelihood estimates because the REML estimates reduce (or eliminate) the bias in the estimates.

Such analyses are supported by standard statistical software packages, including SPSS (an IBM company, formerly called PASW® Statistics), Stata, and so forth. Their use is rather straightforward; in fact many excellent books are now available that illustrate the use of these software packages. A partial list of such books is provided in the Further Readings. These books provide the necessary software codes, along with worked-out examples.

Thomas Mathew

See also Analysis of Variance (ANOVA); Experimental Design; Factorial Design; Fixed-Effects Models; Random-Effects Models; Simple Main Effects

Further Readings

Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Publishing.
Montgomery, D. C. (2009). Design and analysis of experiments (7th ed.). New York: Wiley.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-Plus. New York: Springer-Verlag.
Verbeke, G., & Molenberghs, G. (1997). Linear mixed models in practice: A SAS-oriented approach (Lecture Notes in Statistics, Vol. 126). New York: Springer-Verlag.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer-Verlag.
West, B. T., Welch, K. B., & Galecki, A. T. (2007). Linear mixed models: A practical guide using statistical software. New York: Chapman & Hall/CRC.
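As a numerical sketch of the balanced-data analysis for Model 1, the following simulation computes the ANOVA mean squares, the F ratio for testing σ²_τ = 0, and the method-of-moments variance estimates implied by the identities E(MS_τ) = σ²_e + nσ²_τ and E(MS_e) = σ²_e. The batch counts and variance values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Model 1: y_ij = mu + tau_i + e_ij, with a batches and n measurements each.
a, n = 200, 6                      # many batches, so the estimates settle down
mu, s2_tau, s2_e = 50.0, 4.0, 1.0
tau = rng.normal(0.0, np.sqrt(s2_tau), size=a)
y = mu + tau[:, None] + rng.normal(0.0, np.sqrt(s2_e), size=(a, n))

grand_mean = y.mean()
batch_means = y.mean(axis=1)
ms_tau = n * ((batch_means - grand_mean) ** 2).sum() / (a - 1)
ms_e = ((y - batch_means[:, None]) ** 2).sum() / (a * (n - 1))

f_ratio = ms_tau / ms_e            # compare with an F(a-1, a(n-1)) quantile
s2_e_hat = ms_e                    # from E(MS_e)   = sigma2_e
s2_tau_hat = (ms_tau - ms_e) / n   # from E(MS_tau) = sigma2_e + n * sigma2_tau
icc = s2_tau_hat / (s2_tau_hat + s2_e_hat)   # fraction of variance due to batches
print(f_ratio, s2_tau_hat, s2_e_hat, icc)
```

The same moment identities, read off the expected-mean-square columns of Table 1, generalize to Model 3 when the data are balanced; for unbalanced data, REML estimation as discussed above is the standard alternative.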
Mixed Methods Design

Mixed methods design integrates techniques from quantitative and qualitative paradigms to tackle research questions that can best be addressed by mixing these two traditional approaches. As long as 40 years ago, scholars noted that quantitative and qualitative research were not antithetical and that every research process, through practical necessity, should include aspects of both quantitative and qualitative methodology. In order to achieve more useful and meaningful results in any study, it is essential to consider the actual needs and purposes of a research problem to determine the methods to be implemented. The literature on mixed methods design is vast, and contributions have been made by scholars from myriad disciplines in the social sciences. Therefore, this entry is grounded in the work of these scholars. This entry provides a historical overview of mixed methods as a paradigm for research, establishes differences between quantitative and qualitative designs, shows how qualitative and quantitative methods can be integrated to address different types of research questions, and illustrates some implications for using mixed methods. Though still new as an approach to research, mixed methods design is expected to soon dominate the social and behavioral sciences.

The objective of social science research is to understand the complexity of human behavior and experience. The task of the researcher, whose role is to describe and explain this complexity, is limited by his or her methodological repertoire. As tradition shows, different methods often are best applied to different kinds of research. Having the opportunity to apply various methods to a single research question can broaden the dimensions and scope of that research and perhaps lead to a more precise and holistic perspective of human behavior and experience. Research is not knowledge itself, but a process in which knowledge is constructed through step-by-step data gathering.

Data are gathered most typically through two distinct classical approaches—qualitative and quantitative. The use of both these approaches for a single study, although sometimes controversial, is becoming more widespread in social science. Methods are really "design" components that include the following: (a) the relationship between the researcher and research "subjects," (b) details of the experimental environment (place, time, etc.), (c) sampling and data collection methods, (d) data analysis strategies, and (e) knowledge dissemination. The design of a study thus leads to the choice of method strategy. The framework for a study, then, depends on the phenomenon being studied, with the participants and relevant theories informing the research design. Most study designs today need to include both quantitative and qualitative methods for gathering effective data and can thereby incorporate a more expansive set of assumptions and a broader worldview.

Mixing methods (or multiple-methods design) is generally acknowledged as being more pertinent to modern research than using a single approach. Quantitative and qualitative methods may rely more on single data collection methods. For example, whereas a quantitative study may rely on surveys for collecting data, a qualitative study may rely on observations or open-ended questions. However, it is also possible that each of these approaches may use multiple data collection methods. Mixed methods design "triangulates" these two types of methods. When these two methods are used within a single research study, different types of data are combined to answer the research question—a defining feature of mixed methods. This approach is already standard in most major designs. For example, in social sciences, interviews and participant observation form a large part of research and are often combined with other data (e.g., biological markers).

Even though the integration of these two research models is considered fairly novel (emerging significantly in the 1960s), the practice of integrating these two models has a long history. Researchers have often combined these methods, if perhaps only for particular portions of their investigations. Mixed methods research was more common in earlier periods when methods were less specialized and compartmentalized and when there was less orthodoxy in method selection. Researchers observed and cross-tabulated, recognizing that each methodology alone could be inadequate. Synthesis of these two classic approaches in data gathering and interpretation does not necessarily mean that they are wholly combined or that they are uniform. Often they need to be employed separately within a single research design so as not to corrupt either process.

Important factors to consider when one is using mixed methods can be summarized as follows. Mixed methods researchers agree that
there are some resonances between the two paradigms that encourage mutual use. The distinctions between these two methods cannot necessarily be reconciled. Indeed, this "tension" can produce more meaningful interactions and thus new results. Combination of qualitative and quantitative methods must be accomplished productively so that the integrity of each approach is not violated: Methodological congruence needs to be maintained so that data collection and analytical strategies are not jeopardized and can be consistent. The two seemingly antithetical research approaches can be productively combined in a pragmatic, interactive, and integrative design model. The two "classical" methods can complement each other and make a study more successful and resourceful by eliminating the possibility of distortion by strict adherence to a single formal theory.

Qualitative and Quantitative Data

Qualitative and quantitative distinctions are grounded in two contrasting approaches to categorizing and explaining data. Different paradigms produce and use different types of data. Early studies distinguished the two methods according to the kind of data collected, whether textual or numerical. The classic qualitative approach includes study of real-life settings, focus on participants' context, inductive generation of theory, open-ended data collection, analytical strategies based on textual data, and use of narrative forms of analysis and presentation. Basically, the qualitative method refers to a research paradigm that addresses interpretation and socially constructed realities. The classic quantitative approach encompasses hypothesis formulation based on precedence, experiment, control groups and variables, comparative analysis, sampling, standardization of data collection, statistics, and the concept of causality. Quantitative design refers to a research paradigm that hypothesizes relationships between variables in an objective way.

Quantitative methods are related to deductivist approaches, positivism, data variance, and factual causation. Qualitative methods include inductive approaches, constructivism, and textual information. In general, quantitative design relies on comparisons of measurements and frequencies across categories and correlations between variables, whereas the qualitative method concentrates on events within a context, relying on meaning and process. When the two are used together, data can be transformed. Essentially, "qualitized" data can represent data collected using quantitative methods that are converted into narratives that are analyzed qualitatively. "Quantitized" data represent data collected using qualitative methods that can be converted into numerical codes and analyzed statistically. Many research problems are not linear. Purpose drives the research questions. The course of the study, however, may change as it progresses, leading possibly to different questions and the need to alter method design. As in any rigorous research, mixed methods allows for the research question and purpose to lead the design.

Historical Overview

In the Handbook of Qualitative Research, Norman K. Denzin and Yvonna S. Lincoln classified four historic periods in research history for the social sciences. Their classification shows an evolution from strict quantitative methodology, through a gradual implementation and acceptance of qualitative methods, to a merging of the two: (1) traditional (quantitative), 1900 to 1950; (2) modernist, 1950 to 1970; (3) ascendance of constructivism, 1970 to 1990; and (4) pragmatism and the "compatibility thesis" (discussed later), 1990 to the present.

Quantitative methodology, and its paradigm, positivism, dominated methodological orientation during the first half of the 20th century. This "traditional" period, although primarily focused on quantitative methods, did include some mixed method approaches without directly acknowledging implementation of qualitative data: Studies often made extensive use of interviews and researcher observations, as demonstrated in the Hawthorne effect. In the natural sciences, such as biology, paleontology, and geology, goals and methods that typically would be considered qualitative (naturalistic settings, inductive approaches, narrative description, and focus on context and single cases) have been integrated with those that were regarded as quantitative (experimental manipulation, controls and variables, hypothesis testing, theory verification, and measurement and analysis of samples) for more than a century.
After World War II, positivism began to be discredited, which led to its "intellectual" successor, postpositivism. Postpositivism (still largely in the domain of the quantitative method) asserts that research data are influenced by the values of the researchers, the theories used by the researchers, and the researchers' individually constructed realities. During this period, some of the first explicit mixed method designs began to emerge. While there was no distinctive categorization of mixed methods, numerous studies began to employ components of its design, especially in the human sciences. Data obtained from participant observation (qualitative information) was often implemented, for example, to explain quantitative results from a field experiment.

The subsequent "modernist" period, or "Golden Age" (1950–1970), has been demarcated, then, by two trends: positivism's losing its stronghold and research methods that began to incorporate "multimethods." The discrediting of positivism resulted in methods that were more radical than those of postpositivism. From 1970 to 1985—defined by some scholars as the "qualitative revolution"—qualitative researchers became more vocal in their criticisms of pure quantitative approaches and proposed new methods associated with constructivism, which began to gain wider acceptance. In the years from 1970 to 1990, qualitative methods, along with mixed method syntheses, were becoming more eminent. In the 1970s, the combination of data sources and multiple methods was becoming more fashionable, and new paradigms, such as interpretivism and naturalism, were gaining precedence and validity.

In defense of a "paradigm of purity," a period known as the paradigm wars took place. Different philosophical camps held that quantitative and qualitative methods could not be combined; such a "blending" would corrupt accurate scientific research. Compatibility between quantitative and qualitative methods, according to these proponents of quantitative methods, was impossible due to the distinction of the paradigms. Researchers who combined these methods were doomed to fail

against the prejudices and restrictions of positivism and postpositivism. They maintained that mixed methods were already being employed in numerous studies.

The period of pragmatism and compatibility (1990 to the present) as defined by Denzin and Lincoln constitutes the establishment of mixed methods as a separate field. Mixed methodologists are not representative of either the traditional (quantitative) or "revolutionary" (qualitative) camps. In order to validate this new field, mixed methodologists had to show a link between epistemology and method and demonstrate that quantitative and qualitative methods were compatible. One of the main concerns in mixing methods was to determine whether it was also viable to mix paradigms—a concept that circumscribes an interface, in practice, between epistemology (historically learned assumptions) and methodology. A new paradigm, pragmatism, effectively combines these two approaches and allows researchers to implement them in a complementary way.

Pragmatism addresses the philosophical aspect of a paradigm by concentrating on what works. Paradigms, under pragmatism, do not represent the primary organizing principle for mixed methods practice. Believing that paradigms (socially constructed) are malleable assumptions that change through history, pragmatists make design decisions based on what is practical, contextually compatible, and consequential. Decisions about methodology are not based solely on congruence with established philosophical assumptions but are founded on a methodology's ability to further the particular research questions within a specified context. Because of the complexity of most contexts under research, pragmatists incorporate a dual focus between sense making and value making. Pragmatic research decisions, grounded in the actual context being studied, lead to a logical design of inquiry that has been termed fitness for purpose. Mixed methodologies are the result. Pragmatism demonstrates that singular paradigm beliefs are not intrinsically connected to specific methodologies; rather, methods and techniques are developed from multi-
because of the inherent differences in the underly- ple paradigms.
ing systems. Qualitative researchers defined such Researchers began to believe that the concept
‘‘purist’’ traditions as being based on ‘‘received’’ of a single best paradigm was a relic of the past
paradigms (paradigms preexisting a study that are and that multiple, diverse perspectives were criti-
automatically accepted as givens), and they argued cal to addressing the complexity of a pluralistic
816 Mixed Methods Design
society. They proposed what they defined as the dialectical stance: Opposing views (paradigms) are valid and provide for more realistic interaction. Multiple paradigms, then, are considered a foundation for mixed methods research. Researchers, therefore, need to determine which paradigms are best for a particular mixed methods design for a specific study.

Currently, researchers in social and behavioral studies generally comprise three groups: quantitatively oriented researchers, primarily interested in numerical and statistical analyses; qualitatively oriented researchers, primarily interested in analysis of narrative data; and mixed methodologists, who are interested in working with both quantitative and qualitative data. The differences between the three groups (particularly between quantitatively and qualitatively oriented researchers) have often been characterized as the paradigm wars. These three movements continue to evolve simultaneously, and all three have been practiced concurrently. Mixed methodology is in its adolescent stage as scholars work to determine how best to integrate different methods.

Integrated Design Models

A. Tashakkori and C. Teddlie have referred to three categories of multiple-method designs: multimethod research, mixed methods research, and mixed model research. The terms multimethod and mixed method are often confused, but they actually refer to different processes. In multimethod studies, research questions use both quantitative and qualitative procedures, but the process is applied principally to quantitative studies. This method is most often implemented in an interrelated series of projects whose research questions are theoretically driven. Multimethod research is essentially complete in itself and uses simultaneous and sequential designs.

Mixed methods studies, the primary concern of this entry, encompass both mixed methods and mixed model designs. This type of research implements qualitative and quantitative data collection and analysis techniques in parallel phases or sequentially. Mixed methods (combined methods) are distinguished from mixed model designs (combined quantitative and qualitative methods in all phases of the research). In mixed methods design, the "mixing" occurs in the type of questions asked and in the inferences that evolve. Mixed model research is implemented in all stages of the study (questions, methods, data collection, analysis, and inferences).

The predominant approach to mixing methods encompasses two basic types of design: component and integrated. In component designs, methods remain distinct and are used for discrete aspects of the research. Integrative design incorporates substantial integration of methods. Although typologies help researchers organize actual use of both methods, use of typologies as an organizing tool demonstrates a lingering linear concept that refers more to the duality of quantitative and qualitative methods than to the recognition and implementation of multiple paradigms. Design components (based on objectives, frameworks, questions, and validity strategies), when organized by typology, are perceived as separate entities rather than as interactive parts of a whole. This kind of typology illustrates a pluralism that "combines" methods without actually integrating them.

Triangulation and Validity

Triangulation is a method that combines different theoretical perspectives within a single study. As applied to mixed methods, triangulation determines an unknown point from two or more known points, that is, collection of data from different sources, which improves validity of results. In The Research Act, Denzin argued that a hypothesis explored under various methods is more valid than one tested under only one method. Triangulation in methods, where differing processes are implemented, maximizes the validity of the research: Convergence of results from different measurements enhances validity and verification. It was also argued that using different methods, and possibly a faulty commonality of framework, could lead to increased error in results. Triangulation may not increase validity but does increase consistency in methodology: Though empirical results may be conflicting, they are not inherently damaging but render a more holistic picture.

Triangulation allows for the exploration of both theoretical and empirical observation (inductive and deductive), two distinct types of knowledge that can be implemented as a methodological
"map" and are logically connected. A researcher can structure a logical study, and the tools needed for organizing and analyzing data, only if the theoretical framework is established prior to empirical observations. Triangulation often leads to a situation in which different findings do not converge or complement each other. Divergence of results, however, may lead to additional valid explanations of the study. Divergence, in this case, can be reflective of a logical reconciliation of quantitative and qualitative methods. It can lead to a productive process in which initial concepts need to be modified and adapted to differing study results.

Recently, two new approaches for mixing methods have been introduced: an interactive approach, in which the design components are integrated and mutually influence each other, and a conceptual approach, using an analysis of the fundamental differences between quantitative and qualitative research. The interactive method, as employed in architecture, engineering, and art, is neither linear nor cyclic. It is a schematic method that addresses data in a mutually ongoing arrangement. This design model is a tool that focuses on analyzing the research question rather than providing a template for creating a study type. This more qualitative approach to mixed methods design emphasizes particularity, context, comprehensiveness, and the process by which a particular combination of qualitative and quantitative components develops in practice, in contrast to the categorization and comparison of data typical of the pure quantitative approach.

Implications for Mixed Methods

As the body of research regarding the role of the environment and its impact on the individual has developed, the status and acceptance of mixed methods research in many of the applied disciplines is accelerating. This acceptance has been influenced by the historical development of these disciplines and an acknowledgment of a desire to move away from traditional paradigms of positivism and postpositivism. The key contributions of mixed methods have been to an understanding of individual factors that contribute to social outcomes, the study of social determinants of medical and social problems, the study of service utilization and delivery, and translational research into meaningful practice.

Mixed methods research may bridge postmodern critiques of scientific inquiry and the growing interest in qualitative research. Mixed methods research provides an opportunity to test research questions, hypotheses, and theory and to acknowledge the phenomena of human experience. Quantitative methods support the ability to generalize findings to the general population. However, quantitative approaches that are well regarded by researchers may not necessarily be comprehensible or useful to lay individuals. Qualitative approaches can help contextualize problems in narrative forms and thus can be more meaningful to lay individuals. Mixing these two methods offers the potential for researchers to understand, contextualize, and develop interventions.

Mixed methods have been used to examine and implement a wide range of research topics, including instrument design, validation of constructs, the relationship of constructs, and theory development or disconfirmation. Mixed methods are rooted, for one example, in the framework of feminist approaches whereby the study of participants' lives and personal interpretations of their lives has implications in research. In terms of data analysis, content analysis is a way for scientists to confirm hypotheses and to gather qualitative data from study participants through different methods (e.g., grounded theory, phenomenological, narrative). The application of triangulation methodology is invaluable in mixed methods research.

While there are certainly advantages to employing mixed methods in research, their use also presents significant challenges. Perhaps the most significant issue to consider is the amount of time associated with the design and implementation of mixed methods. In addition to time restrictions, costs or barriers to obtaining funding to carry out mixed methods research are a consideration.

Conclusion

Rather than choosing one paradigm or method over another, researchers often use multiple and mixed methods. Implementing these newer combinations of methods better supports the modern complexities of social behavior and the changing
818 Mixed Model Design
measures on the same subjects. These are within-subjects designs for two factors with two or more levels. In the two-way mixed model design, two factors, one within-subjects and one between-subjects, are always included in the model. Each factor has two or more levels. For example, in a study to determine the preferred time of day for undergraduate and graduate college students to exercise at a gym, time of day would be a within-subjects factor with three levels: 5:00 a.m., 1:00 p.m., and 9:00 p.m.; and student classification as undergraduate or graduate would be two levels of a between-subjects factor. The dependent variable for such a study could be a score on a workout preference scale. A design with three levels on a random factor and two levels on a fixed factor is written as a 2 × 3 mixed model design.

Fixed and Random Effects

Fixed Effects

Fixed effects, also known as between-subjects effects, are those in which each subject is a member of either one group or another, but not more than one group. All levels of the factor may be included, or only selected levels. In other words, subjects are measured on only one of the designated levels of the factor, such as undergraduate or graduate. Other examples of fixed effects are gender, membership in a control group or an experimental group, marital status, and religious affiliation.

Random Effects

Random effects, also known as within-subjects effects, are those in which measures of each level of a factor are taken on each subject, and the effects may vary from one measure to another over the levels of the factor. Variability in the dependent variable can be attributed to differences in the random factor. In the previous example, all subjects would be measured across all levels of the time-of-day factor for exercising at a gym. In a study in which time is a random effect and gender is a fixed effect, the interaction of time and gender is also a random effect. Other examples of random effects are number of trials, in which each subject experiences each trial or receives repeated doses of medication. Random effects are the measures from the repeated trials, measures after the time intervals of some activities, or repeated measures of some function such as blood pressure, strength level, endurance, or achievement. The mixed model design may be applied when the sample comprises large units, such as school districts, military bases, and universities, and the variability among the units, rather than the differences in means, is of interest. Examining random effects allows researchers to make inferences to a larger population.

Assumptions

As with other inferential statistical procedures, the data for a mixed model analysis must meet certain statistical assumptions if trustworthy generalizations are to be made from the sample to the larger population. Assumptions apply to both the between- and within-subjects effects. The between-subjects assumptions are the same as those in a standard ANOVA: independence of scores; normality; and equal variances, known as homogeneity of variance. Assumptions for the within-subjects effects are independence of scores and normality of the distribution of scores in the larger population. The mixed model also assumes that there is a linear relationship between the dependent and independent variables. In addition, the complexity of the mixed model design requires the assumption of equality of variances of the difference scores for all pairs of scores at all levels of the within-subjects factor and equal covariances for the between-subjects factor. This assumption is known as the sphericity assumption. Sphericity is especially important to the mixed model analysis.

Sphericity Assumption

The sphericity assumption may be thought of as the homogeneity-of-variance assumption for repeated measures. This assumption can be tested by conducting correlations between and among all levels of repeated measures factors and using Bartlett's test of sphericity. A significant probability level (p value) means that the data are correlated and the sphericity assumption is violated. However, if the data are uncorrelated, then sphericity can be assumed. Multivariate ANOVA (MANOVA) procedures do not require that the sphericity
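The sphericity condition defined above (equal variances of the difference scores for every pair of within-subjects levels) can be checked directly from raw data. A minimal sketch, using hypothetical workout-preference scores for five subjects at the three times of day in the earlier example (the scores themselves are invented for illustration):

```python
import itertools
import statistics

def difference_score_variances(scores):
    # Sphericity requires the variances of the difference scores to be
    # (approximately) equal across every pair of within-subjects levels.
    n_levels = len(scores[0])
    variances = {}
    for i, j in itertools.combinations(range(n_levels), 2):
        diffs = [row[i] - row[j] for row in scores]
        variances[(i, j)] = statistics.variance(diffs)
    return variances

# One row per subject; columns are the 5:00 a.m., 1:00 p.m., and
# 9:00 p.m. levels of the within-subjects factor (hypothetical data).
scores = [
    (12, 15, 14),
    (10, 13, 12),
    (14, 18, 15),
    (9, 12, 11),
    (11, 14, 13),
]
print(difference_score_variances(scores))
```

Roughly equal pairwise variances are consistent with sphericity; markedly unequal variances suggest the assumption is violated, in which case a correction or a MANOVA approach may be preferable.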
between the second measure and the third measure would be another parameter, and so forth. Like the AR(1) model, the Toeplitz is a suitable choice for evenly spaced measures.

First Order: Ante-Dependence

The first-order ante-dependence model is a more general model than the Toeplitz or the AR(1) models. Covariances are dependent on the product of the variances at the two points of interest, and correlations are weighted by the variances of the two points of interest. For example, a correlation of .70 for points 1 and 2 and a correlation of .20 for points 2 and 3 would produce a correlation of .14 for points 1 and 3. This model requires 2n − 1 parameters to be estimated, where n is the number of repeated measures for a factor.

Evaluating Covariance Models

The data should be examined prior to the analysis to verify whether the mixed model design or the standard repeated measures design is the appropriate procedure. Assuming that the mixed model procedure is appropriate for the data, the next step is to select the covariance structure that best models the data. The sphericity test alone is not an adequate criterion by which to select a model. A comparison of information criteria for several probable models with different covariance structures that uses the maximum likelihood and restricted maximum likelihood estimation methods is helpful in selecting the best model.

One procedure for evaluating a covariance structure involves creating a mixed model with an unstructured covariance matrix and examining graphs of the error covariance and correlation matrices. Using the residuals, the error covariances or correlations can be plotted separately for each start time, as in a trend analysis. For example, declining correlations or covariances with increasing time lapses between measures indicate that an AR(1) or ante-dependence structure is appropriate. For trend analysis, trends with the same mean have approximately the same variance. This pattern can also be observed on a graph with lines showing multiple trends. If the means or the lines on the graph are markedly different and the lines do not overlap, a covariance structure that accommodates variance heterogeneity is appropriate.

Another procedure for evaluating a covariance matrix involves creating several different probable models using both the maximum likelihood and the restricted maximum likelihood methods of parameter estimation. The objective is to select the covariance structure that gives the best fit of the data to the model. Information criterion measures, produced as part of the results of each mixed model procedure, indicate a relative goodness of fit of the data, thus providing guidance in model evaluation and selection. The information criteria measures for the same data set under different models (different covariance structures) and estimated with different methods can be compared; usually, the information criterion with the smallest value indicates a better fit of the data to the model. Several different criterion measures can be produced as part of the statistical analysis. It is not uncommon for the information criteria measures to be very close in value.

Hypothesis Testing

The number of null hypotheses formulated for a mixed model design depends on the number of factors in the study. A null hypothesis should be generated for each factor and for every combination of factors. A mixed model analysis with one fixed effect and one random effect generates three null hypotheses. One null hypothesis would be stated for the fixed effects; another null hypothesis would be stated for the random effects; and a third hypothesis would be stated for the interaction of the fixed and random effects. If more than one fixed or random factor is included, multiple interactions may be of interest. The mixed model design allows researchers to select only the interactions in which they are interested.

The omnibus F test is used to test each null hypothesis for mean differences across levels of the main effects and interaction effects. The sample means for each factor main effect are compared to ascertain whether the difference between the means can be attributed to the factor rather than to chance. Interaction effects are tested to ascertain whether a difference between the means of the fixed effects between subjects and the means of each level of the random effects within subjects is significantly different from zero. In other words,
the data are examined to ascertain the extent to which changes in one factor are observed across levels of the other factor.

Interpretation of Results

Several tables of computer output are produced for a mixed model design. The tables allow researchers to check the fit of the data to the model selected and interpret results for the null hypotheses.

Model Dimension Table

A model dimension table shows the fixed and random effects and the number of levels for each, the type of covariance structure selected, and the number of parameters estimated. For example, AR(1) and compound symmetry covariance matrices estimate two parameters, whereas the number of parameters for an unstructured (UN) matrix varies based on the number of repeated measures for a factor.

Information Criteria Table

Goodness-of-fit statistics are displayed in an information criteria table. Information criteria can be compared when different covariance structures and/or estimation methods are specified for the model. The tables resulting from different models can be used to compare one model with another. Information criteria are interpreted such that a smaller value means a better fit of the data to the model.

Fixed Effects, Random Effects, and Interaction Effects

Parameter estimates for the fixed, random, and interaction effects are presented in separate tables. Results of the fixed effects allow the researcher to reject or retain the null hypothesis of no relationship between the fixed factors and the dependent variable. The level of significance (p value) for each fixed effect will indicate the extent to which the fixed factor or factors have an effect different from zero on the dependent variable.

A table of estimates of covariance parameters indicates the extent to which random factors have an effect on the dependent variable. Random effects are reported as variance estimates. The level of significance allows the researcher to reject or retain the null hypothesis that the variance of the random effect is zero in the population. A nonsignificant random effect can be dropped from the model, and the analysis can be repeated with one or more other random effects.

Interaction effects between the fixed and random effects are also included as variance estimates. Interaction effects are interpreted on the basis of their levels of significance. For all effects, if the 95% confidence interval contains zero, the respective effects are nonsignificant. The residual parameter estimates the unexplained variance in the dependent variable after controlling for fixed effects, random effects, and interaction effects.

Advantages

The advantages of the mixed model compensate for the complexity of the design. A major advantage is that the requirement of independence of individual observations does not need to be met as in the general linear model or regression procedures. The groups formed for higher-level analysis, such as in nested designs and repeated measures, are assumed to be independent; that is, they are assumed to have similar covariance structures. In the mixed model design, a wide variety of covariance structures may be specified, thus enabling the researcher to select the covariance structure that provides the model of best fit. Equal numbers of repeated observations for each subject are not required, making the mixed model design desirable for balanced and unbalanced designs. Measures for all subjects need not be taken at the same points in time. All existing data are incorporated into the analysis even though there may be missing data points for some cases. Finally, mixed model designs, unlike general linear models, can be applied to data at a lower level that are contained (nested) within a higher level, as in hierarchical linear models.

Marie Kraska

See also Hierarchical Linear Modeling; Latin Square Design; Sphericity; Split-Plot Factorial Design
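The model-selection step described in this entry (comparing information criteria across candidate covariance structures and preferring the smallest value) can be sketched as follows. The parameter counts follow the structures discussed above; the log-likelihood values are invented for illustration and would in practice come from maximum likelihood or restricted maximum likelihood fits of the same data under each structure:

```python
def aic(log_likelihood, n_params):
    # Akaike information criterion: smaller values indicate a better
    # fit of the data to the model.
    return 2 * n_params - 2 * log_likelihood

n = 4  # number of repeated measures for the factor

# Parameters estimated by each covariance structure.
param_count = {
    "compound symmetry": 2,
    "AR(1)": 2,
    "ante-dependence": 2 * n - 1,      # 2n - 1 parameters
    "unstructured": n * (n + 1) // 2,  # grows with the number of measures
}

# Hypothetical log-likelihoods for the same data set under each structure.
log_lik = {
    "compound symmetry": -412.3,
    "AR(1)": -405.1,
    "ante-dependence": -404.6,
    "unstructured": -401.8,
}

aic_scores = {name: aic(log_lik[name], k) for name, k in param_count.items()}
best = min(aic_scores, key=aic_scores.get)
print(best)  # AR(1)
```

With these numbers the unstructured matrix has the highest log-likelihood, but its extra parameters are penalized, so AR(1) yields the smallest information criterion; this is the sense in which the richest covariance structure is not automatically the best-fitting model.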
Mode 823
Together with the mean and the median, the mode is one of the main measurements of the central tendency of a sample or a population. The mode is particularly important in social research because it is the only measure of central tendency that is relevant for any data set. That being said, it rarely receives a great deal of attention in statistics courses. The purpose of this entry is to identify the role of the mode in relation to the median and the mean for summarizing various types of data.

it is fairly easy to see that the mode is 3. However, if one were to roll the die 40 times and list the results, the mode is less obvious:

{6, 5, 5, 4, 4, 1, 6, 6, 3, 4, 4, 4, 2, 5, 5, 4, 4, 1, 2, 1, 4, 5, 5, 1, 3, 5, 2, 4, 2, 4, 2, 4, 4, 6, 5, 2, 1, 1, 4, 5}.

In Table 1, the data are grouped by frequency, making it obvious that the mode is 4.
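The frequency grouping that Table 1 performs can be reproduced directly with a multiset counter; a minimal Python sketch using the 40 rolls listed above (the frequency of 13 is computed here, not stated in the entry):

```python
from collections import Counter

rolls = [6, 5, 5, 4, 4, 1, 6, 6, 3, 4, 4, 4, 2, 5,
         5, 4, 4, 1, 2, 1, 4, 5, 5, 1, 3, 5, 2, 4,
         2, 4, 2, 4, 4, 6, 5, 2, 1, 1, 4, 5]

counts = Counter(rolls)          # groups the data by frequency, as in Table 1
mode, frequency = counts.most_common(1)[0]
print(mode)                      # 4
```

Counter.most_common returns (value, count) pairs in descending order of frequency, so the first pair gives the mode; statistics.multimode would return every value tied for the highest count when the mode is not unique.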
would be the total net worth of a randomly selected sample of individuals. So much variability is possible in the possible outcomes that unless the data are grouped into discrete categories (say increments of $5,000 or $10,000) the mode does not summarize the central tendency of the data well by itself.

Interval data are similar to ratio data in that it is possible to carry out detailed mathematical operations on them. As a result, it is possible to take the mean and the median as measures of central tendency. However, interval data lack a true 0 value. For example, in measuring household size, it is conceivable that a household can possess very large numbers of members, but generally this is rare. Many households have fewer than 10 members; however, some modern extended families might run well into double digits. At the extreme, it is possible to observe medieval royal or aristocratic households with potentially hundreds of members. However, it is nonsensical to state that a household has zero members. It is also nonsensical to say that a specific household has 1.75 members. However, it is possible to say that the mean household size in a geographic region (a country, province, or city) is 2.75 or 2.2 or some other number. For interval data, the mode, the median, and the mean frequently provide valuable but different information about the central tendencies of the data. The mean may be heavily influenced by large low-prevalence values in the data; however, the median and the mode are much less influenced by them.

As an example of the role of the mode in summarizing interval data, if a researcher were interested in comparing the household size on different streets, A Street and B Street, he or she might visit both and record the following household sizes:

A Street: {3, 1, 6, 2, 1, 1, 2, 3, 2, 4, 2, 1, 4};
B Street: {2, 5, 3, 6, 7, 9, 3, 4, 1, 2, 1}.

Comparing the measures of central tendency (A Street: Mean = 2.5, Median = 2, Mode = 1 and 2; B Street: Mean = 3.8, Median = 3, Mode = 3) gives a clearer picture of the nature of the streets than any one measure of central tendency in isolation.

For data in ordinal scales, not only is there no absolute zero, but one can also rank the elements only in order of value. For example, a person could be asked to rank items on a scale of 1 to 5 in terms of his or her favorite. Likert-type scales are an example of this sort of data. A value of 5 is greater than a value of 4, but an increment from 4 to 5 does not necessarily represent the same increase in preference that an increase from 1 to 2 does. Because of these characteristics, reporting the median and the mode for this type of data makes sense, but the mean does not.

In the case of nominal-scale data, the mode is the only meaningful measure of central tendency. Nominal data tell the analyst nothing about the order of the data. In fact, data values do not even need to be labeled as numbers. One might be interested in which number a randomly selected group of hockey players at a hockey camp wear on their jersey back home in their regular hockey league. One could select from two groups:

Group 1: {1, 3, 4, 4, 4, 7, 7, 10, 11, 99};
Group 2: {1, 4, 8, 9, 11, 44, 99, 99, 99, 99}.

For these groups, taking the mean and the median are meaningless as measures of central tendency (the number 11 does not represent more value than 4). However, the mode of Group 1 is 4, and the mode of Group 2 is 99. With some background information about hockey, the analyst could hypothesize which populations the two groups are drawn from. Group 2 appears to be made up of a younger group of players whose childhood hero is Wayne Gretzky (number 99), and Group 1 is likely made up of an older group of fans of Bobby Orr (number 4).

Another interesting property of the mode is that the data do not actually need to be organized as numbers, nor do they need to be translated into numbers. For example, an analyst might be interested in the first names of CEOs of large corporations in the 1950s. Examining a particular newspaper article, the analyst might find the following names:

Names: {Ted, Gerald, John, Martin, John, Peter, Phil, Peter, Simon, Albert, John}.

In this example, the mode would be John, with three listings. As this example shows, the mode is
826 Models
particularly useful for textual analysis. Unlike the f30,000, 30,000, 30,000, 30,000, 30,000,
median and the mean, it is possible to take counts 15,000, 15,000, 15,000, 5,000, 5,000g:
of the occurrence of words in documents, speech,
or database files and to carry out an analysis from
Now the mode is $30,000, the median is
such a starting point.
(30,000 þ 15,000)/2 ¼ $22,500, and the mean ¼
Examination of the variation of nominal data
$20,500.
is also possible by examining the frequencies
Finally, the following observations are possible:
of occurrences of entries. In the above example, it is possible to summarize the results by stating that John represents 3/11 (27.3%) of the entries, Peter represents 2/11 (18.2%) of the entries, and that each other name represents 1/11 (9.1%) of the entries. Through a comparison of frequencies of occurrence, a better picture of the distribution of the entries emerges even if one does not have access to other measures of central tendency.

A Tool for Measuring the Skew of a Distribution

Skew in a distribution is a complex topic; however, comparing the mode to the median and the mean can be useful as a simple method of determining whether data in a distribution are skewed. A simple example is as follows. In a workplace a company offers free college tuition for the children of its employees, and the administrator of the plan is interested in the amount of tuition that the program may have to pay. For simplicity, take three different levels of tuition. Private university tuition is set at $30,000, out-of-state-student public university tuition is set at $15,000, and in-state tuition is set at $5,000.

In this example, 10 students qualify for tuition coverage for the current year. The distribution of tuition amounts for each student is as follows:

{5,000, 5,000, 5,000, 5,000, 5,000, 15,000, 15,000, 15,000, 30,000, 30,000}.

Hence, the mode is $5,000. The median is equal to (15,000 + 5,000)/2 = $10,000. The mean = $13,000. The tuition costs cluster at the low end with a tail toward high tuition; because the mean ($13,000) exceeds the median ($10,000), which in turn exceeds the mode ($5,000), the distribution has a positive skew.

The skew is greatly reduced if the tuition payments are instead as follows:

{30,000, 30,000, 30,000, 15,000, 15,000, 15,000, 15,000, 15,000, 5,000, 5,000}.

In this case, the mode is $15,000, the median is $15,000, and the mean is $17,500. The distribution has much less skew and is nearly symmetric. The concept of skew can be dealt with using very complex methods in mathematical statistics; however, often simply reporting the mode together with the median and the mean is a useful method of forming first impressions about the nature of a distribution.

Gregory P. Butler

See also Central Tendency, Measures of; Interval Scale; Likert Scaling; Mean; Median; Nominal Scale

Further Readings

Dodge, Y. (1993). Statistique: Dictionnaire encyclopédique [Statistics: Encyclopedic dictionary]. Paris: Dunod.
Magnello, M. E. (2009). Karl Pearson and the establishment of mathematical statistics. International Statistical Review, 77(1), 3–29.
Pearson, K. (1895). Contribution to the mathematical theory of evolution: Skew variation in homogenous material. Philosophical Transactions of the Royal Society of London, 186(1), 344–434.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
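The mode–median–mean comparison above is easy to reproduce; the following sketch (an illustration added here, not part of the original entry) computes all three statistics for the two tuition distributions using only Python's standard library.

```python
# Mode, median, and mean for the two tuition distributions.
# When the mean exceeds the median, which exceeds the mode, the data
# are skewed toward the high end; near-equality suggests symmetry.
from statistics import mean, median, mode

first = [5_000] * 5 + [15_000] * 3 + [30_000] * 2
second = [30_000] * 3 + [15_000] * 5 + [5_000] * 2

for label, data in [("first", first), ("second", second)]:
    print(f"{label}: mode={mode(data)}, median={median(data)}, mean={mean(data)}")
# first:  mode=5000,  median=10000.0, mean=13000 (skewed)
# second: mode=15000, median=15000.0, mean=17500 (nearly symmetric)
```

Reporting these three values together is often enough to form the "first impression" of skew that the entry describes.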
MODELS

It used to be said that models were dispensable aids to formulating and understanding scientific theories, perhaps even props for poor thinkers. This negative view of the cognitive value of models in science contrasts with today's view that they are an essential part of the development of theories, and more besides. Contemporary studies of scientific practice make it clear that models play genuine and indispensable cognitive roles in science, providing a basis for scientific reasoning. This entry describes types and functions of models commonly used in scientific research.

Types of Models

Given that just about anything can be a model of something for someone, there is an enormous diversity of models in science. Max Wartofsky has referred to the many senses of the word model that stem from this bewildering variety as the "model muddle." It is not surprising, then, that the wide diversity of models in science has not been captured by some unitary account. However, philosophers such as Max Black, Peter Achinstein, and Rom Harré have provided useful typologies that impose some order on the variety of available models. Here, discussion is confined to four different types of model that are used in science: scale models, analogue models, mathematical models, and theoretical models.

Scale Models

As their name suggests, scale models involve a change of scale. They are always models of something, and they typically reduce selected properties of the objects they represent. Thus, a model airplane stands as a miniaturized representation of a real airplane. However, scale models can also stand as a magnified representation of an object, such as a small insect. Although scale models are constructed to provide a good resemblance to the object or property being modeled, they represent only selected relevant features of the object. Thus, a model airplane will almost always represent the fuselage and wings of the real airplane being modeled, but it will seldom represent the interior of the aircraft. Scale models are a class of iconic models because they literally depict the features of interest in the original. However, not all iconic models are scale models; consider, for example, James Watson and Francis Crick's physical model of the helical structure of the DNA molecule. Scale models are usually built in order to present the properties of interest in the original object in an accessible and manipulable form. A scale model of an aircraft prototype, for example, may be built to test its basic aerodynamic features in a wind tunnel.

Analogue Models

Analogue, or analogical, models express relevant relations of analogy between the model and the reality being represented. Analogue models are important in the development of scientific theories. The requirement for analogical modeling often stems from the need to learn about the nature of hidden entities postulated by a theory. Analogue models also serve to assess the plausibility of our new understanding of those entities.

Analogical models employ the pragmatic strategy of conceiving of unknown causal mechanisms in terms of what is already familiar and well understood. Well-known examples of models that have resulted from this strategy are the molecular model of gases, based on an analogy with billiard balls in a container; the model of natural selection, based on an analogy with artificial selection; and the computational model of the mind, based on an analogy with the computer.

To understand the nature of analogical modeling, it is helpful to distinguish between a model, the source of the model, and the subject of the model. From the known nature and behavior of the source, one builds an analogue model of the unknown subject or causal mechanism. To take the biological example just noted, Charles Darwin fashioned his model of the subject of natural selection by reasoning analogically from the source of the known nature and behavior of the process of artificial selection. In this way, analogue models play an important creative role in theory development. However, this role requires the source from which the model is drawn to be different from the subject that is modeled. For example, the modern computer is a well-known source for the modeling of human cognition, although our cognitive apparatus is not generally thought to be a real computer. Models in which the source and the subject are different are sometimes called paramorphs. Models in which the source and the subject are the same are sometimes called homeomorphs. The paramorph can be an iconic, or pictorial, representation of real or imagined
Models and Theories

The relationship between models and theories is difficult to draw, particularly given that they can both be conceptualized in different ways. Some have suggested that theories are intended as true descriptions of the real world, whereas models need not be about the world, and therefore need not be true. Others have drawn the distinction by claiming that theories are more abstract and general than models. For example, evolutionary psychological theory can be taken as a prototype for the more specific models it engenders, such as those of differential parental investment and the evolution of brain size. Relatedly, Ronald Giere has argued that a scientific theory is best understood as comprising a family of models and a set of theoretical hypotheses that identify things in the world that apply to a model in the family.

Yet another characterization of models takes them to be largely independent of theories. In arguing that models are "autonomous agents" that mediate between theories and phenomena, Margaret Morrison contends that they are not fully derived from theory or data. Instead, they are technologies that allow one to connect abstract theories with empirical phenomena. Some have suggested that the idea of models as mediators does not apply to the behavioral and biological sciences because there is no appreciable gap between fundamental theory and phenomena in which models can mediate.

The Functions of Models

Representation

Models can variously be used for the purposes of systematization, explanation, prediction, control, calculation, derivation, and so on. In good part, models serve these purposes because they can often be taken as devices that represent parts of the world. In science, representation is arguably the main function of models. However, unlike scientific theories, models are generally not thought to be the sort of things that can be true or false. Instead, we may think of models as having a kind of similarity relationship with the object that is being modeled. With analogical models, for example, the similarity relationship is one of analogy. It can be argued that in science, models and theories are different representational devices. Consistent with this distinction between models and theories, William Wimsatt has argued that science often adopts a deliberate strategy of employing false models as a means by which we can obtain truer theories. This is done by localizing errors in models in order to eliminate other errors in theories.

Abstraction and Idealization

It is often said that models provide a simplified depiction of the complex domains they represent. The simplification is usually achieved through two processes: abstraction and idealization. Abstraction involves the deliberate elimination of those properties of the target that are not considered essential to the understanding of that target. This can be achieved in various ways; for example, one can ignore the properties, even though they continue to exist; one can eliminate them in controlled experiments; or one can set the values of unwanted variables to zero in simulations. By contrast, idealization involves transforming a property in a system into one that is related, but which possesses desirable features introduced by the modeler. Taking a spheroid object to be spherical, representing a curvilinear relation in linear form, and assuming that an agent is perfectly rational are all examples of idealization. Although the terms abstraction and idealization are sometimes used interchangeably, they clearly refer to different processes. Each can take place without the other, and idealization can in fact take place without simplification.

Brian D. Haig

See also A Priori Monte Carlo Simulation; Exploratory Factor Analysis; General Linear Model; Hierarchical Linear Modeling; Latent Growth Modeling; Multilevel Modeling; Scientific Method; Structural Equation Modeling

Further Readings

Abrantes, P. (1999). Analogical reasoning and modeling in the sciences. Foundations of Science, 4, 237–270.
Black, M. (1962). Models and metaphors: Studies in language and philosophy. Ithaca, NY: Cornell University Press.
Giere, R. (1988). Explaining science. Chicago: University of Chicago Press.
Harré, R. (1976). The constructive role of models. In L. Collins (Ed.), The use of models in the social sciences (pp. 16–43). London: Tavistock.
MacCallum, R. C. (2003). Working with imperfect models. Multivariate Behavioral Research, 38, 113–139.
Morgan, M., & Morrison, M. (Eds.). (1999). Models as mediators. Cambridge, UK: Cambridge University Press.
Suppes, P. (1962). Models of data. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, methodology, and philosophy of science: Proceedings of the 1960 International Congress (pp. 252–261). Stanford, CA: Stanford University Press.
Wartofsky, M. (1979). Models: Representation and the scientific understanding. Dordrecht, the Netherlands: Reidel.
Wimsatt, W. C. (1987). False models as means to truer theories. In M. Nitecki & A. Hoffman (Eds.), Neutral models in biology (pp. 23–55). London: Oxford University Press.

MONTE CARLO SIMULATION

A Monte Carlo simulation is a methodological technique used to evaluate the empirical properties of some quantitative method by generating random data from a population with known properties, fitting a particular model to the generated data, collecting relevant information of interest, and replicating the entire procedure a large number of times (e.g., 10,000) in order to obtain properties of the fitted model under the specified condition(s). Monte Carlo simulations are generally used when analytic properties of the model under the specified conditions are not known or are unattainable. Such is often the case when no closed-form solutions exist, either theoretically or given the current state of knowledge, for the particular method under the set of conditions of interest. When analytic properties are known for a particular set of conditions, Monte Carlo simulation is unnecessary. Because of the large number of calculations necessary, in practice Monte Carlo methods are essentially always implemented with one or more computers.

A Monte Carlo simulation study is a systematic investigation of the properties of some quantitative method under a variety of conditions in which a set of Monte Carlo simulations is performed. Thus, a Monte Carlo simulation study consists of the findings from applying a Monte Carlo simulation to a variety of conditions. The goal of a Monte Carlo simulation study is often to make general statements about the various properties of the quantitative method under a wide range of situations. So as to discern the properties of the quantitative method generally, and to search for interaction effects in particular, a fully crossed factorial design is often used, and a Monte Carlo simulation is performed for each combination of the situations in the factorial design. After the data have been collected from the Monte Carlo simulation study, analysis of the data is necessary so that the properties of the quantitative procedure can be discerned. Because such a large number of replications (e.g., 10,000) are performed for each condition, the summary findings from the Monte Carlo simulations are often regarded as essentially population values, although confidence intervals for the estimates are desirable.

The general rationale of Monte Carlo simulations is to assess various properties of estimators and/or procedures that are not otherwise mathematically tractable. A special case of this is comparing the nominal and empirical values (e.g., Type I error rate, statistical power, standard error) of a quantitative method. Nominal values are those that are specified by the analyst (i.e., they represent the desired), whereas empirical values are those observed (i.e., they represent the actual) from the Monte Carlo simulation study. Ideally, the nominal and empirical values are equivalent, but this is not always the case. Verification that the nominal and empirical values are consistent can be the primary motivation for using a Monte Carlo simulation study.

As an example, under certain assumptions the standardized mean difference follows a known distribution, which in this case allows for exact analytic confidence intervals to be constructed for the population standardized mean difference. One of the assumptions on which the analytic procedure is based is that in the population, the scores within each of the two groups distribute normally.
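The nominal-versus-empirical comparison described above can be sketched in a few lines. The following minimal simulation (a generic illustration, not a study described in this entry) checks the empirical Type I error rate of a two-sided z test against its nominal .05 level by generating data under the null hypothesis many times.

```python
# Minimal Monte Carlo check of a nominal Type I error rate.
# Under H0 (mu = 0, sigma = 1 known), a two-sided z test at alpha = .05
# should reject in roughly 5% of replications; the empirical rate is the
# proportion of rejections actually observed.
import random
from math import sqrt
from statistics import mean

random.seed(2010)
REPS, N = 10_000, 25
Z_CRIT = 1.96  # two-sided 5% critical value of the standard normal

rejections = 0
for _ in range(REPS):
    sample = [random.gauss(0, 1) for _ in range(N)]  # data generated under H0
    z = mean(sample) / (1 / sqrt(N))                 # known-sigma z statistic
    rejections += abs(z) > Z_CRIT

empirical_rate = rejections / REPS
print(f"nominal .050 vs empirical {empirical_rate:.3f}")
```

If the data were instead generated from a skewed population, any gap between the nominal and empirical rates would quantify the test's robustness to that violation, which is exactly the logic of the studies discussed next.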
In order to evaluate the effectiveness of the (analytic) approach to confidence interval formation when the normality assumption is not satisfied, Ken Kelley implemented a Monte Carlo simulation study and compared the nominal and empirical confidence interval coverage rates. Kelley also compared the analytic approach to confidence interval formation with two bootstrap approaches so as to determine whether the bootstrap performed better than the analytic approach under certain types of nonnormal data. Such comparisons require Monte Carlo simulation studies because no formula-based comparisons are available: the analytic procedure is based on the normality assumption, which was (purposely) not satisfied in the Monte Carlo simulation study.

As another example, under certain assumptions and an asymptotically large sample size, the sample root mean square error of approximation (RMSEA) follows a known distribution, which allows confidence intervals to be constructed for the population RMSEA. However, the effectiveness of the confidence interval procedure had not been well known for finite, and in particular small, sample sizes. Patrick Curran and colleagues have evaluated the effectiveness of the (analytic) confidence interval procedure for the population RMSEA by specifying a model with a known population RMSEA, generating data, forming a confidence interval for the population RMSEA, and replicating the procedure a large number of times. Of interest were the bias when estimating the population RMSEA from sample data and the proportion of confidence intervals that correctly bracketed the known population RMSEA, so as to determine whether the empirical confidence interval coverage was equal to the nominal confidence interval coverage (e.g., 90%).

A Monte Carlo simulation is a special case of a more general method termed the Monte Carlo method. The Monte Carlo method, in general, uses many sets of randomly generated data under some input specifications and applies a particular procedure or model to each set of the randomly generated data so that the output of interest from each fit can be obtained and evaluated. Because of the large number of results of interest from the fitted procedure or model, the summary of the results describes the properties of how the procedure or model works in the specified input conditions.

A particular implementation of the Monte Carlo method is Markov Chain Monte Carlo, a method used to sample from various probability distributions based on a specified model in order to form sample means for approximating expectations. Markov Chain Monte Carlo techniques are most often used in the Bayesian approach to statistical inference, but they can also be used in the frequentist approach.

The term Monte Carlo was coined in the mid-1940s by Nicholas Metropolis while working at the Los Alamos National Laboratory with Stanislaw Ulam and John von Neumann, who proposed the general idea and formalized how determinate mathematical problems could be solved by random sampling from a specified model a large number of times. The name alludes to the games of chance commonly played in Monte Carlo, Monaco, which likewise involve repeating a process a large number of times and then examining the outcomes. The Monte Carlo method essentially replaced what was previously termed statistical sampling. Statistical sampling was famously used by William Sealy Gossett, who published under the name Student; before finalizing the statistical theory of the t distribution, he reported in his paper a comparison of the empirical and nominal properties of the t distribution.

Ken Kelley

See also A Priori Monte Carlo Simulation; Law of Large Numbers; Normality Assumption

Further Readings

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Curran, P. J., Bollen, K. A., Chen, F., Paxton, P., & Kirby, J. B. (2003). Finite sampling properties of the point estimates and confidence intervals of the RMSEA. Sociological Methods Research, 32, 208–252.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Introducing Markov chain Monte Carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter
(Eds.), Markov chain Monte Carlo in practice (pp. 1–20). New York: Chapman & Hall.
Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrap and parametric confidence intervals. Educational & Psychological Measurement, 65, 51–69.
Metropolis, N. (1987). The beginning of the Monte Carlo method. Los Alamos Science, 125–130.
Student. (1908). The probable error of the mean. Biometrika, 6, 1–25.

MORTALITY

Mortality refers to death as a study endpoint or outcome. Broader aspects of the study of death and dying are embraced in the term thanatology. Survival is an antonym for mortality. Mortality may be an outcome variable in populations or samples, associated with treatments or risk factors. It may be a confounder of other outcomes due to resultant missing data or to biases induced when attrition due to death results in structural changes in a sample. Mortality is an event that establishes a metric for the end of the life span. Time to death is frequently used as an outcome and, less frequently, as a predictor variable. This entry discusses the use and analysis of mortality data in research studies.

Population Mortality Rates

Nearly all governments maintain records of deaths. Thus many studies of mortality are based on populations rather than samples. The most common index of death in a specific group is its mortality rate. Interpretation of a mortality rate requires definition of the time, causes of death, and groups involved. Mortality rates are usually specified as the number of deaths in a year per 1,000 individuals or, in circumstances where mortality is rarer, per 100,000 individuals. A mortality rate may be cause specific, that is, refer to death due to a single condition, such as a disease or type of event or exposure. All-cause mortality refers to all deaths regardless of their cause. Mortality rates are often calculated for whole populations but can be expected to vary as a function of demographic variables, notably sex and age. Important subgroup mortality rates, as recognized by the World Health Organization, include the neonatal mortality rate, or deaths during the first 28 days of life per 1,000 live births; the infant mortality rate, or the probability of a child born in a specific year or period dying before reaching the age of 1 year; and the maternal mortality rate, or the number of maternal deaths due to childbearing per 100,000 live births. The adult mortality rate refers to the death rate between 15 and 60 years of age. Age-specific mortality rates refer to the number of deaths in a year (per 100,000 individuals) for individuals of a certain age bracket. In comparing mortality rates between groups, age and other demographics must be borne in mind. Mortality rates may be standardized to adjust for differences in the age distributions of populations.

Use in Research Studies

Mortality and survival are central outcomes in a variety of research settings.

Clinical Trials

In clinical trials studying treatments for life-threatening illnesses, survival rate is often the primary outcome measure. Survival rate is evaluated as 1 minus the corresponding mortality rate. Randomized controlled trials are used to compare survival rates in patients receiving a new treatment to those in patients receiving a standard or placebo treatment. The latter is commonly known as the control group. Such trials should be designed to recruit sufficient numbers of patients and to follow them for long enough to observe deaths likely to occur due to the illness, to ensure adequate statistical power to detect differences in rates.

Epidemiological Studies

Epidemiological studies of mortality compare rates of death across different groups defined by demographic measures, by risk factors, by exposures, or by location.

Prospective Studies

Many studies make an initial assessment of a sample of interest and follow up with participants
Further Readings

Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied survival analysis: Regression modeling of time to event data. New York: Wiley Interscience.
Ripatti, S., Gatz, M., Pedersen, N. L., & Palmgren, J. (2003). Three-state frailty model for age at onset of dementia and death in Swedish twins. Genetic Epidemiology, 24, 139–149.
Schlesselman, J. J. (1982). Case-control studies: Design, conduct, analysis. New York: Oxford University Press.
Whalley, L. J., & Deary, I. J. (2001). Longitudinal cohort study of childhood IQ and survival up to age 76. British Medical Journal, 322, 819–822.
World Health Organization. (2008). WHO Statistical Information System: Indicator definitions and metadata, 2008. Retrieved August 15, 2008, from http://www.who.int/whosis/indicators/compendium/2008/en

MULTILEVEL MODELING

Multilevel modeling (MLM) is a regression-based approach for handling nested and clustered data. Nested data (sometimes referred to as person–period data) occurs when research designs include multiple measurements for each individual, and this approach allows researchers to examine how participants differ, as well as how individuals vary across measurement periods. A good example of nested data is repeated measurements taken from people over time; in this situation, the repeated measurements are nested under each person. Clustered data involve a hierarchical structure such that individuals in the same group are hypothesized to be more similar to each other than to individuals in other groups. A good example of clustered data is the study of classrooms within different schools; in this situation, classrooms are embedded within the schools. Standard (ordinary least squares [OLS]) regression approaches assume that each observation in a data set is independent. Thus, it is immediately obvious that nested and hierarchically structured data violate this assumption of independence. MLM techniques arose to address this limitation of OLS regression. As discussed below, however, most of the common MLM techniques are extensions of OLS regression and are accessible to anyone with a basic working knowledge of multiple regression.

History and Advantages

Statistical analyses conducted within an MLM framework date back to the late 19th century and the work of George Airy in astronomy, but the basic specifications used today were greatly advanced in the 20th century by Ronald Fisher and Churchill Eisenhart's introduction of fixed- and random-effects modeling. MLM permits the analysis of interdependent data without violating the assumptions of standard multiple regression. A critical statistic for determining the degree of interrelatedness in one's data is the intraclass correlation (ICC), calculated as the ratio of between-group variance to the total variance (the sum of the between-group and within-group variance). The degree to which the ICC affects alpha levels is dependent on the size of a sample; small ICCs inflate alpha in large samples, whereas large ICCs will inflate alpha in small samples. A high ICC suggests that the assumption of independence is violated. When the ICC is high, using traditional methods such as multiple linear regression is problematic because ignoring the interdependence in the data will often yield biased results by artificially inflating the effective sample size in the analysis, which can lead to spurious statistically significant findings. In addition, it is important to account for the nested structure of the data (that is, nonindependence) to generate an accurate model of the variation that is due to differences between groups and between subjects after accounting for differences within groups and within subjects. Because variation within groups and within individuals usually accounts for most of the total variance, disregarding this information will bias these estimates.

In addition to its ability to handle nonindependent data, an advantage of MLM is that more traditional approaches for studying repeated measures, such as repeated measures analysis of variance (ANOVA), assume that data are completely balanced, with, for example, the same number of students per classroom or equivalent measurements for each individual. Missing data or unbalanced designs cannot be accommodated with repeated measures ANOVA; such cases are dropped from further analysis. MLM techniques were designed to use an iterative process of model estimation by which all data can be used in analysis; the two
most common approaches are maximum and restricted maximum likelihood estimation (both of which are discussed later in this entry).

Important Distinctions

Multilevel models are also referred to as hierarchical linear models, mixed models, general linear mixed models, latent curve growth models, variance components analysis, random coefficients models, or nested or clustered models. These terms are appropriate and correct, depending on the field of study, but the multiple names can also lead to confusion and apprehension associated with using this form of statistical analysis. For instance, HLM 6.0, developed by Tony Bryk and Steve Raudenbush, is a statistical program that can handle both nested and clustered data sets, but MLM can be conducted in other popular statistical programs, including SAS, SPSS (an IBM company, formerly called PASW Statistics), and R, as well as a host of software for analyzing structural equation models (e.g., LISREL, MPlus). MLM can be used for nonlinear models as well, such as those associated with trajectories of change and growth, which is why the terms hierarchical linear modeling and general linear mixed models can be misleading. In addition, latent curve growth analysis, a structural equation modeling technique used to fit different curves associated with individual trajectories of change, is statistically identical to regression-based MLM. The different approaches recognize that data are not always nested within a hierarchical structure and that data may exhibit nonlinear trajectories such as quadratic or discontinuous change. MLM is also referred to as mixed models analysis by researchers interested in the differences between subjects and groups, and within subjects and groups, that account for variance in their data. Finally, the terms variance components analysis and random coefficients models refer to variance that is assumed to be random across groups or individuals, as opposed to fixed, as is assumed in single-level regression; MLM is referred to by this terminology as well.

Person–Period and Clustered Data

As mentioned, MLM can be used flexibly with multiple types of data structures. Two of the most common structures are person–period data and clustered data. Person–period data examine both between- and within-individual variation, with the latter examining how an individual varies across a measurement period. Most studies having this design include longitudinal data that examines individual growth. An example might be a daily diary study in which each diary entry or measurement (Level 1) is nested within an individual (Level 2). A researcher might be interested in examining change within an individual's daily measurements across time or might be investigating how daily ratings differ as a function of between-individual factors such as personality, intelligence, age, and so forth. Thus, at the within-person level (Level 1), there may be predictors associated with an individual's rating at any given occasion, but between-person (Level 2) variables may also exist that moderate the strength of that association. As opposed to assuming that individuals' responses are independent, the assumption of MLM is that these responses are inherently related and more similar within an individual than they are across individuals.

A similar logic applies for clustered data. A common example of hierarchically structured data in the education literature assumes that students in the same classroom will be more similar to each other than to students in another class. This might result from being exposed to the same teacher, materials, class activities, teaching approach, and so forth. Thus, students' individual responses (Level 1) are considered nested within classrooms (Level 2). A researcher might be interested in examining individual students' performance on an arithmetic test to ascertain whether variability is due to differences among students (Level 1) or between classrooms (Level 2). Similar to its use with person–period data, MLM in this example can investigate within-classroom variability at the lowest level and between-classroom variability at the highest level of the hierarchy.

The main distinction between these data structures rests in the information that is gleaned for data analysis. For person–period data, one can make inferences regarding variability within person responses or trajectories of change over time, which may help answer questions related to the study of change. For clustered data, one can study differences among and within groups, which may
Multilevel Modeling 837
help answer questions regarding program evaluation. Perhaps not surprisingly, these two structures also can be combined, such as when multiple arithmetic exams are given over time (Level 1) and nested within each student (Level 2) who remains assigned to a classroom (Level 3). A thorough discussion of this three-level example is beyond the scope of the present entry, but the topic is raised as an example of the flexibility and sophistication of MLM.

The Multilevel Model

This discussion follows the formal notation introduced by Bryk and Raudenbush, Judith Singer, John Willett, and others. A standard two-level equation for the lower and higher levels of a hierarchy that includes a predictor at Level 1 is presented first, followed by a combined equation showing the collapsed single-level model. This last step is important because, depending on which software is chosen, the two-level model (e.g., HLM 6.0) or the collapsed model (e.g., SAS Proc Mixed) may require an explicit equation. It is important to note, also, that these equations are the same for any two-level person–period or clustered data set.

The two-level model is presented below in its simplest form:

Level 1: Yij = β0i + β1i + εij,   (1)

where i refers to individual and j refers to time, β0i is the intercept for this linear model, and β1i is the slope for the trajectory of change. Notice that the Level 1 equation looks almost identical to the equation used for a simple linear regression. The main differences are the error term εij and the introduction of subscripts i and j. The error term signifies random measurement error associated with data that, contrary to the slope, deviate from linearity. For the earlier daily diary (individual–period) example, the Level 1 equation details that individual i's rating at time j is dependent on his or her first rating (β0i) and the slope of linear change (β1i) between time (or occasion) 1 and time j (note that β1i does not always represent time, but more generally represents a 1-unit change from baseline in the time-varying Level 1 predictor; the equations are identical, however). The Level 2 equation estimates the individual growth parameters of the intercept and slope for each individual by the following equations:

Level 2: β0i = γ00 + ζ0i
         β1i = γ10 + ζ1i   (2)

where ζ0i and ζ1i indicate that the Level 2 outcomes (β0i and β1i, the intercept and the slope from the Level 1 model) each have a residual term, while γ00 represents the grand mean and γ10 indicates the grand slope for the sample. This means that the intercept and slope are expected to vary across individuals and will deviate from the average intercept and slope of the entire sample.

By substituting the Level 2 equations into the Level 1 equation, one can derive the collapsed model:

Yij = [(γ00 + ζ0i) + (γ10 + ζ1i)] + εij
Yij = (γ00 + γ10) + (ζ1i + ζ0i + εij).   (3)

Types of Questions Answered Using Multilevel Models

This section details the most useful and common approaches for examining multilevel data and is organized in a stepwise fashion, with each consequent model adding more information and complexity.

Unconditional Means Model

This approach is analogous to a one-way ANOVA examining the random effect, or variance in means, across individuals in a person–period data set, or across groups with clustered data. This model is run without any predictors at Level 1, which is equivalent to the model included in Equation 1 without any Level 1 predictors (i.e., the unconditional means model with only the intercept and error term). It is by running this model that one can determine the ICC and assess whether a multilevel analysis is indeed warranted. Thus, the unconditional means model provides an estimate of how much variance exists between groups and between subjects, as well as within groups and within subjects in the sample.
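The ICC computation just described can be illustrated numerically. For a balanced design, the one-way ANOVA decomposition that the entry invokes yields the ICC directly; the sketch below is a minimal pure-Python illustration (the classroom scores are invented), not any package's implementation:

```python
from statistics import mean, variance

def icc_oneway(groups):
    """Estimate the intraclass correlation (ICC) for balanced clustered
    data from the one-way ANOVA decomposition, the analogy the entry
    draws for the unconditional means model."""
    n = len(groups[0])                              # common cluster size (balanced design)
    msb = n * variance([mean(g) for g in groups])   # between-cluster mean square
    msw = mean(variance(g) for g in groups)         # pooled within-cluster mean square
    return (msb - msw) / (msb + (n - 1) * msw)

# Hypothetical arithmetic scores for three classrooms (Level 2),
# five students each (Level 1):
classrooms = [[70, 72, 68, 71, 69],
              [80, 83, 79, 82, 81],
              [60, 62, 59, 61, 58]]
print(round(icc_oneway(classrooms), 3))
```

A large ICC (here, near 1) indicates that most of the variance lies between classrooms, so a multilevel analysis is warranted; an ICC near 0 would suggest that single-level regression suffices.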
Random Intercept and Slope

This approach is ideal for data in which subjects are measured on different time schedules, and it allows each subject to deviate from the population in terms of both intercept and slope. Thus, each individual can have a different growth trajectory even if the hypothesis is that the population will approximate similar shapes in its growth trajectory. Using this method, one can specify the residuals for the intercept and slopes to be zero or nonzero and alter the variance structures by group such that they are equal or unbalanced.

Partitioning Variance Within Groups

Unstructured

An alternative approach is to estimate the within-group random effects. This approach assumes that each subject and/or group is independent with equivalent variance components. In this approach, the variance can differ at any time, and covariance can exist between all the variance components. This is the default within-groups variance structure in most statistical software packages and should serve as a starting point in model testing unless theory or experimental design favors another approach.

Within-Group Compound Symmetric

This approach constrains the variance and covariance to a single value. Doing so assumes that the variance is the same regardless of the time the individual was measured or the subject within the group and that the correlation between measurements will be equivalent. A subspecification of this error structure is the heterogeneous compound symmetric, which allows the variances to differ across measurements while constraining the correlation between measurements to a single value.

Autoregressive

Perhaps the most useful error structure for longitudinal data, this approach dictates that variance is the same at all times but that covariance decreases as measurement occasions are further apart. From a theoretical standpoint, this may not be the case in clustered data sets, but one can see how a person's ratings may be more similar on Days 1 and 2 compared with Days 1 and 15.

Model Estimation Methods

As stated above, MLM differs from repeated measures ANOVA in that it uses all available data. To accomplish this, most statistical packages use a form of maximum likelihood (ML) estimation. ML estimations are favored because it is assumed that they converge on population estimates, that the sampling distribution is equivalent to the known variance, and that the standard error derived from the use of this method is smaller than from other approaches. These advantages apply only with large samples because ML estimations are biased toward large samples and their variance estimation may become unreliable with a smaller data set. There are two types of ML techniques, full ML and restricted ML. Full ML assumes that the dependent variable is normally distributed, and the mean is based on the regression coefficients and the variance components. Restricted ML uses the least squares residuals that remain after the influence of the fixed effects is removed and only the variance components remain. With ML algorithms, a statistic of fit (e.g., the deviance statistic) is usually compared across models to reveal which model best accounts for variance in the dependent variable. Multiple authors suggest that when using restricted ML, one should make sure that models include the same fixed effects and that only the random effects vary, because one wants to make sure the fixed effects are accounted for equivalently across models.

A second class of estimation methods is extensions of OLS estimation. Generalized least squares (GLS) estimation allows the residuals to be autocorrelated and have more dispersion of the variances, but it requires that the actual amount of autocorrelation and dispersion be known in the population in order for one to accurately estimate the true error in the covariance matrix. In order to account for this, GLS uses the estimated error covariance matrix as the true error covariance matrix, and then it estimates the fixed effects and associated standard errors. Another approach, iterative GLS, is merely an extension of GLS and uses iterations that repeatedly estimate and refit the model until either the model is ideally converged or the
maximum number of iterations has occurred. This method also works only with large and relatively balanced data.

Other Applications

The applications of MLM techniques are virtually limitless. Dyadic analytic methods are being used in MLM software, with individuals considered nested within dyads. Mediation and moderation analysis are possible both within levels of the hierarchy and across levels. Moderated mediation and mediated moderation principles, relatively new to the literature, also can be applied within an MLM framework.

Resources

MLM workshops are offered by many private companies and university-based educational programs. Articles on MLM are easily located on academic databases. Another resource is the University of California–Los Angeles Stat Computing Portal, which has links to pages and articles of interest directly related to different aspects of MLM. Finally, those interested in exploring MLM will find it easily accessible as some programs are designed specifically for MLM analyses (e.g., HLM 6.0), but many of the more commonly used statistical packages (e.g., SAS and SPSS) have the same capabilities.

Lauren A. Lee and David A. Sbarra

See also Analysis of Variance (ANOVA); Growth Curve; Hierarchical Linear Modeling; Intraclass Correlation; Latent Growth Modeling; Longitudinal Design; Mixed- and Random-Effects Models; Mixed Model Design; Nested Factor Design; Regression Artifacts; Repeated Measures Design; Structural Equation Modeling; Time-Series Study; Variance

Further Readings

Heck, R. H., & Thomas, S. L. (2000). An introduction to multilevel modeling techniques. Mahwah, NJ: Lawrence Erlbaum.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. Thousand Oaks, CA: Sage.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational & Behavioral Statistics, 23, 323–355.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.
Wallace, D., & Green, S. B. (2002). Analysis of repeated measures designs with linear mixed models. In D. S. Moskowitz & S. L. Hershberger (Eds.), Modeling intraindividual variability with repeated measures data: Methods and applications (pp. 103–134). Mahwah, NJ: Lawrence Erlbaum.

Websites

Scientific Software International. Hierarchical Linear and Nonlinear Modeling (HLM): http://www.ssicentral.com/hlm
UCLA Stat Computing Portal: http://statcomp.ats.ucla.edu

MULTIPLE COMPARISON TESTS

Many research projects involve testing multiple research hypotheses. These research hypotheses could be evaluated using comparisons of means, bivariate correlations, regressions, and so forth, and in fact most studies consist of a mixture of different types of test statistics. An important consideration when conducting multiple tests of significance is how to deal with the increased likelihood (relative to conducting a single test of significance) of falsely declaring one (or more) hypotheses statistically significant, termed the multiple comparisons problem. This multiple comparisons problem is especially relevant to the topic of research design because the issues associated with the multiple comparisons problem relate directly to designing studies (i.e., number and nature of variables to include) and deriving a data analysis strategy for the study. This entry introduces the multiple comparisons problem and discusses some of the strategies that have been proposed for dealing with it.

The Multiple Comparisons Problem

To help clarify the multiple comparisons problem, imagine a soldier who needed to cross fields containing land mines in order to obtain supplies. It is
clear that the more fields the individual crosses, the greater the probability that he or she will activate a land mine; likewise, researchers conducting many tests of significance have an increased chance of erroneously finding tests significant. It is important to note that although the issue of multiple hypothesis tests has been labeled the multiple comparisons problem, most likely because a lot of the research on multiple comparisons has come within the framework of mean comparisons, it applies to any situation in which multiple tests of significance are being performed.

Imagine that a researcher is interested in determining whether overall course ratings differ for lecture, seminar, or computer-mediated instruction formats. In this type of experiment, researchers are often interested in whether significant differences exist between any pair of formats; for example, do the ratings of students in lecture-format classes differ from the ratings of students in seminar-format classes? The multiple comparisons problem in this situation is that in order to compare each format in a pairwise manner, three tests of significance need to be conducted (i.e., comparing the means of lecture vs. seminar, lecture vs. computer-mediated, and seminar vs. computer-mediated instruction formats). There are numerous ways of addressing the multiple comparisons problem and dealing with the increased likelihood of falsely declaring tests significant.

Common Multiple Testing Situations

There are many different settings in which researchers conduct null hypothesis testing, and the following are just a few of the more common settings in which multiplicity issues arise: (a) conducting pairwise and/or complex contrasts in a linear model with categorical variables; (b) conducting multiple main effect and interaction tests in a factorial analysis of variance (ANOVA) or multiple regression setting; (c) analyzing multiple simple effect, interaction contrast, or simple slope tests when analyzing interactions in linear models; (d) analyzing multiple univariate ANOVAs after a significant multivariate ANOVA (MANOVA); (e) analyzing multiple correlation coefficients; (f) assessing the significance of multiple factor loadings or factor correlations in factor analysis; (g) analyzing multiple dependent variables separately in linear models; (h) evaluating multiple parameters simultaneously in a structural equation model; and (i) analyzing multiple brain voxels for stimulation in functional magnetic resonance imaging research. Further, as stated previously, most studies involve a mixture of many different types of test statistics.

An important factor in understanding the multiple comparisons problem is understanding the different ways in which a researcher can "group" his or her tests of significance. For example, suppose, in the study looking at whether student ratings differ across instruction formats, that there was also another independent variable, the sex of the instructor. There would now be two "main effect" variables (instruction format and sex of the instructor) and potentially an interaction between instruction format and sex of the instructor. The researcher might want to group the hypotheses tested under each of the main effect (e.g., pairwise comparisons) and interaction (e.g., simple effect tests) hypotheses into separate "families" (groups of related hypotheses) that are considered simultaneously in the decision process. Therefore, control of the Type I error (error of rejecting a true null hypothesis) rate might be imposed separately for each family, or in other words, the Type I error rate for each of the main effect and interaction families is maintained at α. On the other hand, the researcher may prefer to treat the entire set of tests for all main effects and interactions as one family, depending on the nature of the analyses and the way in which inferences regarding the results will be made. The point is that when researchers conduct multiple tests of significance, they must make important decisions about how these tests are related, and these decisions will directly affect the power and Type I error rates for both the individual tests and for the group of tests conducted in the study.

Type I Error Control

Researchers testing multiple hypotheses, each with a specified Type I error probability (α), risk an increase in the overall probability of committing a Type I error as the number of tests increases. In some cases, it is very important to control for Type I errors. For example, if the goal of the researcher comparing the three classroom instruction formats
described earlier was to identify the most effective instruction method that would then be adopted in schools, it would be important to ensure that a method was not selected as superior by chance (i.e., a Type I error). On the other hand, if the goal of the research was simply to identify the best classroom formats for future research, the risks associated with Type I errors would be reduced, whereas the risk of not identifying a possibly superior method (i.e., a Type II error) would be increased. When many hypotheses are being tested, researchers must specify not only the level of significance, but also the unit of analysis over which Type I error control will be applied. For example, the researcher comparing the lecture, seminar, and computer-mediated instruction formats must determine how Type I error control will be imposed over the three pairwise tests. If the probability of committing a Type I error is set at α for each comparison, then the probability that at least one Type I error is committed over all three pairwise comparisons can be much higher than α. On the other hand, if the probability of committing a Type I error is set at α for all tests conducted, then the probability of committing a Type I error for each of the comparisons can be much lower than α. The conclusions of an experiment can be greatly affected by the unit of analysis over which Type I error control is imposed.

Units of Analysis

Several different units of analysis (i.e., error rates) have been proposed in the multiple comparison literature. The majority of the discussion has focused on the per-test and familywise error rates, although other error rates, such as the false discovery rate, have recently been proposed.

Per-Test Error Rate

Controlling the per-test error rate (αPT) involves simply setting the α level for each test (αT) equal to the global α level. Recommendations for controlling αPT center on a few simple but convincing arguments. First, it can be argued that the natural unit of analysis is the test. In other words, each test should be considered independent of how many other tests are being conducted as part of that specific analysis or that particular study or, for that matter, how many tests the researcher might conduct over his or her career. Second, real differences between treatment groups are more likely with a greater number of treatment groups. Therefore, emphasis in experiments should be not on controlling for unlikely Type I errors but on obtaining the most power for detecting even small differences between treatments. The third argument is the issue of (in)consistency. With more conservative error rates, different conclusions can be found regarding the same hypothesis, even if the test statistics are identical, because the per-test α level depends on the number of comparisons being made. Last, one of the primary advantages of αPT control is convenience. Each of the tests is evaluated with any appropriate test statistic and compared to an α-level critical value.

The primary disadvantage of αPT control is that the probability of making at least one Type I error increases as the number of tests increases. The actual increase in the probability depends, among other factors, on the degree of correlation among the tests. For independent tests, the probability of a Type I error with T tests is 1 − (1 − α)^T, whereas for nonindependent tests (e.g., all pairwise comparisons, multiple path coefficients in structural equation modeling), the probability of a Type I error is less than 1 − (1 − α)^T. In general, the more tests that a researcher conducts in his or her experiment, the more likely it is that one (or more) will be significant simply by chance.

Familywise Error Rate

The familywise error rate (αFW) is defined as the probability of falsely rejecting one or more hypotheses in a family of hypotheses. Controlling αFW is recommended when some effects are likely to be nonsignificant; when the researcher is prepared to perform many tests of significance in order to find a significant result; when the researcher's analysis is exploratory, yet he or she still wants to be confident that a significant result is real; and when replication of the experiment is unlikely.

Although many multiple comparison procedures purport to control αFW, procedures are said to provide strong αFW control if αFW is maintained at approximately α when all population means are equal (complete null) and when the complete null
is not true but multiple subsets of the population means are equal (partial nulls). Procedures that control αFW for the complete null, but not for partial nulls, provide weak αFW control.

The main advantage of αFW control is that the probability of making a Type I error does not increase with the number of comparisons conducted in the experiment. One of the main disadvantages of procedures that control αFW is that αT decreases, often substantially, as the number of tests increases. Therefore, procedures that control αFW have reduced power for detecting treatment effects when there are many comparisons, increasing the potential for inconsistent results between experiments.

False Discovery Rate

The false discovery rate represents a compromise between strict αFW control and liberal αPT control. Specifically, the false discovery rate (αFDR) is defined as the expected ratio (Q) of the number of erroneous rejections (V) to the total number of rejections (R = V + S), where S represents the number of true rejections. Therefore, E(Q) = E(V/[V + S]) = E(V/R).

If all null hypotheses are true, αFDR = αFW. On the other hand, if some null hypotheses are false, αFDR ≤ αFW, resulting in weak control of αFW. As a result, any procedure that controls αFW also controls αFDR, but procedures that control αFDR can be much more powerful than those that control αFW, especially when a large number of tests are performed, and do not entirely dismiss the multiplicity issue (as with αPT control). Although some researchers recommend exclusive use of αFDR control, it is often recommended that αFDR control be reserved for exploratory research, nonsimultaneous inference (e.g., if one had multiple dependent variables and separate inferences would be made for each), and very large family sizes (e.g., as in an investigation of potential activation of thousands of brain voxels in functional magnetic resonance imaging).

Multiple Comparison Procedures

This section introduces some of the multiple comparison procedures that are available for controlling αFW and αFDR. Recall that no multiple comparison procedure is necessary for controlling αPT, as the α level for each test is equal to the global α level and can therefore be used seamlessly with any test statistic. It is important to note that the procedures introduced here are only a small subset of the procedures that are available, and for the procedures that are presented, specific details are not provided. Please see specific sections of the encyclopedia for details on these procedures.

Familywise Error Controlling Procedures for Any Multiple Testing Environment

Bonferroni

This simple-to-use procedure sets αT = α/T, where T represents the number of tests being performed. The important assumption of the Bonferroni procedure is that the tests being conducted are independent. When this assumption is violated (and it commonly is), the procedure will be too conservative.

Dunn-Sidák

The Dunn-Sidák procedure is a more powerful version of the original Bonferroni procedure. With the Dunn-Sidák procedure, αT = 1 − (1 − α)^(1/T).

Holm

Sture Holm proposed a sequential modification of the original Bonferroni procedure that can be substantially more powerful than the Bonferroni or Dunn-Sidák procedures.

Hochberg

Yosef Hochberg proposed a modified step-up Bonferroni procedure that combined Simes's inequality with Holm's testing procedure to create a multiple comparison procedure that can be more powerful and simpler than the Holm procedure.

Familywise Error Controlling Procedures for Pairwise Multiple Comparison Tests

Tukey

John Tukey proposed the honestly significant difference procedure, which accounts for dependencies among the pairwise comparisons and is
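The error rates and Bonferroni-family adjustments described above reduce to one-line formulas. The sketch below is a minimal illustration (the p values are invented), not a reproduction of any package's implementation:

```python
alpha, T = 0.05, 3          # e.g., the three pairwise instruction-format comparisons

# Per-test control: each test is run at the unadjusted alpha. For T
# independent tests the chance of at least one Type I error is 1 - (1 - alpha)^T.
fwer = 1 - (1 - alpha) ** T                  # ≈ 0.1426, well above .05

bonferroni_alpha = alpha / T                 # Bonferroni: alpha_T = alpha / T
sidak_alpha = 1 - (1 - alpha) ** (1 / T)     # Dunn-Sidák: slightly larger, hence more powerful

def holm_reject(pvals, alpha=0.05):
    """Holm's sequential Bonferroni: test the smallest p value at alpha/T,
    the next at alpha/(T - 1), and so on, stopping at the first failure."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for step, i in enumerate(order):
        if pvals[i] > alpha / (len(pvals) - step):
            break                            # this and all larger p values fail
        reject[i] = True
    return reject

print(round(fwer, 4), round(bonferroni_alpha, 4), round(sidak_alpha, 4))
print(holm_reject([0.004, 0.03, 0.03]))      # only the smallest p value survives here
```

Because Holm never rejects fewer hypotheses than plain Bonferroni at the same α, it is a safe drop-in replacement whenever αFW control is the goal.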
844 Multiple Regression
variable based on knowledge of its association with certain independent variables. In this context, the independent variables are commonly referred to as predictor variables and the dependent variable is characterized as the criterion variable. In applied settings, it is often desirable for one to be able to predict a score on a criterion variable by using information that is available in certain predictor variables. For example, in the life insurance industry, actuarial scientists use complex regression models to predict, on the basis of certain predictor variables, how long a person will live. In scholastic settings, college and university admissions offices will use predictors such as high school grade point average (GPA) and ACT scores to predict an applicant's college GPA, even before he or she has entered the university.

Multiple regression is most commonly used to predict values of a criterion variable based on linear associations with predictor variables. A brief example using simple regression easily illustrates how this works. Assume that a horticulturist developed a new hybrid maple tree that grows exactly 2 feet for every year that it is alive. If the height of the tree was the criterion variable and the age of the tree was the predictor variable, one could accurately describe the relationship between the age and height of the tree with the formula for a straight line, which is also the formula for a simple regression equation:

Y = bX + a,

where b is the slope (here, 2) and a is the intercept (here, 0). With this equation, one could determine exactly how tall a tree is simply by knowing what its age is. A 5-year-old tree will be 10 feet tall, an 8-year-old tree will be 16 feet tall, and so on.

At this point two important issues must be considered. First, virtually any time one is working with variables from people, animals, plants, and so forth, there are no perfect linear associations. Sometimes students with high ACT scores do poorly in college whereas some students with low ACT scores do well in college. This shows how there can always be some error when one uses regression to predict values on a criterion variable. The stronger the association between the predictor and criterion variable, the less error there will be in that prediction. Accordingly, regression is based on the line of best fit, which is simply the line that will best describe or capture the relationship between X and Y by minimizing the extent to which any data points fall off that line.

A college admissions committee wants to be able to predict the graduating GPA of the students whom they admit. The ACT score is useful for this, but as noted above, it does not have a perfect association with college GPA, so there is some error in that prediction. This is where multiple regression becomes very useful. By taking into account the association of additional predictor variables with college GPA, one can further minimize the error in predicting college GPA. For example, the admissions committee might also collect information on high school GPA and use that in conjunction with the ACT score to predict college GPA. In this case, the regression equation would be
a criterion variable. In social scientific research, values of the independent and dependent variables are almost always known. In such cases multiple regression is used to test whether and to what extent the independent variables explain the dependent variable. Most often the researcher has theories and hypotheses that specify causal relations among the independent variables and the dependent variable. Multiple regression is a useful tool for testing such hypotheses. For example, an economist is interested in testing a hypothesis about the determinants of workers' salaries. The model being tested could be depicted as follows:

family of origin SES → education → salary,

where SES stands for socioeconomic status.

In this simple model, the economist hypothesizes that the SES of one's family of origin will influence how much formal education one acquires, which in turn will predict one's salary. If the economist collected data on these three variables from a sample of workers, the hypotheses could be tested with a multiple regression model that is comparable to the one presented previously in the college GPA example:

Y = b1X1 + b2X2 + a

In this case, what was Y′ is now Y because the value of Y is known. It is useful to deconstruct the components of this equation to show how they can be used to test various aspects of the economist's model.

In the equation above, b1 and b2 are the partial regression coefficients. They are the weights by which one multiplies the value of X1 and X2 when all variables are in the equation. In other words, they represent the expected change in Y, per unit of X, when all other variables are accounted for, or held constant. Computationally, the values of b1 and b2 can be determined easily by simply knowing the zero-order correlations among all possible pairwise combinations of Y, X1, and X2, as well as the standard deviations of the three variables:

b1 = [(rYX1 − rYX2 · rX1X2) / (1 − r²X1X2)] · (sY / sX1);
b2 = [(rYX2 − rYX1 · rX1X2) / (1 − r²X1X2)] · (sY / sX2),

where rYX1 is the Pearson correlation between Y and X1, rX1X2 is the Pearson correlation between X1 and X2, and so on, and sY is the standard deviation of variable Y, sX1 is the standard deviation of X1, and so on. The partial regression coefficients are also referred to as unstandardized regression coefficients because they represent the value by which one would multiply the raw X1 or X2 score in order to arrive at Y. In the salary example, these coefficients could look something like this:

Y = 745.67X1 + 104.36X2 + 11,325.

This means that subjects' annual salaries are best described by an equation whereby their family of origin SES is multiplied by 745.67, their years of formal education are multiplied by 104.36, and these products are added to 11,325. Notice how the regression coefficient for SES is much larger than that for years of formal education. Although it might be tempting to assume that family of origin SES is weighted more heavily than years of formal education, this would not necessarily be correct. The magnitude of an unstandardized regression coefficient is strongly influenced by the units of measurement used to assess the independent variable with which it is associated. In this example, assume that SES is measured on a 5-point scale (Levels 1–5) and years of formal education, at least in the sample, runs from 7 to 20. These differing scale ranges have a profound effect on the magnitude of each regression coefficient, rendering them incomparable.

However, it is often the case that researchers want to understand the relative importance of each independent variable for explaining variation in the dependent variable. In other words, which is the more powerful determinant of people's salaries, their education or their family of origin's socioeconomic status? This question can be evaluated by examining the standardized regression coefficient, or β. Computationally, β can be determined by the following formulas:

β1 = (rYX1 − rYX2 · rX1X2) / (1 − r²X1X2); β2 = (rYX2 − rYX1 · rX1X2) / (1 − r²X1X2).

The components of these formulas are identical to those for the unstandardized regression coefficients, but they lack multiplication by the ratio of standard deviations of Y and X1 and X2. Incidentally, one can
Multiple Regression 847
easily convert β to b with the following formulas, which illustrate their relationship:

b1 = β1 (s_Y / s_1); b2 = β2 (s_Y / s_2),

where s_Y is the standard deviation of variable Y, and so on.

Standardized regression coefficients can be thought of as the weight by which one would multiply a standardized score (or z score) for each independent variable in order to arrive at the z score for the dependent variable. Because z scores essentially equate all variables on the same scale, researchers are inclined to make comparisons about the relative impact of each independent variable by comparing their associated standardized regression coefficients, sometimes called beta weights.

In the economist's hypothesized model of workers' salaries, there are several subhypotheses or research questions that can be evaluated. For example, the model presumes that both family of origin SES and education will exert a causal influence on annual salary. One can get a sense of which variable has a greater impact on salary by comparing their beta weights. However, it is also important to ask whether either of the independent variables is a significant predictor of salary. In effect, these tests ask whether each independent variable explains a statistically significant portion of the variance in the dependent variable, independent of that explained by the other independent variable(s) also in the regression equation. This can be accomplished by dividing the β by its standard error (SE_β). This ratio is distributed as t with degrees of freedom = n − k − 1, where n is the sample size and k is the number of independent variables in the regression analysis. Stated more formally,

t = β / SE_β.

If this ratio is significant, that implies that the particular independent variable uniquely explains a statistically significant portion of the variance in the dependent variable. These t tests of the statistical significance of each independent variable are routinely provided by computer programs that conduct multiple regression analyses. They play an important role in testing hypotheses about the role of each independent variable in explaining the dependent variable.

In addition to concerns about the statistical significance and relative importance of each independent variable for explaining the dependent variable, it is important to understand the collective function of the independent variables for explaining the dependent variable. In this case, the question is whether the independent variables collectively explain a significant portion of the variance in scores on the dependent variable. This question is evaluated with the multiple correlation coefficient. Just as a simple bivariate correlation is represented by r, the multiple correlation coefficient is represented by R. In most contexts, data analysts prefer to use R² to understand the association between the independent variables and the dependent variable. This is because the squared multiple correlation coefficient can be thought of as the percentage of variance in the dependent variable that is collectively explained by the independent variables. So, an R² value of .65 implies that 65% of the variance in the dependent variable is explained by the combination of independent variables. In the case of two independent variables, the formula for the squared multiple correlation coefficient can be explained as a function of the various pairwise correlations among the independent and dependent variables:

R² = (r²_YX1 + r²_YX2 − 2 r_YX1 r_YX2 r_X1X2) / (1 − r²_X1X2).

In cases with more than two independent variables, this formula becomes much more complex, requiring the use of matrix algebra. In such cases, calculation of R² is ordinarily left to a computer.

The question of whether the collection of independent variables explains a statistically significant amount of variance in the dependent variable can be approached by testing the multiple correlation coefficient for statistical significance. The test can be carried out by the following formula:

F = [R² (n − k − 1)] / [(1 − R²) k].

This test is distributed as F with df = k in the numerator and n − k − 1 in the denominator,
where n is the sample size and k is the number of independent variables.

Two important features of the test for significance of the multiple correlation coefficient require discussion. First, notice how the sample size, n, appears in the numerator. This implies that, all other things held constant, the larger the sample size, the larger the F ratio will be. That means that the statistical significance of the multiple correlation coefficient is more probable as the sample size increases. Second, the amount of variation in the dependent variable that is not explained by the independent variables, indexed by 1 − R² (this is called error or residual variance), is multiplied by the number of independent variables, k. This implies that, all other things held equal, the larger the number of independent variables, the larger the denominator, and hence the smaller the F ratio. This illustrates how there is something of a penalty for using a lot of independent variables in a regression analysis. When trying to explain scores on a dependent variable, such as salary, it might be tempting to use a large number of predictors so as to take into account as many possible causal factors as possible. However, as this formula shows, this significance test favors parsimonious models that use only a few key predictor variables.

Methods of Variable Entry

Computer programs used for multiple regression provide several options for the order of entry of each independent variable into the regression equation. The order of entry can make a difference in the results obtained and therefore becomes an important analytic consideration. In hierarchical regression, the data analyst specifies a particular order of entry of the independent variables, usually in separate steps for each. Although there are multiple possible logics by which one would specify a particular order of entry, perhaps the most common is that of causal priority. Ordinarily, one would enter independent variables in order from the most distal to the most proximal causes. In the previous example of the workers' salaries, a hierarchical regression analysis would enter family of origin SES into the equation first, followed by years of formal education. As a general rule, in hierarchical entry, an independent variable entered into the equation later should never be the cause of an independent variable entered into the equation earlier. Naturally, hierarchical regression analysis is facilitated by having a priori theories and hypotheses that specify a particular order of causal priority.

Another method of entry that is based purely on empirical rather than theoretical considerations is stepwise entry. In this case, the data analyst specifies the full complement of potential independent variables to the computer program and allows it to enter or not enter these variables into the regression equation, based on the strength of their unique association with the dependent variable. The program keeps entering independent variables up to the point at which addition of any further variables would no longer explain any statistically significant increment of variance in the dependent variable. Stepwise analysis is often used when the researcher has a large collection of independent variables and little theory to explain or guide their ordering or even their role in explaining the dependent variable. Because stepwise regression analysis capitalizes on chance and relies on a post hoc rationale, its use is often discouraged in social scientific contexts.

Assumptions of Multiple Regression

Multiple regression is most appropriately used as a data analytic tool when certain assumptions about the data are met. First, the data should be collected through independent random sampling. Independent means that the data provided by one participant must be entirely unrelated to the data provided by another participant. Cases in which husbands and wives, college roommates, or doctors and their patients both provide data would violate this assumption. Second, multiple regression analysis assumes that there are linear relationships between the independent variables and the dependent variable. When this is not the case, a more complex version of multiple regression known as nonlinear regression must be employed. A third assumption of multiple regression is that at each possible value of each independent variable, the dependent variable must be normally distributed. However, multiple regression is reasonably robust in the case of modest violations of this assumption. Finally, for each possible value of each independent variable, the variance of the residuals or errors in predicting Y (i.e., Y′ − Y) must be
consistent. This is known as the homoscedasticity assumption. Returning to the workers' salaries example, it would be important that at each level of family-of-origin SES (Levels 1–5), the degree of error in predicting workers' salaries was comparable. If the salary predicted by the regression equation was within ± $5,000 for everyone at Level 1 SES, but it was within ± $36,000 for everyone at Level 3 SES, the homoscedasticity assumption would be violated because there is far greater variability in residuals at the higher compared with lower SES levels. When this happens, the validity of significance tests in multiple regression becomes compromised.

Chris Segrin

See also Bivariate Regression; Coefficients of Correlation, Alienation, and Determination; Correlation; Logistic Regression; Pearson Product-Moment Correlation Coefficient; Regression Coefficient

Further Readings

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Sage.
Berry, W. D. (1993). Understanding regression assumptions. Thousand Oaks, CA: Sage.
Cohen, J., Cohen, P., West, S., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York: Wadsworth.

MULTIPLE TREATMENT INTERFERENCE

Multiple treatment interference is a threat to the internal validity of a group design. A problem occurs when participants in one group have received all or some of a treatment in addition to the one assigned as part of an experimental or quasi-experimental design. In these situations, the researcher cannot determine what, if any, influence on the outcome is associated with the nominal treatment and what variance is associated with some other treatment or condition. In terms of independent and dependent variable designations, multiple treatment interference occurs when participants were meant to be assigned to one level of the independent variable (e.g., a certain group with a researcher-assigned condition) but were functionally at a different level of the variable (e.g., they received some of the treatment meant for a comparison group). Consequently, valid conclusions about cause and effect are difficult to make.

There are several situations that can result in multiple treatment interference, and they can occur in either experimental designs (which have random assignment of participants to groups or levels of the independent variable) or quasi-experimental designs (which do not have random assignment to groups). One situation might find one or more participants in one group accidentally receiving, in addition to their designated treatment, the treatment meant for a second group. This can happen administratively in medicine studies, for example, if subjects receive both the drug they are meant to receive and, accidentally, are also given the drug meant for a comparison group. If benefits are found in both groups or in the group meant to receive a placebo (for example), it is unclear whether effects are due to the experimental drug, the placebo, or a combination of the two. The ability to isolate the effects of the experimental drug or (more generally in research design) the independent variable on the outcome variable is the strength of a good research design, and consequently, strong research designs attempt to avoid the threat of multiple treatment interference. A second situation involving multiple treatment interference is more common, especially in the social sciences. Imagine an educational researcher interested in the effects of a new method of reading instruction. The researcher has arranged for one elementary teacher in a school building to use the experimental approach and another elementary teacher to use the traditional method. As is typically the case in educational research, random assignment to the two classrooms is not possible. Scores on a reading test are collected from both classrooms as part of a pre–post test design. The design looks like this:

Experimental Group:
Pretest → 12 weeks of instruction → Posttest
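The first situation described above — comparison-group members functionally receiving the experimental treatment — can be illustrated with a brief simulation. This sketch is hypothetical: the baseline distribution, the 10-point treatment effect, and the 40% contamination rate are invented for illustration and are not taken from this entry.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 500
true_effect = 10.0  # hypothetical benefit of the experimental treatment

# Both groups share the same baseline outcome distribution.
baseline = rng.normal(50.0, 5.0, size=2 * n_per_group)
treated = baseline[:n_per_group] + true_effect  # assigned to treatment
comparison = baseline[n_per_group:]             # assigned to comparison

# Multiple treatment interference: 40% of the comparison group
# accidentally receives the experimental treatment as well.
leak = rng.random(n_per_group) < 0.40
contaminated = comparison + leak * true_effect

clean_estimate = treated.mean() - comparison.mean()
diluted_estimate = treated.mean() - contaminated.mean()

print(f"estimated effect without interference: {clean_estimate:.1f}")
print(f"estimated effect with interference:    {diluted_estimate:.1f}")
```

Because the contaminated comparison mean is pulled toward the treatment mean, the estimated group difference understates the true effect, and nothing in the observed scores reveals how much of the outcome is due to the nominal treatment — which is exactly the inferential problem this entry describes.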
Group 1:
Treatment 1 → Measure outcome

Group 2:
Treatment 2 → Measure outcome

Multitrait–Multimethod Matrix

Structure of the MTMM Matrix

Table 1 (partial; only the Method 3 rows are preserved here):
Method 3  Trait 2  .04 .50 .05  .02 .40 .03  .30 (.85)
          Trait 3  .04 .04 .50  .02 .02 .40  .30 .30 (.85)
Notes: The correlations are artificial. Reliabilities are in parentheses. Heterotrait–monomethod correlations are in the gray subdiagonals. Heterotrait–heteromethod correlations are enclosed by a broken line. Monotrait–heteromethod correlations in the convergent validity diagonals are in bold type.

Table 1 shows a prototypical MTMM matrix for three traits measured by three methods. An
MTMM matrix consists of two major parts: monomethod blocks and heteromethod blocks.

Monomethod Blocks

The monomethod blocks contain the correlations between variables that belong to the same method. In Table 1 there are three monomethod blocks, one for each method. Each monomethod block consists of two parts. The first part (reliability diagonals) contains the reliabilities of the measures. The second part (the heterotrait–monomethod triangles) includes the correlations between different traits that are measured by the same methods. The reliabilities can be considered as monotrait–monomethod correlations.

Heteromethod Blocks

The heteromethod blocks comprise the correlations between traits that were measured by different methods. Table 1 contains three heteromethod blocks, one for each combination of the three methods. A heteromethod block consists of two parts. The validity diagonal (monotrait–heteromethod correlations) contains the correlations of the same traits measured by different methods. The heterotrait–heteromethod triangles cover the correlations of different traits measured by different methods.

Criteria for Evaluating the MTMM Matrix

Campbell and Fiske described four properties an MTMM matrix should show when convergent and discriminant validity is present:

1. The correlations in the validity diagonals (monotrait–heteromethod correlations) should be significantly different from 0, and they should be large. These correlations indicate convergent validity.

2. The heterotrait–heteromethod correlations should be smaller than the monotrait–heteromethod correlations (discriminant validity).

3. The heterotrait–monomethod correlations should be smaller than the monotrait–heteromethod correlations (discriminant validity).

4. The same pattern of trait intercorrelations should be shown in all heterotrait triangles in
the monomethod as well as in the heteromethod blocks (discriminant validity).

a method M2) is the product of the correlations Cor(T1, T2) and Cor(M1, M2):
example, all trait factors are uncorrelated, this would indicate perfect discriminant validity. However, if all method factors are correlated, it is difficult to interpret the uncorrelatedness of the trait factors as perfect discriminant validity because the method factor correlations represent a portion of variance shared by all variables that might be due to a general trait effect.

The major problems of the CTCM model are caused by the correlated method factors. According to Michael Eid, Tanja Lischetzke, and Fridtjof Nussbeck, the problems of the CTCM model can be circumvented by dropping the correlations between the method factors or by dropping one method factor. A CTCM model without correlations between method factors is called a correlated-trait–uncorrelated-method (CTUM) model. This model is a special case of the model depicted in Figure 1 but with uncorrelated method factors. This model is reasonable if correlations between method factors are not expected. According to Eid and colleagues, this is the case when interchangeable methods are considered. Interchangeable methods are methods that are randomly chosen from a set of methods. If one considers different raters as different methods, an example of interchangeable raters (methods) is randomly selected students rating their teacher. If one randomly selects three students for each teacher and if one assigns these three students randomly to three rater groups, the three method factors would represent the deviation of individual raters from the expected (mean) rating of the teacher (the trait scores). Because the three raters are interchangeable, correlations between the method factors would not be expected. Hence, applying the CTUM model in the case of interchangeable raters would circumvent the problems of the CTCM model.

The situation, however, is quite different in the case of structurally different methods. An example of structurally different methods is a self-rating, a rating by the parents, and a rating by the teacher. In this case, the three raters are not interchangeable but are structurally different. In this case, the CTUM model is not reasonable as it may not adequately represent the fact that teachers and parents can share a common view that is not shared with the student (correlations of method effects). In this case, dropping one method factor solves the problem of the CTCM model in many cases. This model with one method factor less than the number of methods considered is called the correlated-trait–correlated-(method − 1) model (CTC[M − 1]). This model is a special case of the model depicted in Figure 1 but with one method factor less. If the first method in Figure 1 is the self-report, the second method is the teacher report, and the third method is the parent report, dropping the first method factor would imply that the three trait factors equal the true-score variables of the self-reports. The self-report method would play the role of the reference method that has to be chosen in this model. Hence, in the CTC(M − 1) model, the trait factor is completely confounded with the reference method. The method factors have a clear meaning. They indicate the deviation of the true (error-free) other reports from the value predicted by the self-report. A method effect is that (error-free) part of a nonreference method that cannot be predicted by the reference method. The correlations between the two method factors would then indicate that the two other raters (teachers and parents) share a common view of the child that is not shared by the child herself or himself. This model allows contrasting methods, but it does not contain common "method-free" trait factors. It is doubtful that such a common trait factor has a reasonable meaning in the case of structurally different methods.

All single indicator models presented so far assume that the method effects belonging to one method are unidimensional, as there is one common method factor for each method. This assumption could be too strong, as method effects could be trait specific. Trait-specific method effects are part of the residual in the models presented so far. That means that reliability will be underestimated because a part of the residual is due to method effects and not due to measurement error. Moreover, the models may not fit the data in the case of trait-specific method effects. If the assumption of unidimensional method factors for a method is too strong, the method factors can be dropped and replaced by correlated residuals. For example, if one replaces the method factors in the CTUM model by correlations of residuals belonging to the same methods, one obtains the correlated-trait–correlated-uniqueness (CTCU) model. However, in this model the reliabilities are underestimated
because method effects are now part of the residuals. Moreover, the CTUM model does not allow correlations between residuals of different methods. This might be necessary in the case of structurally different methods. Problems that are caused by trait-specific method effects can be appropriately handled in multiple indicator models.

Multiple Indicator Models

In multiple indicator models, there are several indicators for one trait–method unit. In the less restrictive model, there is one factor for all indicators belonging to the same trait–method unit. The correlations between these factors constitute a latent MTMM matrix. The correlation coefficients of this latent MTMM matrix are not distorted by measurement error and allow a more appropriate application of the Campbell and Fiske criteria for evaluating the MTMM matrix. Multiple indicator models allow the definition of trait-specific method factors and, therefore, the separation of measurement error and method-specific influences in a more appropriate way. Eid and colleagues have shown how different models of CFA can be defined for different types of methods. In the case of interchangeable methods, a multilevel CFA model can be applied that allows the specification of trait-specific method effects. In contrast to the extension of the CTCU model to multiple indicators, the multilevel approach has the advantage that the number of methods (e.g., raters) can differ between targets. In the case of structurally different raters, an extension of the CTC(M − 1) model to multiple indicators can be applied. This model allows a researcher to test specific hypotheses about the generalizability of method effects across traits and methods. In the case of a combination of structurally different and interchangeable methods, a multilevel CTC(M − 1) model would be appropriate.

Michael Eid

See also Construct Validity; "Convergent and Discriminant Validation by the Multitrait–Multimethod Matrix"; MBESS; Structural Equation Modeling; Triangulation; Validity of Measurement

Further Readings

Browne, M. W. (1984). The decomposition of multitrait-multimethod matrices. British Journal of Mathematical & Statistical Psychology, 37, 1–21.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Dumenci, L. (2000). Multitrait-multimethod analysis. In S. D. Brown & H. E. A. Tinsley (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 583–611). San Diego, CA: Academic Press.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241–261.
Eid, M. (2006). Methodological approaches for analyzing multimethod data. In M. Eid & E. Diener (Eds.), Handbook of multimethod measurement in psychology (pp. 223–230). Washington, DC: American Psychological Association.
Eid, M., & Diener, E. (2006). Handbook of multimethod measurement in psychology. Washington, DC: American Psychological Association.
Eid, M., Nussbeck, F. W., Geiser, C., Cole, D. A., Gollwitzer, M., & Lischetzke, T. (2008). Structural equation modeling of multitrait-multimethod data: Different models for different types of methods. Psychological Methods, 13, 230–253.
Kenny, D. A. (1976). An empirical application of confirmatory factor analysis to the multitrait-multimethod matrix. Journal of Experimental Social Psychology, 12, 247–252.
Marsh, H. W., & Grayson, D. (1995). Latent variable models of multitrait-multimethod data. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 177–198). Thousand Oaks, CA: Sage.
Shrout, P. E., & Fiske, S. T. (Eds.). (1995). Personality research, methods, and theory: A festschrift honoring Donald W. Fiske. Hillsdale, NJ: Lawrence Erlbaum.

MULTIVALUED TREATMENT EFFECTS

The term multivalued treatment effects broadly refers to a collection of population parameters that capture the impact of a given treatment assigned to each observational unit, when this treatment status takes multiple values. In general, treatment levels may be finite or infinite as well as ordinal or cardinal, leading to a large collection of possible treatment effects to be studied in applications. When the treatment effect of interest is the mean
outcome for each treatment level, the resulting population parameter is typically called the dose–response function in the statistical literature, regardless of whether the treatment levels are finite or infinite. The analysis of multivalued treatment effects has several distinct features when compared with the analysis of binary treatment effects, including the following: (a) A comparison or control group is not always clearly defined, (b) new parameters of interest arise capturing distinct phenomena such as nonlinearities or tipping points, (c) in most cases correct statistical inferences require the joint estimation of all treatment effects (as opposed to the estimation of each treatment effect at a time), and (d) efficiency gains in statistical inferences may be obtained by exploiting known restrictions among the multivalued treatment effects. This entry discusses the treatment effect model and statistical inference procedures for multivalued treatment effects.

Treatment Effect Model and Population Parameters

A general statistical treatment effect model with multivalued treatment assignments is easily described in the context of the classical potential outcomes model. This model assumes that each unit i in a population has an underlying collection of potential outcome random variables {Yi = Yi(t) : t ∈ T}, where T denotes the collection of possible treatment assignments. The random variables Yi(t) are usually called potential outcomes because they represent the random outcome that unit i would have under treatment regime t ∈ T. For each unit i and for any two treatment levels, t1 and t2, it is always possible to define the individual treatment effect given by Yi(t1) − Yi(t2), which may or may not be a degenerate random variable. However, because units are not observed under different treatment regimes simultaneously, such comparisons are not feasible. This idea, known as the fundamental problem of causal inference, is formalized in the model by assuming that for each unit i only (Yi, Ti) is observed, where Yi = Yi(Ti) and Ti ∈ T. In words, for each unit i, only the potential outcome for treatment level Ti = t is observed while all other (counterfactual) outcomes are missing. Of course, in most applications, which treatment each unit has taken up is not random, and hence further assumptions would be needed to identify the treatment effect of interest.

A binary treatment effect model has T = {0, 1}, a finite multivalued treatment effect model has T = {0, 1, . . . , J} for some positive integer J, and a continuous treatment effect model has T = [0, 1]. (Note that the values in T are ordinal; that is, they may be seen just as normalizations of the underlying real treatment levels in a given application.) Many applications focus on a binary treatment effects model and base the analysis on the comparison of two groups, usually called treatment group (Ti = 1) and control group (Ti = 0). A multivalued treatment may be collapsed into a binary treatment, but this procedure usually would imply some important loss of information in the analysis. Important phenomena, such as nonlinearities, differential effects across treatment levels, or tipping points, cannot be captured by a binary treatment effect model.

Typical examples of multivalued treatment effects are comparisons between some characteristic of the distributions of the potential outcomes. Well-known examples are mean and quantile comparisons, although in many applications other features of these distributions may be of interest. For example, assuming, to simplify the discussion, that the random potential outcomes are equal for all units (this holds, for instance, in the context of random sampling), the mean of the potential outcome under treatment regime t ∈ T is given by μ(t) = E[Yi(t)]. The collection of these means is the so-called dose–response function. Using this estimand, it is possible to construct different multivalued treatment effects of interest, such as pairwise comparisons (e.g., μ(t2) − μ(t1)) or differences in pairwise comparisons, which would capture the idea of nonlinear treatment effects. (In the particular case of binary treatment effects, the only possible pairwise comparison is μ(1) − μ(0), which is called the average treatment effect.) Using the dose–response function, it is also possible to consider other treatment effects that arise as nonlinear transformations of μ(t), such as ratios, incremental changes, tipping points, or the maximal treatment effect μ* = max_{t ∈ T} μ(t), among many other possibilities. All these multivalued treatment effects are constructed on the basis of
the mean of the potential outcomes, but similar estimands may be considered that are based on quantiles, dispersion measures, or other characteristics of the underlying potential outcome distribution. Conducting valid hypothesis testing about these treatment effects requires in most cases the joint estimation of the underlying multivalued treatment effects.

Statistical Inference

There exists a vast theoretical literature proposing and analyzing different statistical inference procedures for multivalued treatment effects. This large literature may be characterized in terms of the key identifying assumption underlying the treatment effect model. This key assumption usually takes the form of a (local) independence or orthogonality condition, such as (a) a conditional independence assumption, which assumes that conditional on a set of observable characteristics, selection into treatment is random, or (b) an instrumental variables assumption, which assumes the existence of variables that induce exogenous changes in the treatment assignment. With the use of an identifying assumption (together with other standard model assumptions), it has been shown in the statistical and econometrics literatures that several parametric, semiparametric, and nonparametric procedures allow for optimal joint inference in the context of multivalued treatments. These results are typically obtained with the use of large sample theory and justify (asymptotically) the use of classical statistical inference procedures involving multiple treatment levels.

Matias D. Cattaneo

See also Multiple Treatment Interference; Observational Research; Propensity Score Analysis; Selection; Treatment(s)

Further Readings

Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155, 138–154.
Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, Part I: Causal models, structural models and econometric policy evaluation. In J. J. Heckman and E. E. Leamer (Eds.), Handbook of
Imai, K., & van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 99, 854–866.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47, 5–86.
Rosenbaum, P. (2002). Observational studies. New York: Springer.

MULTIVARIATE ANALYSIS OF VARIANCE (MANOVA)

Multivariate analysis of variance (MANOVA) designs are appropriate when multiple dependent variables are included in the analysis. The dependent variables should represent continuous measures (i.e., interval or ratio data). Dependent variables should be moderately correlated. If there is no correlation at all, MANOVA offers no improvement over an analysis of variance (ANOVA); if the variables are highly correlated, the same variable may be measured more than once. In many MANOVA situations, multiple independent variables, called factors, with multiple levels are included. The independent variables should be categorical (qualitative). Unlike ANOVA procedures that analyze differences across two or more groups on one dependent variable, MANOVA procedures analyze differences across two or more groups on two or more dependent variables. Investigating two or more dependent variables simultaneously is important in various disciplines, ranging from the natural and physical sciences to government and business and to the behavioral and social sciences. Many research questions cannot be answered adequately by an investigation of only one dependent variable because treatments in experimental studies are likely to affect subjects in more than one way. The focus of this entry is on the various types of MANOVA procedures and associated assumptions. The logic of MANOVA and advantages and disadvantages of MANOVA are included.

MANOVA is a special case of the general linear models. MANOVA may be represented in a basic
of econometrics (Vol. 6B, pp. 4779–4874). linear equation as Y ¼ Xβ þ ε, where Y represents
Amsterdam: North-Holland. a vector of dependent variables, X represents
a matrix of independent variables, β represents a vector of weighted regression coefficients, and ε represents a vector of error terms. Calculations for the multivariate procedures are based on matrix algebra, making hand calculations virtually impossible. For example, the null hypothesis for MANOVA states no difference among the population mean vectors. The form of the omnibus null hypothesis is written as H0: μ1 = μ2 = ⋯ = μk. It is important to remember that the means displayed in the null hypothesis represent mean vectors for the population, rather than the population means. The complexity of MANOVA calculations requires the use of statistical software for computing.

Logic of MANOVA

MANOVA procedures evaluate differences in population means on more than one dependent variable across levels of a factor. MANOVA uses a linear combination of the dependent variables to form a new dependent variable that minimizes within-group variance and maximizes between-group differences. The new variable is used in an ANOVA to compare differences among the groups. Use of the newly formed dependent variable in the analysis decreases the Type I error (error of rejecting a true null hypothesis) rate. The linear combination reveals a more complete picture of the characteristic or attribute under study. For example, a social scientist may be interested in the kinds of attitudes that people have toward the environment based on their attitudes about global warming. In such a case, analysis of only one dependent variable (attitude about global warming) is not completely representative of the attitudes that people have toward the environment. Multiple measures, such as attitude toward recycling, willingness to purchase environmentally friendly products, and willingness to conserve water and energy, will give a more holistic view of attitudes toward the environment. In other words, MANOVA analyzes the composite of several variables, rather than analyzing several variables individually.

Advantages of MANOVA Designs

MANOVA procedures control for experiment-wide error rate, whereas multiple univariate procedures increase the Type I error rate, which can lead to rejection of a true null hypothesis. For example, analysis of group differences on three dependent variables would require three univariate tests. If the alpha level is set at .05, each test has a 95% chance of not producing a Type I error. The following calculation shows how the Type I error rate is compounded across three univariate tests: (.95)(.95)(.95) = .857 and 1 − .857 = .143, or 14.3%, which is an unacceptable error rate. In addition, univariate tests do not account for the intercorrelations among variables, thus risking loss of valuable information. Furthermore, MANOVA decreases the Type II error (error of not rejecting a false null hypothesis) rate by detecting group differences that appear only through the combination of two or more dependent variables.

Disadvantages of MANOVA Designs

MANOVA procedures are more complex than univariate procedures; thus, outcomes may be ambiguous and difficult to interpret. The power of MANOVA may actually reveal statistically significant differences when multiple univariate tests may not show differences. Statistical power is the probability of rejecting the null hypothesis when the null is false (power = 1 − β). The difference in outcomes between ANOVA and MANOVA results from the overlapping of the distributions for each of the groups with the dependent variables in separate analyses. In the MANOVA procedure, the linear combination of dependent variables is used for the analysis. Finally, more assumptions are required for MANOVA than for ANOVA.

Assumptions of MANOVA

The mathematical underpinnings of inferential statistics require that certain statistical assumptions be met. Assumptions for MANOVA designs are (a) multivariate normality, (b) homoscedasticity, (c) linearity, and (d) independence and randomness.

Multivariate Normality

Observations on all dependent variables are multivariately normally distributed for each level within each group and for all linear combinations of the dependent variables. Joint normality in
more than two dimensions is difficult to assess; however, tests for univariate normality on each of the variables are recommended. Univariate normality, a prerequisite to multivariate normality, can be assessed graphically and statistically. For example, a quantile–quantile plot resembling a straight line suggests normality. While univariate normality of each variable does not mean that the data are multivariately normal, such tests are useful in evaluating the assumption. MANOVA is insensitive (robust) to moderate departures from normality for large data sets and in situations in which the violations are due to skewed data rather than outliers.

A scatterplot for pairs of variables for each group can reveal data points located far from the pattern produced by the other observations. Mahalanobis distance (distance of each case from the centroid of all the remaining cases) is used to detect multivariate outliers. Significance of Mahalanobis distance is evaluated as a chi-square statistic. A case may be considered an outlier if its Mahalanobis distance is statistically significant at the p < .0001 level. Other graphical techniques, such as box plots and stem-and-leaf plots, may be used to assess univariate normality. Two additional descriptive statistics related to normality are skewness and kurtosis. Kurtosis values may be positive or negative to indicate a high peak or flatness near the mean, respectively. Values within ± 2 standard deviations from the mean or ± 3 standard deviations from the mean are generally considered within the normal range. A normal distribution has zero kurtosis. In addition to graphical techniques, the Shapiro–Wilk W statistic and the Kolmogorov–Smirnov statistic with Lilliefors significance levels are used to assess normality. Statistically significant W or Kolmogorov–Smirnov test results indicate that the distribution is nonnormal.

Homoscedasticity

The variance and covariance matrices for all dependent variables across groups are assumed to be equal. George Box's M statistic tests the null hypothesis of equality of the observed covariance matrices for the dependent variables for each group. A nonsignificant F value with the alpha level set at .001 from Box's M indicates equality of the covariance matrices. MANOVA procedures can tolerate moderate departures from equal variance–covariance matrices when sample sizes are similar.

Linearity
variable. Randomness means that the sample was randomly selected from the population of interest.

Research Questions for MANOVA Designs

Multivariate analyses cover a broad range of statistical procedures. Common questions for which MANOVA procedures are appropriate are as follows: What are the mean differences between two levels of one independent variable for multiple dependent variables? What are the mean differences between or among multiple levels of one independent variable on multiple dependent variables? What are the effects of multiple independent variables on multiple dependent variables? What are the interactions among independent variables on one dependent variable or on a combination of dependent variables? What are the mean differences between or among groups when repeated measures are used in a MANOVA design? What are the effects of multiple levels of an independent variable on multiple dependent variables when effects of concomitant variables are removed from the analysis? What is the amount of shared variance among a set of variables when variables are grouped around a common theme? What are the relationships among variables that may be useful for predicting group membership? These sample questions provide a general sense of the broad range of questions that may be answered with MANOVA procedures.

Types of MANOVA Designs

Hotelling's T²

Problems for MANOVA can be structured in different ways. For example, a researcher may wish to examine the difference between males and females on number of vehicle accidents in the past 5 years and years of driving experience. In this case, the researcher has one dichotomous independent variable (gender) and two dependent variables (number of accidents and years of driving experience). The problem is to determine the difference between the weighted sample mean vectors (centroids) of a multivariate data set. This form of MANOVA is known as the multivariate analog to the Student's t test, and it is referred to as Hotelling's T² statistic, named after Harold Hotelling for his work on the multivariate T² distribution. Calculation of T² is based on the combination of two Student's t ratios and their pooled estimate of correlation. The resulting T² is converted into an F statistic and distributed as an F distribution.

One-Way MANOVA

Another variation of the MANOVA procedure is useful for investigating the effects of one multilevel independent variable (factor) on two or more dependent variables. An investigation of differences in mathematics achievement and motivation for students assigned to three different teaching methods is such a situation. For this problem, the researcher has one multilevel factor (teaching method with three levels) and two dependent variables (mathematics test scores and scores on a motivation scale). The objective is to determine the differences among the mean vectors for groups on the dependent variables, as well as differences among groups for the linear combinations of the dependent variables. This form of MANOVA extends Hotelling's T² to more than two groups; it is known as the one-way MANOVA, and it can be thought of as the MANOVA analog of the one-way F situation. Results of the MANOVA produce four multivariate test statistics: Pillai's trace, Wilks's lambda (Λ), Hotelling's trace, and Roy's largest root. Usually results will not differ for the first three tests when applied to a two-group study; however, for studies involving more than two groups, tests may yield different results. The Wilks's Λ is the test statistic reported most often in publications. The value of Wilks's Λ ranges from 0 to 1. A small value of Λ indicates statistically significant differences among the groups or treatment effects. Wilks's Λ, the associated F value, hypothesis and error degrees of freedom, and the p value are usually reported. A significant F value is one that is greater than the critical value of F at predetermined degrees of freedom for a preset level of significance. As a general rule, tables for critical values of F and accompanying degrees of freedom are published as appendixes in many research and statistics books.
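Wilks's Λ described above can be computed directly as a ratio of determinants, Λ = |E|/|T|, where E is the within-groups (error) sum-of-squares-and-cross-products (SSCP) matrix and T is the total SSCP matrix. The following is a minimal NumPy sketch on synthetic data (three groups, two dependent variables; the group means, sample sizes, and seed are illustrative, not taken from this entry):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-way MANOVA data: 3 groups, 2 dependent variables,
# 20 cases per group, with group mean differences built in.
groups = [rng.normal(loc=mu, scale=1.0, size=(20, 2))
          for mu in ([0.0, 0.0], [1.0, 0.5], [2.0, 1.0])]

# Within-groups (error) SSCP matrix E: deviations from each group mean.
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

# Total SSCP matrix T: deviations from the grand mean of all cases.
data = np.vstack(groups)
dev = data - data.mean(axis=0)
T = dev.T @ dev

# Wilks's lambda: values near 0 suggest separation among the groups.
wilks_lambda = np.linalg.det(E) / np.linalg.det(T)
print(round(wilks_lambda, 3))
```

In practice the statistic is converted to an approximate F value for significance testing, as the entry describes; statistical packages report that conversion automatically.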
Factorial MANOVA

Another common variation of multivariate procedures is known as the factorial MANOVA. In this design, the effects of multiple factors on multiple dependent variables are examined. For example, the effects of geographic location and level of education on job satisfaction and attitudes toward work may be investigated via a factorial MANOVA. Geographic location with four levels and level of education with two levels are the factors. Geographic location could be coded as 1 = south, 2 = west, 3 = north, and 4 = east; level of education could be coded as 1 = college graduate and 0 = not college graduate. The MANOVA procedure will produce the main effects for each of the factors, as well as the interaction between the factors. For this example, three new dependent variables will be created to maximize group differences: one dependent variable to maximize the differences in geographic location and the linear combination of job satisfaction and attitudes toward work; one dependent variable to maximize the differences in education and the linear combination of job satisfaction and attitudes toward work; and another dependent variable to maximize separation among the groups for the interaction between geographic location and level of education. As in the previous designs, the factorial MANOVA produces Pillai's trace, Wilks's Λ, Hotelling's trace, and Roy's largest root. The multiple levels in factorial designs may produce slightly different values for the test statistics, even though these differences do not usually affect statistical significance. Wilks's Λ, the associated F statistic, degrees of freedom, and the p value are usually reported in publications.

K Group MANOVA

MANOVA designs with three or more groups are known as K group MANOVAs. Like other multivariate designs, the null hypothesis tests whether differences between the mean vectors of K groups on the combination of dependent variables are due to chance. As with the factorial design, the K group MANOVA produces the main effects for each factor, as well as the interactions between factors. The same statistical tests and reporting requirements apply to the K group situation as to the factorial MANOVA.

Doubly Multivariate Designs

The purpose of doubly multivariate studies is to test for statistically significant group differences over time across a set of response variables measured at each time while accounting for the correlation among responses. A design would be considered doubly multivariate when multiple conceptually dissimilar dependent variables are measured across multiple time periods, as in a repeated measures study. For example, a study to compare problem-solving strategies of intrinsically and extrinsically motivated learners in different test situations could involve two dependent measures (score on a mathematics test and score on a reading test) taken at three different times (before a unit of instruction on problem solving, immediately following the instruction, and 6 weeks after the instruction) for each participant. Type of learner and test situation would be between-subjects factors, and time would be a within-subjects factor.

Multivariate Analysis of Covariance

A blend of analysis of covariance and MANOVA, called multivariate analysis of covariance (MANCOVA), allows the researcher to control for the effects of one or more covariates. MANCOVA allows the researcher to control for sources of variation within multiple variables. In the earlier example on attitudes toward the environment, the effects of concomitant variables such as number of people living in a household, age of head of household, gender, annual income, and education level can be statistically removed from the analysis with MANCOVA.

Factor Analysis

MANOVA is useful as a data reduction procedure to condense a large number of variables into a smaller, more definitive set of hypothetical constructs. This procedure is known as factor analysis. Factor analysis is especially useful in survey research to reduce a large number of variables (survey items) to a smaller number of hypothetical variables by identifying variables that group or cluster together. For example, two or more dependent variables in a data set may measure the same entity or construct. If this is the case, the variables may be combined to form a new hypothetical variable. For example, a survey of students' attitudes toward work may include 40 related items, whereas a factor
Multivariate Normal Distribution
Note that the covariance matrix Σ is symmetric and positive definite. The (i, j)th element is given by \sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)], and \sigma_i^2 \equiv \sigma_{ii} = V(X_i).

An important special case of the multivariate normal distribution is the bivariate normal. If (X_1, X_2)' \sim N_2(\mu, \Sigma), where

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \quad \rho \equiv \mathrm{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1\sigma_2},

then the bivariate density is given by

f_{X_1, X_2}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2}{2(1-\rho^2)} \right\}.

Let X = (X_1, X_2)'; the joint density can be rewritten in matrix notation as

f_X(x) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu) \right\}.

Multivariate Normal Density Contours

The contour levels of f_X(x), that is, the set of points in R^n for which f_X(x) is constant, satisfy

(x-\mu)'\Sigma^{-1}(x-\mu) = c^2.

These surfaces are n-dimensional ellipsoids centered at μ, whose axes of symmetry are given by the principal components (the eigenvectors) of Σ. Specifically, the length of the ellipsoid along the ith axis is c\sqrt{\lambda_i}, where \lambda_i is the ith eigenvalue associated with the eigenvector e_i (recall that eigenvectors e_i and eigenvalues \lambda_i are solutions to \Sigma e_i = \lambda_i e_i for i = 1, \ldots, n).

Some Basic Properties

The following list presents some important properties involving the multivariate normal distribution.

1. The first two moments of a multivariate normal distribution, namely μ and Σ, completely characterize the distribution. In other words, if X and Y are both multivariate normal with the same first two moments, then they are similarly distributed.

2. Let X = (X_1, \ldots, X_n)' be a multivariate normal random vector with mean μ and covariance matrix Σ, and let \alpha' = (\alpha_1, \ldots, \alpha_n) \in R^n \setminus \{0\}. The linear combination Y = \alpha'X = \alpha_1 X_1 + \cdots + \alpha_n X_n is normal with mean E(Y) = \alpha'\mu and variance

V(Y) = \alpha'\Sigma\alpha = \sum_{i=1}^{n} \alpha_i^2 V(X_i) + \sum_{i \neq j} \sum \alpha_i \alpha_j \mathrm{Cov}(X_i, X_j).

Also, if \alpha'X is normal with mean \alpha'\mu and variance \alpha'\Sigma\alpha for all possible α, then X must be a multivariate normal random vector with mean μ and covariance matrix Σ (X \sim N_n(\mu, \Sigma)).

3. More generally, let X = (X_1, \ldots, X_n)' be a multivariate normal random vector with mean μ and covariance matrix Σ, and let A \in R^{m \times n} be a full rank matrix with m ≤ n; the set of linear combinations Y = (Y_1, \ldots, Y_m)' = AX is multivariate normally distributed with mean A\mu and covariance matrix A\Sigma A'. Also, if Y = AX + b, where b is an m × 1 vector of constants, then Y is multivariate normally distributed with mean A\mu + b and covariance matrix A\Sigma A'.

4. If Y_i and Y_j are jointly normally distributed, then they are independent if and only if \mathrm{Cov}(Y_i, Y_j) = 0. Note that it is not necessarily true that uncorrelated univariate normal random variables are independent. Indeed, two random variables that are marginally normally distributed may fail to be jointly normally distributed.

5. Let Z = (Z_1, \ldots, Z_n)', where Z_i \sim i.i.d. N(0, 1) (where i.i.d. = independent and identically distributed). Z is said to be standard multivariate normal, denoted Z \sim N(0, I_n), and it can be shown that E[Z] = 0 and V(Z) = I_n, where I_n denotes the unit matrix of order n. The joint density of vector Z is given by

f_Z(z) = \prod_{i=1}^{n} f_{Z_i}(z_i) = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} z'z \right).

The density f_Z(z) is symmetric and unimodal with mode equal to zero.
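The factored form of the standard multivariate normal density, f_Z(z) = \prod_{i=1}^{n} f_{Z_i}(z_i) = (2\pi)^{-n/2}\exp(-z'z/2), can be checked numerically. A small sketch using SciPy (the evaluation point z is arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# An arbitrary point z in R^3 at which to evaluate the density.
z = np.array([0.5, -1.2, 0.3])
n = z.size

# Joint density of Z ~ N(0, I_n) ...
joint = multivariate_normal.pdf(z, mean=np.zeros(n), cov=np.eye(n))

# ... equals the product of n univariate standard normal densities ...
product_form = np.prod(norm.pdf(z))

# ... and also matches the closed form (2*pi)^(-n/2) * exp(-z'z / 2).
closed_form = (2 * np.pi) ** (-n / 2) * np.exp(-(z @ z) / 2)

print(joint, product_form, closed_form)
```

All three values agree up to floating-point rounding, which is exactly the factorization the display equation asserts.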
The contour levels of f_Z(z), that is, the set of points in R^n for which f_Z(z) is constant, are defined by

z'z = \sum_{i=1}^{n} z_i^2 = c^2,

where c ≥ 0. The contour levels of f_Z(z) are concentric circles in R^n centered at zero.

6. If Y_1, \ldots, Y_n \sim \text{ind.}\ N(\mu_i, \sigma_i^2), then \sigma_{ij} = 0 for all i ≠ j, and it follows that Σ is a diagonal matrix. Thus, if

\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2),

then

\Sigma^{-1} = \mathrm{diag}(1/\sigma_1^2, 1/\sigma_2^2, \ldots, 1/\sigma_n^2),

so that

(y-\mu)'\Sigma^{-1}(y-\mu) = \sum_{j=1}^{n} \left( \frac{y_j - \mu_j}{\sigma_j} \right)^2.

Note also that, as Σ is diagonal, we have

|\Sigma| = \sigma_1^2 \sigma_2^2 \cdots \sigma_n^2.

The joint density becomes

f_Y(y) = \prod_{i=1}^{n} f_{Y_i}(y_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left( \frac{y_i - \mu_i}{\sigma_i} \right)^2 \right\}.

Thus, f_Y(y) reduces to the product of univariate normal densities.

The moment generating function of Z is obtained as follows:

M_Z(t) = E\left[ e^{t'Z} \right] = (2\pi)^{-n/2} \int_{R^n} \exp\{t'z - z'z/2\}\, dz
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left( t_i z_i - \frac{1}{2} z_i^2 \right) dz_i
= \prod_{i=1}^{n} M_{Z_i}(t_i)
= E\left[ e^{t_1 Z_1} \right] E\left[ e^{t_2 Z_2} \right] \cdots E\left[ e^{t_n Z_n} \right]
= \exp\left( \frac{1}{2} t_1^2 \right) \exp\left( \frac{1}{2} t_2^2 \right) \cdots \exp\left( \frac{1}{2} t_n^2 \right)
= \exp\left( \frac{1}{2} \sum_{i=1}^{n} t_i^2 \right)
= \exp\left( \frac{1}{2} t't \right).

To obtain the moment generating function of the generalized location-scale family, let X = \mu + \Sigma^{1/2} Z, where \Sigma^{1/2}\Sigma^{1/2\prime} = \Sigma (\Sigma^{1/2} is obtained via the Cholesky decomposition of Σ), so that X \sim N(\mu, \Sigma). Hence,

M_X(t) = E\left[ e^{t'X} \right]
= E\left[ \exp\left( t'\mu + t'\Sigma^{1/2} Z \right) \right]
= e^{t'\mu}\, E\left[ \exp\left( t'\Sigma^{1/2} Z \right) \right]
= e^{t'\mu}\, M_Z\left( \Sigma^{1/2\prime} t \right)
= e^{t'\mu} \exp\left\{ \frac{1}{2} \left( \Sigma^{1/2\prime} t \right)' \left( \Sigma^{1/2\prime} t \right) \right\}
= \exp\left( t'\mu + \frac{1}{2} t'\Sigma t \right).
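The location-scale construction used in the derivation above, X = μ + Σ^{1/2}Z with Σ^{1/2} obtained from the Cholesky decomposition, can be verified by simulation: the sample mean and covariance of the generated draws should approach μ and Σ. A brief NumPy sketch (the particular μ, Σ, and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Target mean and (symmetric, positive definite) covariance matrix.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Cholesky factor L plays the role of Sigma^{1/2}: L @ L.T == Sigma.
L = np.linalg.cholesky(Sigma)

# Draw standard normal vectors Z and map each to x = mu + L @ z.
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T

# Sample moments should approximate mu and Sigma.
print(X.mean(axis=0).round(2))
print(np.cov(X, rowvar=False).round(2))
```

With 100,000 draws the sample mean and covariance agree with μ and Σ to about two decimal places, as the law of large numbers predicts.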
each of the n components of vector Z is independent and identically distributed standard univariate normal, for which simulation methods are well known. Let X = \mu + \Sigma^{1/2} Z, where \Sigma^{1/2}\Sigma^{1/2\prime} = \Sigma, so that X \sim N(\mu, \Sigma). Realizations of X can be obtained from the generated samples z as \mu + \Sigma^{1/2} z, where \Sigma^{1/2} can be computed via the Cholesky decomposition.

\mathrm{Cov}\left( X, Y_{(2)} \right) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0.

• Each of the Y_i's is univariate normal.
• All possible subvectors are multivariate normal.
• All marginal distributions are multivariate normal.

Moreover, if \Sigma_{12} = 0 (Y_{(1)} and Y_{(2)} are uncorrelated), then Y_{(1)} and Y_{(2)} are statistically independent. Recall that the covariance of two independent random variables is always zero, but the opposite need not be true. Thus, Y_{(1)} and Y_{(2)} are statistically independent if and only if \Sigma_{12} = \Sigma_{12}' = 0. This implies that X and Y_{(2)} are independent, and we can write
\begin{pmatrix} X \\ Y_{(2)} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ \mu_{(2)} \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right).

As

Y_{(1)} = X + \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( Y_{(2)} - \mu_{(2)} \right),

conditional on Y_{(2)} = y_{(2)}, Y_{(1)} is normally distributed with mean

E(X) + \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right) = \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right)

(since E(X) = 0) and variance V(X) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

In the bivariate normal case, we replace Y_{(1)} and Y_{(2)} by Y_1 and Y_2, and the conditional distribution of Y_1 given Y_2 is normal as follows:

Y_1 \mid (Y_2 = y_2) \sim N\left( \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_2 - \mu_2),\; \sigma_1^2\left( 1 - \rho^2 \right) \right).

Another special case is with Y_{(1)} = Y_1 and Y_{(2)} = (Y_2, \ldots, Y_n)' so that

Y_1 \mid \left( Y_{(2)} = y_{(2)} \right) \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).

The mean of this particular conditional distribution, \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_{(2)} - \mu_{(2)}), is referred to as a multiple regression function of Y_1 on Y_2, \ldots, Y_n, with the regression coefficient vector being \beta = \Sigma_{12}\Sigma_{22}^{-1}.

The elements of the conditional covariance matrix are denoted by \sigma_{ij|(p+1, \ldots, n)}, where p is the dimension of the vector Y_{(1)}. The partial correlation of Y_i and Y_j, given Y_{(2)}, is defined by

\rho_{ij|(p+1, \ldots, n)} = \frac{\sigma_{ij|(p+1, \ldots, n)}}{\sqrt{\sigma_{ii|(p+1, \ldots, n)}\,\sigma_{jj|(p+1, \ldots, n)}}}.

Testing for Multivariate Normality

If a vector is multivariate normally distributed, each individual component follows a univariate Gaussian distribution. However, univariate normality of each variable in a set is not sufficient for multivariate normality. Therefore, to test the adequacy of the multivariate normality assumption, several authors have presented a battery of tests consistent with the multivariate framework. For this purpose, Kanti V. Mardia proposed multivariate extensions of the skewness and kurtosis statistics, which are extensively used in the univariate framework to test for normality.

Let \{X_i\}_{i=1}^{m} = \{(X_{1i}, \ldots, X_{ni})'\}_{i=1}^{m} be a random sample of size m from an n-variate distribution, \bar{X} the vector of sample means, and S_X the sample covariance matrix. The n-variate skewness and kurtosis statistics, denoted respectively by b_{1,n} and b_{2,n}, are defined as follows:

b_{1,n} = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ \left( X_i - \bar{X} \right)' S_X^{-1} \left( X_j - \bar{X} \right) \right]^3

b_{2,n} = \frac{1}{m} \sum_{i=1}^{m} \left[ \left( X_i - \bar{X} \right)' S_X^{-1} \left( X_i - \bar{X} \right) \right]^2.
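Mardia's statistics b_{1,n} and b_{2,n} defined above translate directly into matrix code. A minimal NumPy sketch on simulated multivariate normal data (the sample size, dimension, and seed are illustrative; the sample covariance here uses the 1/m normalization so that it matches the formulas above):

```python
import numpy as np

rng = np.random.default_rng(1)

# m observations from an n-variate standard normal distribution.
m, n = 500, 3
X = rng.standard_normal((m, n))

xbar = X.mean(axis=0)
C = X - xbar                   # centered data, one row per case
S = C.T @ C / m                # sample covariance matrix S_X (1/m form)
S_inv = np.linalg.inv(S)

# D[i, j] = (X_i - xbar)' S^{-1} (X_j - xbar)
D = C @ S_inv @ C.T

b1 = (D ** 3).sum() / m ** 2   # multivariate skewness b_{1,n}
b2 = (np.diag(D) ** 2).mean()  # multivariate kurtosis b_{2,n}

print(round(b1, 2), round(b2, 2))
```

For multivariate normal data, b_{1,n} should be close to 0 and b_{2,n} close to n(n + 2) (15 for n = 3); Mardia derived asymptotic reference distributions that turn these statistics into formal tests.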
The Gaussian Copula Function

Recently in multivariate modeling, much attention has been paid to copula functions. A copula is a function that links an n-dimensional distribution function to its one-dimensional margins and is itself a continuous distribution function characterizing the dependence structure of the model. Sklar's theorem states that under appropriate conditions, the joint density can be written as a product of the marginal densities and the copula density. Several copula families are available that can incorporate the relationships between random variables. Among these families, the Gaussian copula encodes dependence in precisely the same way as the multivariate normal distribution does, using only pairwise correlations among the variables. However, it does so for variables with arbitrary margins. A multivariate normal distribution arises whenever univariate normal margins are linked through a Gaussian copula. The Gaussian copula function is defined by

C(u_1, u_2, \ldots, u_n) = \Phi_{\rho}^{n}\left( \Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_n) \right),

where \Phi_{\rho}^{n} denotes the joint distribution function of the n-variate standard normal distribution with linear correlation matrix ρ, and \Phi^{-1} denotes the inverse of the univariate standard normal distribution function. In the bivariate case, the copula expression can be written as

C(u, v) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right) dx\, dy,

where ρ is the usual linear correlation coefficient of the corresponding bivariate normal distribution.

Multiple Linear Regression and Sampling Distribution

In this section, the multiple regression model and the associated sampling distribution are presented as an illustration of the usefulness of the multivariate normal distribution in statistics.

Consider the following multiple linear regression model:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_q X_q + \varepsilon,

where Y is the response variable, (X_1, X_2, \ldots, X_q)' is a vector representing a set of q explanatory variables, and ε is the error term. Note that the simple linear regression model is a special case with q = 1.

Suppose that we have n observations on Y and on each of the explanatory variables, that is,

Y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_q X_{1q} + \varepsilon_1
Y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_q X_{2q} + \varepsilon_2
\vdots
Y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_q X_{nq} + \varepsilon_n,

where E(\varepsilon_i) = 0, \mathrm{var}(\varepsilon_i) = \sigma^2, and \mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0 for i ≠ j.

We can rewrite this system as

\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1q} \\ 1 & X_{21} & X_{22} & \cdots & X_{2q} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nq} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}

or else Y = Xβ + ε, where E(ε) = 0 and \mathrm{cov}(\varepsilon) = \sigma^2 I_n.

Note that in the previous expression, ε is a multivariate normal random vector whereas Xβ is a vector of constants. Thus, Y is a linear combination of a multivariate normally distributed vector. It follows that Y is also multivariate normal with mean E(Y) = Xβ and covariance \mathrm{cov}(Y) = \sigma^2 I_n.

The goal of an analysis of data of this form is to estimate the regression parameter β. The least
squares estimate of β is found by minimizing the sum of squared deviations

\sum_{j=1}^{n} \left[ Y_j - \beta_0 - \left( \beta_1 X_{j1} + \beta_2 X_{j2} + \cdots + \beta_q X_{jq} \right) \right]^2.

Chiraz Labidi

See also Central Limit Theorem; Coefficients of Correlation, Alienation, and Determination; Copula Functions; Kurtosis; Multiple Regression; Normal Distribution; Normality Assumption; Partial Correlation; Sampling Distributions
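The least squares criterion above is minimized in closed form by the normal equations, \hat{\beta} = (X'X)^{-1}X'Y. A short NumPy sketch on simulated data (the design, coefficients, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data for Y = X beta + eps with q = 2 explanatory variables
# plus an intercept column of 1s, and standard normal errors.
n_obs, q = 200, 2
X = np.column_stack([np.ones(n_obs), rng.standard_normal((n_obs, q))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.standard_normal(n_obs)

# Least squares estimate via the normal equations (X'X) beta_hat = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The fitted coefficients minimize the residual sum of squares.
residual_ss = ((Y - X @ beta_hat) ** 2).sum()
print(beta_hat.round(2))
```

The recovered coefficients sit close to the generating values, and no other coefficient vector, including the true one, yields a smaller residual sum of squares on this sample.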
N
NARRATIVE RESEARCH

Narrative research aims to explore and conceptualize human experience as it is represented in textual form. Aiming for an in-depth exploration of the meanings people assign to their experiences, narrative researchers work with small samples of participants to obtain rich and free-ranging discourse. The emphasis is on storied experience. Generally, this takes the form of interviewing people around the topic of interest, but it might also involve the analysis of written documents. Narrative research as a mode of inquiry is used by researchers from a wide variety of disciplines, which include anthropology, communication studies, cultural studies, economics, education, history, linguistics, medicine, nursing, psychology, social work, and sociology. It encompasses a range of research approaches including ethnography, phenomenology, grounded theory, narratology, action research, and literary analysis, as well as such interpretive stances as feminism, social constructionism, symbolic interactionism, and psychoanalysis. This entry discusses several aspects of narrative research, including its epistemological grounding, procedures, analysis, products, and advantages and disadvantages.

Epistemological Grounding

The epistemological grounding for narrative research is on a continuum of postmodern philosophical ideas in that there is a respect for the relativity and multiplicity of truth in regard to the human sciences. Narrative researchers rely on the epistemological arguments of such philosophers as Paul Ricoeur, Martin Heidegger, Edmund Husserl, Wilhelm Dilthey, Ludwig Wittgenstein, Mikhail Bakhtin, Jean-François Lyotard, and Hans-Georg Gadamer. Although narrative researchers differ in their view of the possibility of an objectively conceived "reality," most agree with Donald Spence's distinction between narrative and historical truth. Factuality is of less interest than how events are understood and organized, and all knowledge is presumed to be socially constructed.

Ricoeur, in his seminal work Time and Narrative, argues that time is organized and experienced narratively; narratives bring order and meaning to the constantly changing flux. In its simplest form, our experience is internally ordered as "this happened, then that happened" with some (often causal) connecting link in between. Narrative is also central to how we conceive of ourselves; we create stories of ourselves to connect our actions, mark our identity, and distinguish ourselves from others.

Questions about how people construct themselves and others in various contexts, under various conditions, are the focus of narrative research. Narrative research paradigms, in contrast to hypothesis-testing ones, have as their aims describing and understanding rather than measuring and predicting, focusing on meaning rather than causation and frequency, interpretation rather than statistical analysis, and recognizing the importance of
869
language and discourse rather than reduction to numerical representation. These approaches are holistic rather than atomistic, concern themselves with particularity rather than universals, are interested in the cultural context rather than trying to be context-free, and give overarching significance to subjectivity rather than questing for some kind of objectivity.

Narrative research orients itself toward understanding human complexity, especially in those cases where the many variables that contribute to human life cannot be controlled. Narrative research aims to take into account—and interpretively account for—the multiple perspectives of both the researched and researcher.

Jerome Bruner has most championed the legitimization of what he calls "narrative modes of knowing," which privileges the particulars of lived experience rather than constructs about variables and classes. It aims for the understanding of lives in context rather than through a prefigured and narrowing lens. Meaning is not inherent in an act or experience but is constructed through social discourse. Meaning is generated by the linkages the participant makes between aspects of the life he or she is living and by the explicit linkages the researcher makes between this understanding and interpretation, which is meaning constructed at another level of analysis.

Life Is a Story

One major presupposition of narrative research is that humans experience their lives in emplotted forms resembling stories or at least communicate about their experiences in this way. People use narrative as a form of constructing their views of the world; time itself is constructed narratively. Important events are represented as taking place through time, having roots in the past, and extending in their implications into the future. Life narratives are also contextual in that persons act within situational contexts that are both immediate and more broadly societal. The focus of research is on what the individuals think they are doing and why they think they are doing so. Behavior, then, is always understood in the individual's context, however he or she might construct it. Thus, narratives can be examined for personal meanings, cultural meanings, and the interaction between these.

Narrative researchers invite participants to describe in detail—tell the story of—either a particular event or a significant aspect or time of life (e.g., a turning point), or they ask participants to narrate an entire life story. Narration of experience, whether of specific events or entire life histories, involves the subjectivity of the actor, with attendant wishes, conflicts, goals, opinions, emotions, worldviews, and morals, all of which are open to the gaze of the researcher. Such narratives also either implicitly or explicitly involve settings that include those others who are directly involved in the events being related and also involve all those relationships that have influenced the narrator in ever-widening social circles. The person is assumed to be speaking from a specific position in culture and in historical time. Some of this positionality is reflected in the use of language and concepts with which a person understands her or his life. Other aspects of context are made explicit as the researcher is mindful of the person's experience of herself or himself in terms of gender, race, culture, age, social class, sexual orientation, nationality, etc. Participants are viewed as unique individuals with particularity in terms of social location; a person is not viewed as representative of some universal and interchangeable, randomly selected "subject."

People, however, do not "have" stories of their lives; they create them for the circumstance in which the story will be told. No two interviewers will obtain exactly the same story from an individual interviewee. Therefore, a thoroughly reflexive analysis of the parameters and influences on the interview situation replaces concern with reliability.

A narrative can be defined as a story of a sequence of events. Narratives are organized so as to place meanings retrospectively on events, with events described in such a way as to express the meanings the narrator wishes to convey. Narrative is a way of understanding one's own (and others') action, of organizing personal experience, both internal and external, into a meaningful whole. This involves attributing agency to the characters in the narrative and inferring causal links between the events. In the classic formulation, a narrative is an account with three components: a beginning, a middle, and an end. William Labov depicts all narratives as having clauses that
orient the reader to the story, tell about the events, or evaluate the story—that is, instruct the listener or reader as to how the story is to be understood. The evaluation of events is of primary interest to the narrative researcher because this represents the ways in which the narrator constructs a meaning (or set of meanings) within the narrative. Such meanings, however, are not viewed to be either singular or static. Some narrative theorists have argued that the process of creating an autobiographical narrative is itself transforming of self because the self that is fashioned in the present extends into the future. Thus, narrative research is viewed to be investigating a self that is alive and evolving, a self that can shift meanings, rather than a fixed entity.

Narrative researchers might also consider the ways in which the act of narration is performative. Telling a story constructs a self and might be used to accomplish a social purpose such as defending the self or entertaining someone. Thus, the focus is not only on the content of what is communicated in the narrative but also on how the narrator constructs the story and the social locations from which the narrator speaks. Society and culture also enable and constrain certain kinds of stories; meaning making is always embedded in the concepts that are culturally available at a particular time, and these might be of interest in a narrative research project. Narrative researchers, then, attend to the myriad versions of self, reality, and experience that the storyteller produces through the telling.

Procedures

Narrative research begins with a conceptual question derived from existing knowledge and a plan to explore this question through the narratives of people whose experience might illuminate the question. Most narrative research involves personal interviews, most often individual, but sometimes in groups. Some narrative researchers might (also) use personal documents such as journals, diaries, memoirs, or films as bases for their analyses. Narrative research uses whatever storied materials are available or can be produced from the kinds of people who might have personal knowledge and experiences to bring to bear on the research question.

In interview-based designs, which are the most widespread form of narrative research, participants who fit into the subgroup of interest are invited to be interviewed at length (generally 1–4 hours). Interviews are recorded and then transcribed. The narrative researcher creates "experience-near" questions related to the conceptual question that might be used to encourage participants to tell about their experiences. This might be a request for a full life story or it might be a question about a particular aspect of life experience such as life transitions, important relationships, or responses to disruptive life events.

Narrative research meticulously attends to the process of the interview, which is organized in as unstructured a way as possible. The narrative researcher endeavors to orient the participant to the question of interest in the research and then intervene only to encourage the participant to continue the narration or to clarify what seems confusing to the researcher. Inviting stories, the interviewer asks the participant to detail his or her experiences in rich and specific narration. The interviewer takes an empathic stance toward the interviewees, trying to understand their experience of self and world from their point of view. Elliott Mishler, however, points out that no matter how much the interviewer or researcher attempts to put aside his or her own biases or associations to the interview content, the researcher has an impact on what is told, and this must be acknowledged and reflected on. Because such interviews usually elicit highly personal material, confidentiality and respect for the interviewee must be assured. The ethics of the interview are carefully considered in advance, during the interview itself, and in preparation of the research report.

Narrative research questions tend to focus on individual, developmental, and social processes that reflect how experience is constructed both internally and externally. Addressing questions that cannot be answered definitively, narrative research embraces multiple interpretations rather than aiming to develop a single truth. Rooted in a postmodern epistemology, narrative approaches to research respect the relativity of knowing—the meanings of the participant filtered through the mind of the researcher with all its assumptions and a priori meanings. Knowledge is presumed to be constructed rather than discovered and is assumed
to be localized, perspectival, and occurring within intersubjective relationships to both participants and readers. "Method" then becomes not a set of procedures and techniques but ways of thinking about inquiry, modes of exploring questions, and creative approaches to offering one's constructed findings to the scholarly community. All communication is through language that is understood to be always ambiguous and open to interpretation. Thus, the analytic framework of narrative research is in hermeneutics, which is the science of imprecise and always shifting meanings.

Analysis

The analysis of narrative research texts is primarily aimed at inductively understanding the meanings of the participant and organizing them at some more conceptual level of understanding. This might involve a close reading of an individual's interview texts, which includes coding for particular themes or extracting significant passages for discussion in the report. The researcher looks inductively for patterns, and the kinds of patterns recognized might reflect the researcher's prior knowledge about the phenomena. The process of analysis is one of piecing together data, making the invisible apparent, deciding what is significant and what is insignificant, and linking seemingly unrelated facets of experience together. Analysis is a creative process of organizing data so the analytic scheme will emerge. Texts are read multiple times in what Friedrich Schleiermacher termed a "hermeneutic circle," a process in which the whole illuminates the parts, which in turn offer a fuller and more complex picture of the whole, which then leads to a better understanding of the parts, and so on.

Narrative researchers focus first on the voices within each narrative, attending to the layering of voices (subject positions) and their interaction, as well as the continuities, ambiguities, and disjunctions expressed. The researcher pays attention to both the content of the narration ("the told") and the structure of the narration ("the telling"). Narrative analysts might also pay attention to what is unsaid or unsayable by looking at the structure of the narrative discourse and markers of omissions. After each participant's story is understood as well as possible, cross-case analysis might be performed to discover patterns across individual narrative interview texts or to explore what might create differences between people in their narrated experiences.

There are many approaches to analysis, with some researchers focusing on meanings through content and others deconstructing the use and structure of language as another set of markers to meanings. In some cases, researchers aim to depict the layers of experience detailed in the narratives, preserving the point of view, or voice, of the interviewee. At other times, researchers might try to go beyond what is said and regard the narrated text as a form of disguise; this is especially true when what is sought are unconscious processes or culturally determined aspects of experience that are embedded rather than conscious.

The linguistic emphasis in some branches of narrative inquiry considers the ways in which language organizes both thought and experience. Other researchers recognize the shaping function of language but treat language as transparent as they focus more on the content of meanings that might be created out of life events.

The purpose of narrative research is to produce a deep understanding of dynamic processes. No effort is made to generalize about populations. Thus, statistics, which aims to represent populations and the distribution of variables within them, has little or no place in narrative research. Rather, knowledge is viewed to be localized in the analysis of the particular people studied, and generalization about processes that might apply to other populations is left to the reader. That is, in a report about the challenges of immigration in a particular population, the reader might find details of the interactive processes that might illuminate the struggles of another population in a different locale—or even people confronting other life transitions.

Narrative research avoids having a predetermined theory about the person that the interview or the life story is expected to support. Although no one is entirely free of preconceived ideas and expectations, narrative researchers try to come to their narrators as listeners open to the surprising variation in their social world and private lives. Although narrative researchers try to be as knowledgeable as possible about the themes that they
are studying to be maximally sensitive to nuances of meaning, they are on guard against inflicting meaning in the service of their own ends.

Products

Reports of narrative research privilege the words of the participants, in what Clifford Geertz calls "thick description," and present both some of the raw data of the text as well as the analysis. Offering as evidence the contextualized words of the narrator lends credence to the analysis suggested by the researcher. The language of the research report is often near to experience as lived rather than obscured by scientific jargon. Even Sigmund Freud struggled with the problem of making the study of experience scientific, commenting in 1893 that the nature of the subject was responsible for his works reading more like short stories than customary scientific reports. The aim of a narrative research report is to offer interpretation in a form that is faithful to the phenomena. In place of form-neutral "objectivized" language, many narrative researchers concern themselves with the poetics of their reports and strive to embody the phenomena in the language they use to convey their meanings. Narrative researchers stay respectful of their participants and reflect on how they are representing "the other" in the published report.

Some recent narrative research has concerned such topics as how people experience immigration, illness, identity, divorce, recovery from addictions, belief systems, and many other aspects of human experience. Any life experiences that people can narrate or represent become fertile ground for narrative research questions. The unity of a life resides in a construction of its narrative, a form in which hopes, dreams, despairs, doubts, plans, and emotions are all phrased.

Although narrative research is generally concerned with individuals' experience, some narrative researchers also consider narratives that particular collectives (societies, groups, or organizations) tell about themselves, their histories, their dominant mythologies, and their aspirations. Just as personal narratives create personal identity, group narratives serve to bond a community and distinguish it from other collectives.

A good narrative research report will detail a holistic overview of the phenomena under study, capturing data from the inside of the actors with a view to understanding and conceptualizing their meaning making in the contexts within which they live. Narrative researchers recognize that many interpretations of their observations are possible, and they argue their interpretive framework through careful description of what they have observed.

Narrative researchers also recognize that they themselves are narrators as they present their organization and interpretation of their data. They endeavor to make their work as interpreters transparent, writing about their own interactions with their participants and their data and remaining mindful of their own social location and personal predilections. This reflexive view of researcher as narrator opens questions about the representation of the other and the nature of interpretive authority, and these are addressed rather than elided.

Advantages and Disadvantages

A major appeal of narrative research is the opportunity to be exploratory and make discoveries free of the regimentation of prefabricated hypotheses, contrived variables, control groups, and statistics. Narrative research can be used to challenge conceptual hegemony in the social sciences or to extend the explanatory power of abstract theoretical ideas. Some of the most paradigm-defining conceptual revolutions in the study of human experience have come from narrative research—Sigmund Freud, Erik Erikson, and Carol Gilligan being the most prominent examples. New narrative researchers, however, struggle with the vaguely defined procedures on which this research depends and with the fact that interesting results cannot be guaranteed in advance. Narrative research is also labor intensive, particularly in the analysis phase, where text must be read and reread as insights and interpretations develop.

Narrative research is not generalizable to populations but rather highlights the particularities of experience. Many narrative researchers, however, endeavor to place the individual narratives they present in a broader frame, comparing and contrasting their conclusions with the work of others with related concerns. All people are like all other people, like some other people—and also are unique. Readers of narrative research are invited
National Council on Measurement in Education

assessments in education, and disseminating information regarding new developments in educational testing and the proper use of tests. To accomplish these goals, NCME hosts an annual conference (jointly scheduled with the annual conference of the American Educational Research Association), publishes two highly regarded journals covering research and practice in educational measurement, and partners with other professional organizations to develop and disseminate guidelines and standards for appropriate educational assessment practices and to further the understanding of the strengths and limitations of educational tests. In the following sections, some of the most important activities of NCME are described.

Dissemination Activities

The NCME publishes two peer-reviewed journals, both of which have four issues per year. The first is the Journal of Educational Measurement (JEM), which was first published in 1963. JEM publishes original research related to educational assessment, particularly advances in statistical techniques such as equating tests (maintaining score scales over time), test calibration (e.g., using item response theory), and validity issues related to appropriate test development and use (e.g., techniques for evaluating item and test bias). It also publishes reviews of books related to educational measurement and theoretical articles related to major issues and developments in educational measurement (e.g., reliability and validity theory). The second journal published by NCME is Educational Measurement: Issues and Practice (EM:IP), which was first published in 1982. EM:IP focuses on more applied issues, typically less statistical in nature, that are of broad interest to measurement practitioners. According to the EM:IP page on the NCME website, the primary purpose of EM:IP is "to promote a better understanding of educational measurement and to encourage reasoned debate on current issues of practical importance to educators and the public. EM:IP also provides one means of communication among NCME members and between NCME."

In addition to the two journals, NCME also publishes the Instructional Topics in Educational Measurement Series (ITEMS), which are instructional units on specific measurement topics of interest to measurement researchers and practitioners. The ITEMS units, which first appeared in 1987, are available for free and can be downloaded from the ITEMS page on the NCME website. As of 2010, there are 22 modules covering a broad range of topics such as how to equate tests, evaluate differential item functioning, or set achievement level standards on tests.

NCME has also partnered with other organizations to publish books and other materials designed to promote fair or improved practices related to educational measurement. The most significant partnership has been with the Joint Committee on Testing Standards, which produced the Standards for Educational and Psychological Testing in 1999 as well as the four previous versions of those standards (in 1954, 1966, 1974, and 1985). NCME also partnered with the American Council on Education to produce four versions of the highly acclaimed book Educational Measurement. Two books on evaluating teachers were also sponsored by NCME: the Handbook of Teacher Evaluation and the New Handbook of Teacher Evaluation: Assessing Elementary and Secondary School Teachers.

In addition to publishing journals, instructional modules, and books, NCME has also partnered with other professional organizations to publish material to inform educators or the general public about important measurement issues. For example, in 1990 it partnered with the American Federation of Teachers (AFT) and the National Education Association (NEA) to produce the Standards for Teacher Competence in the Educational Assessment of Students. It has also been an active member on the Joint Committee on Testing Practices (JCTP), which produced the ABCs of Testing, a video and booklet designed to inform parents and other lay audiences about the use of tests in schools and about important characteristics of quality educational assessments. NCME also worked with JCTP to produce the Code of Fair Testing Practices, which describes the responsibilities test developers and test users have for ensuring fair and appropriate testing practices. NCME disseminates this document for free at its website. NCME also publishes a quarterly newsletter, which can also be downloaded for free from its website.

Annual Conference

In addition to the aforementioned publications, the NCME's annual conference is another mechanism
with which it helps disseminate new findings and research on educational measurement. The annual conference, typically held in March or April, features three full days of paper sessions, symposia, invited speakers, and poster sessions in which psychometricians and other measurement practitioners can learn and dialog about new developments and issues in educational measurement and research. About 1,200 members attend the conference each year.

Governance Structure

The governance structure of NCME consists of an Executive Committee (President, President-Elect, and Past President) and a six-member Board of Directors. The Board of Directors includes all elected positions, including the President-Elect (also referred to as Vice President). In addition, NCME has 20 volunteer committees that are run by its members. Examples of these committees include the Outreach and Partnership Committee and the Diversity Issues and Testing Committee.

Joining NCME

NCME is open to all professionals, and membership includes subscriptions to the NCME Newsletter, JEM, and EM:IP. Graduate students can join for a reduced rate and can receive all three publications as part of their membership. All professionals interested in staying current with respect to new developments and research related to assessing students are encouraged to become members. To join the NCME, visit the NCME website or write the NCME Central Office at 2810 Crossroads Drive, Suite 3800, Madison, WI 53718.

Stephen G. Sireci

Further Readings

American Federation of Teachers, National Education Association, & National Council on Measurement in Education. (1990). Standards for teacher competence in the educational assessment of students. Educational Measurement: Issues and Practice, 9, 30–34.
Brennan, R. L. (Ed.). (2006). Educational measurement (4th ed.). Washington, DC: American Council on Education/Praeger.
Coffman, W. (1989). Past presidents' committee: A look at the past, present, and future of NCME. 1. A look at the past. (ERIC Document Reproduction Service No. ED308242).
Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC: American Psychological Association.
Joint Committee on Testing Practices. (1993). ABCs of testing. Washington, DC: National Council on Measurement in Education.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC: American Psychological Association.
Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). Washington, DC: American Council on Education.
Millman, J. (1981). Handbook of teacher evaluation. Beverly Hills, CA: Sage.
Millman, J., & Darling-Hammond, L. (1990). The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Newbury Park, CA: Sage.
Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Websites

National Council on Measurement in Education: http://www.ncme.org
Natural Experiments

they are fortuitous. They do, however, have distinct advantages over observational studies and might, in some circumstances, address questions that randomized controlled trials could not address. A key feature of natural experiments is that they offer insight into causal processes, which is one reason why they have an established role in developmental science.

Natural experiments represent an important research tool because of the methodological limits of naturalistic and experimental designs and the need to triangulate and confirm findings across multiple research designs. Notwithstanding their own set of practical limitations and threats to generalizability of the results, natural experiments have the potential to deconfound alternative models and accounts and thereby contribute significantly to developmental science and other areas of research. This entry discusses natural experiments in the context of other research designs and then illustrates how their use in developmental science has provided information about the relationship between early exposure to stress and children's development.

The Scientific Context of Natural Experiments

The value of natural experiments is best appreciated when viewed in the context of other designs. A brief discussion of other designs is therefore illustrative. Observational or naturalistic studies—cross-sectional or longitudinal assessments in which individuals are observed and no experimental influence is brought to bear on them—generally cannot address causal claims. That is because a range of methodological threats, including selection biases and coincidental or spurious associations, undermine causal claims. So, for example, in the developmental and clinical psychology literature, there is considerable interest in understanding the impact of parental mental health—maternal depression is probably the most studied example—on children's physical and mental development. Dozens of studies have addressed this question using a variety of samples and measures. However, almost none of these studies—even large-scale cohort and population studies—are equipped to identify causal mechanisms, for several reasons: (a) genetic transmission is confounded with family processes and other psychosocial risks; (b) maternal depression is virtually always accompanied by other risks that are also reliably linked with children's maladjustment, such as poor parenting and marital conflict; and (c) mate selection for psychiatric disorders means that depressed mothers are more likely to have a partner with a mental illness, which confounds any specific "effect" that investigators might wish to attribute to maternal depression per se. Most of the major risk factors relevant to psychological well-being and public health co-occur; in general terms, risk exposures are not distributed randomly in the population. Indeed, one of the more useful lessons from developmental science has been to demonstrate the ways in which exposures to risk accrue in development.

One response to the problems of selection bias or confounded risk exposure is to address the problem analytically. That is, even if, for example, maternal depression is inherently linked with compromised parenting and family conflict, the "effect" of maternal depression might nevertheless be derived if the confounded variables (compromised parenting and family conflict) are statistically controlled for. There are some problems with that solution, however. If risk processes are confounded in nature, then statistically controlling for one or the other is not a satisfying solution; interpretations of the maternal depression "effect" will be possible but probably not (ecologically) valid. Sampling design strategies to obtain the same kind of leverage, such as sampling families with depressed mothers only if there is an absence of family conflict, will yield an unrepresentative sample of affected families with minimal generalizability. Case-control designs try to gain some leverage over cohort observational studies by tracking a group or groups of individuals, some of whom have a condition(s) of interest. Differences between groups are inferred to be attributable to the condition(s) of interest because the groups were matched on key factors. That is not always possible, and the relevant factors to control for are not always known; as a result, between-group and even within-subject variation in these designs is subject to confounders.

A potential methodological solution is offered by experimental designs. So, for example, testing the maternal depression hypothesis referred to previously might be possible to the extent that some
878 Natural Experiments
affected mothers are randomly assigned to treatment for depression. That would offer greater purchase on the question of whether maternal depression per se was a causal contributor to children's adjustment difficulties. Interestingly, intervention studies have shown that a great many questions about causal processes emerge even after a successful trial. For example, cognitive-behavioral treatment might successfully resolve maternal depression and, as a result, children of the treated mothers might show improved outcomes relative to children whose depressed mothers were not treated. It would not necessarily follow, however, that altering maternal depression was the causal mediator affecting child behavior. It might be that children's behavior improved because the no-longer-depressed mothers could engage as parents in a more effective manner, because there was a decrease in inter-parental conflict, or because of any of several other secondary effects of the depression treatment. In other words, questions about causal mechanisms are not necessarily resolved fully by experimental designs.

Investigators in applied settings are also aware that some contexts are simply not amenable to randomized control. School-based interventions sometimes meet resistance to random assignment because principals, teachers, or parents object to the idea that some children needing intervention might not get the presumed better treatment. Court systems are often nonreceptive experimental proving grounds. That is, no matter how compelling data from a randomized controlled trial might be for addressing a particular question, there are circumstances in which a randomized controlled trial is extremely impractical or unethical.

Natural experiments are, therefore, particularly valuable where traditional nonexperimental designs might be scientifically inadequate or where experimental designs may be impractical or unethical. And natural experiments are useful scientific tools even where other designs might be judged capable of testing the hypothesis of interest. That is because findings need to be confirmed not only by multiple studies but also by multiple designs. That is, natural experiments can provide a helpful additional scientific "check" on findings generated from naturalistic or experimental studies. There are many illustrations of the problems in relying on findings from a single design. Researchers are now accustomed to defining an effect or association as robust if it is replicated across samples and measures. A finding should also replicate across designs. No single research sample and no single research design is satisfactory for testing causal hypotheses or inferring causal mechanisms.

Finally, identifying natural experiments can be an engaging and creative process, and studies based on natural experiments are far less expensive and arduous to investigate—they occur naturally—than those using conventional research designs; they can also be common. Thus, dramatic shifts in income might be exploited to investigate income dynamics and children's well-being; cohort changes in the rates of specific risks (e.g., divorce) might be used to examine psychosocial accounts of children's adjustment problems. Hypotheses about genetic and/or psychosocial risk exposure might be addressed using adoption and twin designs, and many studies exploit the arbitrariness of the age cutoff for school entry to contrast exposure with maturation accounts of reading, language, and mathematics ability; the impact of compulsory schooling; and many other practical and conceptual questions.

Like all other forms of research design, natural experiments have their own special set of limitations. But they can offer both novel and confirmatory findings. Examples of how natural experiments have informed the debate on early risk exposure and children's development are reviewed below.

Natural Experiments to Examine the Long-Term Effects of Early Risk Exposure

Understanding the degree to which, and by what mechanisms, early exposure to stress has long-term effects is a primary question for developmental science with far-reaching clinical and policy applications. This area of inquiry has been studied extensively and experimentally in animal models. But animal studies are inadequate for deriving clinical and public health meaning; research in humans is essential. However, sound investigation to inform the debate in humans has been overshadowed by claims that might overplay the evidence, as in the case of extending animal findings to humans
willy-nilly. The situation is compounded by the general lack of relevant human studies that have leverage for deriving claims about early experience and exposure per se. That is, despite the hundreds of studies that assess children's exposure to early risk, almost none can differentiate the effects of early risk exposure from later risk exposure, because the exposure to risk—maltreatment, poverty, and parental mental illness—is continuous rather than precisely timed or specific to the child's early life. Intervention studies have played an important role in this debate, and many studies now show long-term effects of early interventions. However, because developmental timing was not separated from intervention intensity in most cases, these studies do not resolve issues about early experience as such. In other words, conventional research designs have not had much success in tackling major questions about early experience. It is not surprising, then, that natural experiments have played such a central role in this line of investigation.

Several different forms of natural experiments to study the effects of early exposure have been reported. One important line of inquiry comes from the Dutch birth cohort exposed to prenatal famine during the Nazi blockade. Alan S. Brown and colleagues found that the rate of adult unipolar and bipolar depression requiring hospitalization was increased among those whose mothers experienced starvation during the second and third trimesters of pregnancy. The ability of the study to contrast rates of disorder among individuals whose mothers were and were not pregnant during the famine allowed unprecedented experimental "control" over the timing of exposure. A second feature that makes the study a natural experiment is that it capitalized on a situation that is ethically unacceptable and so impossible to design on purpose.

Another line of study that has informed the early experience debate concerns individuals whose caregiving experience undergoes a radical change—far more radical than any traditional psychological intervention could create. Of course, radical changes in caregiving do not happen ordinarily. A notable exception is those children who are removed from abusive homes and placed into nonabusive or therapeutic settings (e.g., foster care). Studies of children in foster care are, therefore, significant because this is a population for whom the long-term outcomes are generally poor and because this is the context in which natural experiments of altering care are conducted. Clearly, there are inherent complications, but the findings have provided some of the most interesting data in clinical and developmental psychology.

An even more extreme context involves children who experienced gross deprivation via institutional care and were then adopted into low-/normal-risk homes. There are many studies of this sort. One is the English and Romanian Adoptees (ERA) study, a long-term follow-up of children who were adopted into England after institutional rearing in Romania; the study also includes an early-adopted sample of children in England as a comparison group. A major feature of this particular natural experiment—and what makes it and similar studies of ex-institutionalized children noteworthy—is that there was a remarkable discontinuity in caregiving experience, from the most severe to a normal-risk setting. That feature offers unparalleled leverage for testing the hypothesis that it is early caregiving risk that has persisting effects on long-term development. The success of the natural experiment design depends on many considerations, including, for example, the representativeness of the families who adopted from Romania relative to the general population of families. A full account of this issue is not within the scope of this entry, but it is clear that the impact of findings from natural experiments needs to be judged in relation to these kinds of sampling and other methodological features.

The findings from studies of ex-institutionalized samples correspond across studies. So, for example, there is little doubt now from long-term follow-up assessments that early caregiving deprivation can have a long-term impact on attachment and intellectual development, with a sizable minority of children showing persisting deficits many years after removal from the institutional setting and despite many years in a resourceful, caring home environment. Findings also show that individual differences in response to early severe deprivation are substantial and continuous. Research into the effects of early experience has depended on these natural experiments because conventional research designs were either impractical or unethical.

Thomas G. O'Connor
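The core contrast that drives many natural experiments, comparing outcome rates in an exposed group and an unexposed comparison group (as in the famine cohort described above), can be sketched in a few lines. The counts below are invented purely for illustration; they are not data from any study cited in this entry.

```python
# A minimal, hypothetical sketch of the basic contrast in a natural
# experiment: comparing outcome rates between an exposed group and an
# unexposed comparison group. All counts are invented for illustration;
# they are not data from any study discussed in this entry.

def rate(cases: int, n: int) -> float:
    """Proportion of individuals in a group who show the outcome."""
    return cases / n

# Hypothetical counts: offspring of mothers pregnant during the exposure
# period versus offspring of mothers pregnant outside it.
exposed_cases, exposed_n = 30, 1000
unexposed_cases, unexposed_n = 15, 1000

risk_exposed = rate(exposed_cases, exposed_n)        # 0.03
risk_unexposed = rate(unexposed_cases, unexposed_n)  # 0.015

# The risk ratio summarizes how much more common the outcome is
# among the exposed than among the unexposed.
risk_ratio = risk_exposed / risk_unexposed

print(f"Risk ratio (exposed vs. unexposed): {risk_ratio:.1f}")
```

With these hypothetical counts, the outcome is twice as common in the exposed group. In an actual natural experiment, such a ratio would still have to be interpreted in light of the sampling, matching, and confounding issues discussed in this entry.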
See also Case-Only Design; Case Study; Narrative Research; Observational Research; Observations

Further Readings

Anderson, G. L., Limacher, M., Assaf, A. R., Bassford, T., Beresford, S. A., Black, H., et al. (2004). Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: The Women's Health Initiative randomized controlled trial. Journal of the American Medical Association, 291, 1701–1712.
Beckett, C., Maughan, B., Rutter, M., Castle, J., Colvert, E., Groothues, C., et al. (2006). Do the effects of early severe deprivation on cognition persist into early adolescence? Findings from the English and Romanian adoptees study. Child Development, 77, 696–711.
Campbell, F. A., Pungello, E. P., Johnson, S. M., Burchinal, M., & Ramey, C. T. (2001). The development of cognitive and academic abilities: Growth curves from an early childhood educational experiment. Developmental Psychology, 37, 231–242.
Collishaw, S., Goodman, R., Pickles, A., & Maughan, B. (2007). Modelling the contribution of changes in family life to time trends in adolescent conduct problems. Social Science and Medicine, 65, 2576–2587.
Grady, D., Rubin, S. M., Petitti, D. B., Fox, C. S., Black, D., Ettinger, B., et al. (1992). Hormone therapy to prevent disease and prolong life in postmenopausal women. Annals of Internal Medicine, 117, 1016–1037.
O'Connor, T. G., Caspi, A., DeFries, J. C., & Plomin, R. (2000). Are associations between parental divorce and children's adjustment genetically mediated? An adoption study. Developmental Psychology, 36, 429–437.
Tizard, B., & Rees, J. (1975). The effect of early institutional rearing on the behavioral problems and affectional relationships of four-year-old children. Journal of Child Psychology and Psychiatry, 16, 61–73.

NATURALISTIC INQUIRY

Naturalistic inquiry is an approach to understanding the social world in which the researcher observes, describes, and interprets the experiences and actions of specific people and groups in societal and cultural context. It is a research tradition that encompasses qualitative research methods originally developed in anthropology and sociology, including participant observation, direct observation, ethnographic methods, case studies, grounded theory, unobtrusive methods, and field research methods. Working in the places where people live and work, naturalistic researchers draw on observations, interviews, and other sources of descriptive data, as well as their own subjective experiences, to create rich, evocative descriptions and interpretations of social phenomena. Naturalistic inquiry designs are valuable for exploratory research, particularly when relevant theoretical frameworks are not available or when little is known about the people to be investigated. The characteristics, methods, indicators of quality, philosophical foundations, history, disadvantages, and advantages of naturalistic research designs are described below.

Characteristics of Naturalistic Research

Naturalistic inquiry involves the study of a single case, usually a self-identified group or community. Self-identified group members are conscious of boundaries that set them apart from others. When qualitative (naturalistic) researchers select a case for study, they do so because it is of interest in its own right. The aim is not to find a representative case from which to generalize findings to other, similar individuals or groups. It is to develop interpretations and local theories that afford deep insights into the human experience.

Naturalistic inquiry is conducted in the field, within communities, homes, schools, churches, hospitals, public agencies, businesses, and other settings. Naturalistic researchers spend large amounts of time interacting directly with participants. The researcher is the research instrument, engaging in daily activities and conversations with group members to understand their experiences and points of view. Within this tradition, language is considered a key source of insight into socially constructed worlds. Researchers record participants' words and actions in detail with minimal interpretation. Although focused on words, narratives, and discourse, naturalistic researchers learn through all of their senses. They collect data at the following experiential levels: cognitive, social, affective, physical, and political/ideological. This strategy adds
depth and texture to the body of data qualitative researchers describe, analyze, and interpret.

Naturalistic researchers study research problems and questions that are initially stated broadly and then gradually narrowed during the course of the study. In non-naturalistic, experimental research designs, terms are defined, research hypotheses are stated, and procedures for data collection are established before the study begins. In contrast, qualitative research designs develop over time as researchers formulate new understandings and refine their research questions. Throughout the research process, naturalistic researchers modify their methodological strategies to obtain the kinds of data required to shed light on more focused or intriguing questions. One goal of naturalistic inquiry is to generate new questions that will lead to improved observations and interpretations, which will in turn foster the formulation of still better questions. The process is circular but ends when the researcher has created an account that seems to capture and make sense of all the data at hand.

Naturalistic Research Methods

General Process

When naturalistic researchers conduct field research, they typically go through the following common sequence of steps:

1. Gaining access to and entering the field site
2. Gathering data
3. Ensuring accuracy and trustworthiness (verifying and cross-checking findings)
4. Analyzing data (begins almost immediately and continues throughout the study)
5. Formulating interpretations (also an ongoing process)
6. Writing up findings
7. Member checking (sharing conclusions and conferring with participants)
8. Leaving the field site

Sampling

Naturalistic researchers employ purposive rather than representative or random sampling methods. Participants are selected based on the purpose of the study and the questions under investigation, which are refined as the study proceeds. This strategy might increase the possibility that unusual cases will be identified and included in the study. Purposive sampling supports the development of theories grounded in empirical data tied to specific local settings.

Analyzing and Interpreting Data

The first step in qualitative data analysis involves transforming experiences, conversations, and observations into text (data). When naturalistic researchers analyze data, they review field notes, interview transcripts, journals, summaries, and other documents, looking for repeated patterns (words, phrases, actions, or events) that are salient by virtue of their frequency. In some instances, the researcher might use descriptive statistics to identify and represent these patterns.

Interpretation refers to making sense of what these patterns or themes might mean, developing explanations, and making connections between the data and relevant studies or theoretical frameworks. For example, reasoning by analogy, researchers might note parallels between athletic events and anthropological descriptions of ritual processes. Naturalistic researchers draw on their own understanding of social, psychological, and economic theory as they formulate accounts of their findings. They work inductively, from the ground up, and eventually develop location-specific theories or accounts based on analysis of primary data.

As a by-product of this process, new research questions emerge. Whereas traditional researchers establish hypotheses prior to the start of their studies, qualitative researchers formulate broad research questions or problem statements at the start, then reformulate or develop new questions as the study proceeds. The terms grounded theory, inductive analysis, and content analysis, although not synonymous, refer to this process of making sense of and interpreting data.

Evaluating Quality

The standards used to evaluate the adequacy of traditional, quantitative studies should not be used to assess naturalistic research projects. Quantitative
and qualitative researchers work within distinct traditions that rest on different philosophical assumptions, employ different methods, and produce different products. Qualitative researchers argue among themselves about how best to evaluate naturalistic inquiry projects, and there is little consensus on whether it is possible or appropriate to establish common standards by which such studies might be judged. However, many characteristics are widely considered to be indicators of merit in the design of naturalistic inquiry projects.

Immersion

Good qualitative studies are time consuming. Researchers must become well acquainted with the field site and its inhabitants as well as the wider context within which the site is located. They also immerse themselves in the data analysis process, through which they read, review, and summarize their data.

Transparency and Rigor

When writing up qualitative research projects, researchers must put themselves in the text, describing how the work was conducted, how they interacted with participants, and how and why they decided to proceed as they did, and noting how participants might have been affected by these interactions. Whether the focus is on interview transcripts, visual materials, or field research notes, the analytical process requires meticulous attention to detail and an inductive, bottom-up process of reasoning that should be made clear to the reader.

Reflexivity

Naturalistic inquirers do not seek to attain objectivity, but they must find ways to articulate and manage their subjective experiences. Evidence of one or more forms of reflexivity is expected in naturalistic inquiry projects. Positional reflexivity calls on researchers to attend to their personal experiences—past and present—and describe how their own personal characteristics (power, gender, ethnicity, and other intangibles) played a part in their interactions with and understandings of participants. Textual reflexivity involves skeptical, self-critical consideration of how authors (and the professional communities in which they work) employ language to construct their representations of the social world. A third form of reflexivity examines how participants and the researchers who study them create social order through practical, goal-oriented actions and discourse.

Comprehensiveness and Scope

The cultural anthropologist Clifford Geertz used the term thick description to convey the level of rich detail typical of qualitative, ethnographic descriptions. When writing qualitative research reports, researchers place the study site and findings as a whole within societal and cultural contexts. Effective reports also incorporate multiple perspectives, including, for example, perspectives of participants from all walks of life within a single community or organization.

Accuracy

Researchers are expected to describe the steps taken to verify findings and interpretations. Strategies for verification include triangulation (using and confirming congruence among multiple sources of information), member checking (negotiating conclusions with participants), and auditing (critical review of the research design, processes, and conclusions by an expert).

Claims and Warrants

In well-designed studies, naturalistic researchers ensure that their conclusions are supported by empirical evidence. Furthermore, they ensure that their conclusions follow logically from the design of the study, including the review of pertinent literature, data collection, analysis, interpretation, and the researcher's inferential process.

Attention to Ethics

Researchers should describe the steps taken to protect participants from harm and discuss any ethical issues that arose during the course of the study.

Fair Return

Naturalistic inquiry projects are time consuming not only for researchers but also for participants,
who teach researchers about their ways of life and share their perspectives as interviewees. Researchers should describe the steps they took to compensate or provide fair return to participants for their help. Research leads to concrete benefits for researchers (degree completion or career advancement). Researchers must examine what benefits participants will gain as a result of the work and design their studies to ensure reciprocity (balanced rewards).

Coherence

Good studies call for well-written and compelling research reports. Standards for writing are genre specific. Postmodern authors defy tradition through experimentation and deliberate violations of writing conventions. For example, some authors avoid writing in clear, straightforward prose in order to express more accurately the complexities inherent in the social world and within the representational process.

Veracity

A good qualitative report brings the setting and its residents to life. Readers who have worked or lived in similar settings find the report credible because it reflects aspects of their own experiences.

Illumination

Good naturalistic studies go beyond mere description to offer new insights into social and psychological phenomena. Readers should learn something new and important about the social world and the people studied, and they might also gain a deeper understanding of their own ways of life.

Philosophical Foundations

Traditional scientific methods rest on philosophical assumptions associated with logical positivism. When working within this framework, researchers formulate hypotheses that are drawn from established theoretical frameworks, define variables by stipulating the processes used to measure them, collect data to test their hypotheses, and report their findings objectively. Objectivity is attained through separation of the researcher from participants and by dispassionate analysis and interpretation of results. In contrast, naturalistic researchers tap into their own subjective experiences as a source of data, seeking experiences that will afford them an intuitive understanding of social phenomena through empathy and subjectivity. Qualitative researchers use their subjective experiences as a source of data to be carefully described, analyzed, and shared with those who read their research reports.

For the naturalistic inquirer, objectivity and detachment are neither possible nor desirable. Human experiences are invariably influenced by the methods used to study them. The process of being studied affects all humans who become subjects of scientific attention. The presence of an observer affects those observed. Furthermore, the observer is changed through engaging with and observing the other. Objectivity is always a matter of degree.

Qualitative researchers are typically far less concerned with objectivity, as this term is understood within traditional research approaches, than with intersubjectivity. Intersubjectivity is the process by which humans share common experiences and subscribe to shared understandings of reality. Naturalistic researchers seek involvement and engagement rather than detachment and distance. They believe that humans are not rational beings and cannot be understood adequately through objective, disembodied analysis. Authors critically examine how their theoretical assumptions, personal histories, and methodological decisions might have influenced findings and interpretations (positional reflexivity). In a related vein, naturalistic researchers do not believe that political neutrality is possible or helpful. Within some qualitative research traditions, researchers collaborate with participants to bring about community-based political and economic change (social justice).

Qualitative researchers reject determinism, the idea that human behaviors are lawful and can be predicted. Traditional scientists try to discover relationships among variables that remain consistent across individuals beyond the experimental setting. Naturalistic inquiry rests on the belief that studying humans requires different methods than those used to study the material world. Advocates emphasize that no shared, universal reality remains
constant over time and across cultural groups. The phenomena of most interest to naturalistic researchers are socially constructed, constantly changing, and multiple. Naturalistic researchers hold that all human phenomena occur within particular contexts and cannot be interpreted or understood apart from these contexts.

History

The principles that guide naturalistic research methods were developed in biology, anthropology, and sociology. Biologist Charles Darwin developed the natural history method, which employs detailed observation of the natural world directed by specific research questions, theory building based on analysis of patterns in the data, and confirmation (testing) with additional observations in the field. Qualitative researchers use similar strategies, which transform experiential, qualitative information gathered in the field into data amenable to systematic investigation, analysis, and theory development.

Ancient adventurers, writers, and missionaries wrote the first naturalistic accounts, describing the exotic peoples they encountered on their travels. During the early decades of the 20th century, cultural anthropologists and sociologists pioneered the use of ethnographic research methods for the scientific study of social phenomena. Ethnography is both a naturalistic research methodology and a written report that describes field study findings. Although there are many different ethnographic genres, all of them employ direct observation of naturally occurring events in the field. Early in the 20th century, University of Chicago sociologists used ethnographic methods to study urban life, producing pioneering studies of immigrants, crime, work, youth, and group relations. Sociologist Herbert Blumer, drawing on George Herbert Mead, William I. Thomas, and John Dewey, developed a rationale for the naturalistic study of the social world. In the 1970s, social scientists articulated ideas and theoretical issues pertinent to naturalistic inquiry, and interest in qualitative research methods grew. In the mid-1980s, Yvonna Lincoln and Egon Guba published Naturalistic Inquiry, which provided a detailed critique of positivism and examined its implications for social research. Highlighting the features that set qualitative research apart from other methods, these authors also translated key concepts across what they thought were profoundly different paradigms (disciplinary worldviews). In recent years, qualitative researchers have considered the implications of critical, feminist, postmodern, and poststructural theories for their enterprise. The recognition or rediscovery that researchers create the phenomena they study, and that language plays an important part in this process, has inspired methodological innovations and lively discussions. The discourse on naturalistic inquiry remains complex and ever changing. New issues and controversies emerge every year, reflecting philosophical debates within and across many academic fields.

Methodological Disadvantages and Advantages

Disadvantages

Many areas are not suited to naturalistic investigation. Naturalistic research designs cannot uncover cause-and-effect relationships, and they cannot help researchers evaluate the effectiveness of specific medical treatments, school curricula, or parenting styles. They do not allow researchers to measure particular attributes (motivation, reading ability, or test anxiety) or to predict the outcomes of interventions with any degree of precision. Qualitative research permits only claims about the specific case under study. Generalizations beyond the research site are not appropriate. Furthermore, naturalistic researchers cannot set up logical conditions whereby they can demonstrate their own assumptions to be false.

Naturalistic inquiry is time consuming and difficult. Qualitative methods might seem easier to use than traditional experimental and survey methods because they do not require mastery of technical statistical and analytical methods. However, naturalistic inquiry is one of the most challenging research approaches to learn and employ. Qualitative researchers tailor methods to suit each project, revising data-collection strategies as questions and research foci emerge. Naturalistic researchers must have a high tolerance for uncertainty and the ability to work independently for extended periods of time, and they must also be able to think creatively under pressure.
Naturalistic Observation 885
understanding of communication among animal species through naturalistic observation, introducing such terminology as imprinting, fixed action pattern, sign stimulus, and releaser to the scientific lexicon. All the investigations mentioned here were notable for their strong ecological validity, as they were conducted within a context reflective of the normal life experiences of the subjects. It is highly doubtful that the same richness of content could have been obtained in an artificial environment devoid of the concurrent factors that would normally have accompanied the observed behaviors.

There are many instances in which naturalistic observation also yields valuable insight to psychologists, social scientists, anthropologists, ethnographers, and behavioral scientists in the study of human behavior. For example, social deficits symptomatic of certain psychological or developmental disorders (such as autism, childhood aggression, or anxiety) might be evidenced more clearly in a typical context than under simulated conditions. The dynamics within a marital or family relationship likewise tend to be most perceptible when the participants interact as they would under everyday circumstances. In the study of broader cultural phenomena, a researcher might collect data by living among the population of interest and witnessing activities that could be observed in a real-life situation only after earning the people's trust and their acceptance as an "insider."

This entry begins with the historic origins of naturalistic observation. Next, the four types of naturalistic observation are described, and naturalistic observation and experimental methods are compared. Last, this entry briefly discusses the future direction of naturalistic observation.

Historic Origins

The field of qualitative research gained prominence in the United States during the early 20th century. Its emergence as a recognized method of scientific investigation was taking place simultaneously in Europe, although the literature generated by many of its proponents was not available in the Western Hemisphere until after World War II. At the University of Chicago, such eminent researchers as

the 1920s and 1930s. The approach became widely adopted among anthropologists during these same two decades. In Mead's 1928 study "Coming of Age in Samoa," data were collected while she resided among the inhabitants of a small Samoan village, making possible her groundbreaking revelations on the lives of girls and women in this island society.

Over the years, naturalistic observation became a widely used technique throughout the many scientific disciplines concerned with human behavior. Among its best-known practitioners was Jean Piaget, who based his theory of cognitive development on observations of his own children throughout the various stages of their maturation; in addition, he would watch other children at play, listening to and recording their interactions. Jeremy Tunstall conducted a study of fishermen in the English seaport of Hull by living among them and working beside them, a sojourn that led to the publication of his book Fishermen: The Sociology of an Extreme Occupation in 1962. Stanley Milgram employed naturalistic observation in an investigation of the phenomenon of "familiar strangers" (people who encountered but never spoke to one another) among city dwellers, watching railway commuters day after day as they waited to board the train to their workplaces in New York City. At his Family Research Laboratory in Seattle, Washington, John Gottman has used audiovisual monitoring as a component of his marriage counseling program since 1986. Couples stay overnight in a fabricated apartment at the laboratory, and both qualitative data (such as verbal interactions, proxemics, and kinesics) and quantitative data (such as heart rate, pulse amplitude, and skin conductivity) are collected. In his book The Seven Principles for Making Marriage Work, Gottman reported that the evidence gathered during this phase of therapy enabled him to predict whether a marriage would fail or succeed with 91% accuracy.

Types of Naturalistic Observation

Naturalistic observation might be divided into four distinct categories. Each differs from the others in
Robert Park, John Dewey, Margaret Mead, and terms of basic definitions, distinguishing features,
Charles Cooley contributed greatly to the develop- strengths and limitations, and appropriateness for
ment of participant observation methodology in specific research designs.
various tools and instruments openly, thus enabling easier and more complete recording of observations. (In contrast, a covert observer might be forced to write hasty notes on pieces of paper to avoid suspicion and to attempt reconstruction of fine details from memory later on.) Still, artifacts associated with awareness of the investigator's presence might persist, even though observation from a distance might tend to exert less influence on subjects' behavior. In addition, there is virtually no opportunity to question subjects should the researcher wish to obtain subsequent clarification of the meaning attached to an event. The observer might, thus, commit the error of making subjective interpretations based on inconclusive evidence.

Covert Nonparticipant Observation

This procedure involves observation conducted apart from the subjects being studied. As in covert participant observation, the identity of the investigator is not revealed. Data are often secretly recorded and hidden; alternatively, observations might be documented at a later time when the investigator is away from the subjects. Witnessing events by means of electronic devices is also a form of covert nonparticipant observation. For example, the researcher might watch a videotape of children at recess to observe peer aggression.

The covert nonparticipant observer enjoys the advantages of candid subject behavior as well as the availability of apparatus with which to record data immediately. However, as in covert participant observation, measures taken to preserve anonymity might also curtail access to the full range of observations. Remote surveillance might similarly offer only a limited glimpse of the sphere of contextual factors, thereby diminishing the usefulness of the data. Finally, the previously discussed ethical infractions associated with any form of covert observation, as well as the potential legal repercussions, make using this method highly controversial.

Comparing Naturalistic Observation With Experimental Methods

The advantages offered by naturalistic observation are many, whether in conjunction with experimental research or as the primary constituent of a study. First, there is less formal planning than in the experimental method, and also more flexibility is involved in accommodating change throughout the research process. These attributes make for an ideal preliminary procedure, one that might serve to lay the groundwork for a more focused investigation. As mentioned earlier, unexpected observations might generate new hypotheses, thereby contributing to the comprehensiveness of any research based thereon.

By remaining unobtrusive, the observer has access to behaviors that are more characteristic, more spontaneous, and more diverse than those one might witness in a laboratory setting. In many instances, such events simply cannot be examined in a laboratory setting. To learn about the natural behavior of a wild animal species, the workplace dynamics of a corporate entity, or the culturally prescribed roles within an isolated society, the investigator must conduct observations in the subjects' day-to-day environment. This requirement ensures a greater degree of ecological validity than one could expect to achieve in a simulated environment. However, there are no implications for increased external validity. As subjects are observed by happenstance, not selected according to a sampling procedure, representativeness cannot be guaranteed. Any conclusions drawn must necessarily be limited to the sample studied and cannot generalize to the population.

There are other drawbacks to naturalistic observation vis-à-vis experimental methods. One of these is the inability to control the environment in which subjects are being observed. Consequently, the experimenter can derive descriptive data from observation but cannot establish cause-and-effect relationships. Not only does this preclude explanation of why behaviors occur, but also it limits the prediction of behaviors. Additionally, the natural conditions observed are unique in all instances, thus rendering replication unfeasible.

The potential for experimenter bias is also significant. Whereas the number of times a behavior is recorded and the duration of the episode are both unambiguous measures, the naturalistic observer lacks a clear-cut system for measuring the extent or magnitude of a behavior. Perception of events might thus be influenced by any number of factors, including personal worldview. An especially problematic situation might arise when the observer is informed of the hypothesis and of the conditions under
investigation, as this might lead to seeking confirmatory evidence. Another possible error is that of the observer recording data in an interpretative rather than a descriptive manner, which can result in an ex post facto conclusion of causality. The researcher's involvement with the group in participant observation might constitute an additional source of bias. Objectivity can suffer because of group influence, and data might also be colored by a strong positive or negative impression of the subjects.

Experimental approaches to research differ from naturalistic observation on a number of salient points. One primary advantage of the true experiment is that a hypothesis can be tested and a cause-and-effect relationship can be demonstrated. The independent variable of interest is systematically manipulated, and the effects of this manipulation on the dependent variable are observed. Because the researcher controls the environment in which the study is conducted, it is thus possible to eliminate confounding variables. Besides enabling attribution of causality, this design also provides evidence of why a behavior occurs and allows prediction of when and under what conditions the behavior is likely to occur again. Unlike naturalistic observation, an experimental study can possess internal and external validity, although the controls inherent in this approach can diminish ecological validity, as it might be difficult to eliminate extraneous variables while maintaining some semblance of a real-world setting.

An additional benefit of experimental research is the relative stability of the environment in which the researcher conducts the study. In contrast, participant observation might entail a high degree of stress and personal risk when working with certain groups (such as gang members or prison inmates). This method also demands investment of considerable time and expense, and the setting might not be conducive to management of other responsibilities.

Although experimental design is regarded as a more conclusive method than naturalistic observation and is more widely used in science, it is not suitable for all research. Ethical and legal guidelines might forbid an experimental treatment if it is judged capable of harming subjects. For example, in studying the progression of a viral infection such as HIV, the investigator is prohibited from causing subjects to contract the illness and must instead recruit those who have already tested positive for the virus. Similarly, only preexisting psychiatric conditions (such as posttraumatic stress disorder) are studied, as subjects cannot be exposed to manipulations that could cause psychological or emotional trauma. Certain factors might be difficult or impossible to measure, as is the case with various cognitive processes.

The researcher's choice of method can either contribute to or, conversely, erode scientific rigor. If a convenience sample is used, if there are too few subjects in the sample, if randomization is flawed, or if the sample is otherwise not representative of the population from which it is selected, then the study will not yield generalizable results. The use of an instrument with insufficient reliability and validity might similarly undermine the experimental design. Nonetheless, bias and human error are universal in all areas of research. Self-awareness, critical thinking, and meticulous research methods can do much to minimize their ill effects.

Future Directions

Robert Elliott and his colleagues proposed new guidelines for the publication of qualitative research studies in 1999, with the goal of encouraging legitimization, quality control, and subsequent development of this approach. Their exposition of both the traditional value and the current evolution of qualitative research was a compelling argument in support of its function not only as a precursor to experimental investigations but also as a method that addressed a different category of questions and therefore merited recognition in its own right. Given the ongoing presence of nonexperimental approaches in college and university curricula and in the current literature, it is likely that naturalistic observation will continue to play a vital role in scientific research.

Barbara M. Wells

See also Descriptive Statistics; Ecological Validity; Naturalistic Inquiry; Observational Research; Observations; Qualitative Research

Further Readings

Davidson, B., Worrall, L., & Hickson, L. (2003). Identifying the communication activities of older
people with aphasia: Evidence from naturalistic observation. Aphasiology, 17, 243–264.
Education Forum. Primary research methods. Retrieved August 24, 2008, from http://www.educationforum.co.uk/Health/primarymethods.ppt
Elliott, R., Fisher, C. T., & Rennie, D. L. (1999). Evolving guidelines for publication of qualitative research studies in psychology and related fields. British Journal of Clinical Psychology, 38, 215–229.
Fernald, D. (1999). Research methods. In Psychology. Upper Saddle River, NJ: Prentice Hall.
Mehl, R. (2007). Eavesdropping on health: A naturalistic observation approach for social health research. Social and Personality Psychology Compass, 1, 359–380.
Messer, S. C., & Gross, A. M. (1995). Childhood depression and family interaction: A naturalistic observation study. Journal of Clinical Child Psychology, 24, 77–88.
Pepler, D. J., & Craig, W. M. (1995). A peek behind the fence: Naturalistic observations of aggressive children with remote audiovisual recording. Developmental Psychology, 31, 548–553.
Spata, A. V. (2003). Research methods: Science and diversity. New York: Wiley.
Weinrott, M. R., & Jones, R. R. (1984). Overt versus covert assessment of observer reliability. Child Development, 55, 1125–1137.

NESTED FACTOR DESIGN

In nested factor design, two or more factors are not completely crossed; that is, the design does not include each possible combination of the levels of the factors. Rather, one or more factors are nested within the levels of another factor. For example, in a design in which a factor (factor B) has four levels and is nested within the two levels of a second factor (factor A), levels 1 and 2 of factor B would only occur in combination with level 1 of factor A and levels 3 and 4 of factor B would only be combined with level 2 of factor A. In other words, in a nested factor design, there are cells that are empty. In the described design, for example, no observations are made for the combination of level 1 of factor A and level 3 of factor B. When a factor B is nested under a factor A, this is denoted as B(A). In more complex designs, a factor can also be nested under combinations of other factors. A common example in which factors are nested is within treatments, for example, the evaluation of psychological treatments when therapists or treatment centers provide one treatment to more than one participant. Because each therapist or treatment center provides only one treatment, the provider or treatment center factor is nested under only one level of the treatment factor. Nested factor designs are also common in educational research in which classrooms of students are nested within classroom interventions. For example, researchers commonly assign whole classrooms to different levels of a classroom-intervention factor. Thus, each classroom, or cluster, is assigned to only one level of the intervention factor and is said to be nested under this factor. Ignoring a nested factor in the evaluation of a design can lead to consequences that are detrimental to the validity of statistical decisions. The main reason for this is that the observations within the levels of a nested factor are likely to not be independent of each other but related. The magnitude of this relationship can be expressed by a so-called intraclass correlation coefficient ρ_I.

The focus of this entry is on the most common nested design: the two-level nested design. This entry discusses whether nested factors are random or fixed effects and the implications of nested designs on statistical power. In addition, the criteria to determine which model to use and the consequences of ignoring nested factors are also examined.

Two-Level Nested Factor Design

The most common nested design involves two factors with a factor B nested within the levels of a second factor A. The linear structural model for this design can be given as follows:

    Y_ijk = μ + α_j + β_k(j) + ε_ijk,   (1)

where Y_ijk is the observation for the ith subject (i = 1, 2, ..., n) in the jth level of factor A (j = 1, 2, ..., p) and the kth level of the nested factor B (k = 1, 2, ..., q); μ is the grand mean, α_j is the effect for the jth treatment, β_k(j) is the effect of the kth provider nested under the jth treatment, and ε_ijk is the error of the observation (within-cell variance). Note that because factors A and B are not completely crossed, the model does not include an interaction term because it cannot be estimated separately from the error term. More generally speaking, because nested factor designs do not have as many cells as
completely crossed designs, one cannot perform all tests for main effects and interactions.

The assumptions of the nested model are that the effects of the fixed factor A sum up to zero,

    Σ_j α_j = 0,   (2)

that errors are normally distributed and have an expected value of zero,

    ε_i(jk) ∼ N(0, σ²_ε),   (3)

and that

    ε_i(jk), α_j, and β_k(j) are pairwise independent.   (4)

In the next section, the focus is on nested factor designs with two factors with one of the factors nested under the other. More complex models can be built analogously. For example, the model equation for a design with two crossed factors, A and B, and a third factor, C, nested within factor A is described by the following structural model:

    Y_ijkm = μ + α_i + β_j + γ_k(i) + (αβ)_jk(i) + ε_ijkm.   (5)

Nested Factors as Random and Fixed Effects

In experimental and quasi-experimental designs, the factor under which the second factor is nested is almost always conceptualized as a fixed factor. That is, a researcher seeks to make inferences about the specific levels of that factor. If, in contrast, the goal is to generalize to the levels of the factor that are not included in the study (the universe of levels), then a nested factor should be treated as a random factor. The assumption that the levels included in the study are representative of an underlying population requires that the levels are drawn at random from the universe of levels. Thus, nested factor levels are treated like subjects, who are also considered to be random samples from the population of all subjects. The resulting model is called a mixed model and assumes that the effects of factor B are normally distributed with a mean of zero, specifically:

    β_k(j) ∼ N(0, σ²_β).   (6)

A nested factor is correctly conceptualized as a fixed factor if a researcher only seeks to make an inference about the specific levels of the nested factor included in his or her study and if the levels included in the study are not drawn at random from a universe of levels. For example, if a researcher wants to make an inference about the specific treatment centers (nested within different treatments) that are included in her study, the nested factor is correctly modeled as a fixed effect. The corresponding assumption of the fixed model is

    Σ_k(j) β_k(j) = 0.   (7)
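The distinction can be made concrete with a small simulation of the two-level nested model. The sketch below is illustrative only: the design sizes, treatment effects, and variance components are hypothetical values, not quantities from this entry. Provider effects are drawn from a normal distribution, as the mixed model of Equation 6 assumes; a fixed-model analogue would instead use constant provider effects summing to zero within each treatment, per Equation 7.

```python
import random

random.seed(1)

# Hypothetical two-level nested design: factor B (provider) nested in factor A (treatment).
p, q, n = 2, 3, 10                 # levels of A, levels of B within each level of A, subjects per cell
mu = 50.0                          # grand mean
alpha = [+2.0, -2.0]               # fixed treatment effects; they sum to zero (Equation 2)
sigma_beta, sigma_eps = 3.0, 5.0   # assumed SDs of provider effects and of the error term

data = {}                          # (j, k) -> list of observations Y_ijk
for j in range(p):
    for k in range(q):
        beta_kj = random.gauss(0.0, sigma_beta)   # random provider effect (Equation 6)
        data[(j, k)] = [mu + alpha[j] + beta_kj + random.gauss(0.0, sigma_eps)
                        for _ in range(n)]

# Each cell (j, k) now holds n observations generated by Equation 1; observations that
# share a provider share beta_kj, which is exactly the within-level dependence the
# intraclass correlation describes.
cell_means = {cell: sum(ys) / n for cell, ys in data.items()}
print(cell_means)
```

Replacing the `random.gauss` draw for `beta_kj` with entries from a fixed list that sums to zero within each treatment turns the same sketch into the fixed-model case.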
treatment effect can be expressed as the partial effect size

    ω²_part = σ²_A / (σ²_A + σ²_within),   (10)

that is, the variance caused by factor A divided by the sum of the variance caused by the treatments and within-cell variance. The partial effect size reflects the effectiveness of the factor A independent of additional effects of the nested factor B. The effect resulting from the nested factor can be analogously defined as a partial effect size, as follows:

    ω²_B = σ²_B / (σ²_B + σ²_within) = ρ_I.   (11)

If the nested factor is modeled as random, this effect is equal to the intraclass correlation coefficient ρ_I. This means that intraclass correlations ρ_I are partial effect sizes of the nested factor B (i.e., independent of the effects of factor A). The intraclass correlation coefficient, which represents the relative amount of variation attributable to the nested factor, is also a measure of the similarity of the observations within the levels of the nested factors. It is, therefore, a measure of the degree to which the assumption of independence—required if the nested factor is ignored in the analysis and individual observations are the unit of analysis—is violated. Ignoring the nested factor if the intraclass correlation is not zero can lead to serious problems, especially to alpha inflation.

Sample Statistics

The source tables for the mixed model (factor A fixed and factor B random) and the fixed model (both factors fixed) are presented in Table 1.

Table 1  Sources of Variance and Expected Mean Squares for Nested Design: Factor A Fixed and Nested Factor B Random (Mixed Model) Versus Factor A Fixed and Nested Factor B Fixed (Fixed Model)

Source           SS    df         MS                E(MS) Mixed Model                     E(MS) Fixed Model
A                SS_A  p − 1      SS_A/(p − 1)      σ²_w + nσ²_B + [p/(p − 1)]nqσ²_A      σ²_w + [p/(p − 1)]nqσ²_A
B(A)             SS_B  p(q − 1)   SS_B/[p(q − 1)]   σ²_w + nσ²_B                          σ²_w + [q/(q − 1)]nσ²_B
Within cell (w)  SS_w  pq(n − 1)  SS_w/[pq(n − 1)]  σ²_w                                  σ²_w

Notes: The number of levels of factor A is represented by p; the number of levels of the nested factor B within each level of factor A is represented by q; and the number of subjects within each level of factor B is represented by n. df = degrees of freedom; SS = sum of squares; MS = mean square; E(MS) = expected mean square; A = factor A; B = nested factor B; w = within cell.

The main difference between the mixed and the fixed model is that in the mixed model, the expected mean square for factor A contains a term that includes the variance caused by the nested factor B (viz., nσ²_B), whereas in the fixed model, the expected mean square for treatment effects contains no such term. Consequently, in the mixed-model case, the correct denominator to calculate the test statistic for factor A is the mean square for the nested factor, namely

    F_mixed = MS_A / MS_B.   (12)

Note that the degrees of freedom of the denominator are exclusively a function of the number of levels of the factors A and B and are not influenced by the number of subjects within each cell of the design.

In the fixed-model case, the correct denominator is the mean square for within-cell variation, namely

    F_fixed = MS_A / MS_within.   (13)

Note that the within-cell variation does not include the variation resulting from the nested factor and that the degrees of freedom of the denominator are largely determined by the number of subjects.

The different ways the test statistics for the non-nested factor A are calculated reflect the different underlying model assumptions. In the mixed model, levels of the nested factor are treated as a random sample from an underlying universe of levels. Because variation caused by the levels of the nested factor sampled in a particular study will randomly vary across repetitions of a study, this variation is considered to be error. In the fixed model, it is assumed that the levels of the nested factor included in a particular study will not vary across replications of a study, and variation from the nested factor is removed from the estimated error.

Effect Size Estimates in Nested Factor Designs

The mixed and the fixed models vary with respect to how population effect sizes are estimated. First, population effects are typically not estimated by the sample effect size, namely

As a result, the population effect size of factor A can be estimated by the following formula for the nonpartial effect size:

    ω̂²_mixed = [SS_A − (p − 1)MS_B] / [SS_A − (p − 1)MS_B + pq(MS_B − MS_within) + pqn·MS_within].   (21)

Accordingly, the partial effect size for factor A can be estimated with the formula
Statistical power in nested designs depends on whether the nested factor is modeled correctly as a random or as a fixed factor. In the mixed-effects model, statistical power mainly depends on the number of levels of the nested factor, whereas power is largely independent of the number of subjects within each level of the nested factor. In fact, a mixed-model ANOVA with, for instance, a nested factor with two levels nested within the levels of a higher order factor with two levels for each treatment essentially has the statistical power of a t test with two degrees of freedom. In the mixed model, power is also negatively related to the magnitude of the effect of the nested factor. Studies with random nested factors should be designed accordingly with a sufficient number of levels of the nested factor, especially if a researcher expects large effects of the nested factor (i.e., a large intraclass correlation).

In the fixed model, statistical power is mainly determined by the number of subjects and remains largely unaffected by the number of levels of the nested factor. Moreover, the power of the fixed-model test increases with increasing nested factor effects because the fixed-effects model residualizes the F-test denominator (expected within-subject variance) for nested factor variance.

Criteria to Determine the Correct Model

Two principal possibilities exist for dealing with nested effects: Nested factors can be treated as random factors, leading to a mixed-model ANOVA, or nested factors might be treated as fixed factors, leading to a fixed-model ANOVA. There are potential risks associated with choosing the incorrect model in nested factor designs. The incorrect use of the fixed model might lead to overestimations of effect sizes and inflated Type I error rates. In contrast, the incorrect use of the mixed model might lead to serious underestimations of effect sizes and inflated Type II errors (lack of power). It is, therefore, important to choose the correct model to analyze a nested-factor design.

If the levels of a nested factor have been randomly sampled from a universe of population levels and the goal of a researcher is to generalize to this universe of levels, the mixed model has to be used. Because the generalization of results is commonly recognized as an important aim of statistical hypothesis testing, many authors emphasize that nested factors should be treated as random effects by default. Nested factors should also be treated as random if the levels of the nested factors are randomly assigned to the levels of the non-nested factor. In the absence of random sampling from a population, random assignment can be used as a basis of statistical inference. Under the random assignment model, the statistical inference can be interpreted as applying to possible rerandomizations of the subjects in the sample.

If a researcher seeks to make an inference about the specific levels of the nested factors included in the study, a fixed-effects model should be used. Any (statistical) inference made on the basis of the fixed model is restricted to the specific levels of the nested factor as they were realized in the study.

The question of which model should be used in the absence of random sampling and random assignment is debatable. Some authors argue that the mixed model should be used regardless of whether random sampling or random assignment is involved. Other authors argue that in this case, a mixed-effects model is not justified and the fixed-effects model should be used, with an explicit acknowledgement that it does not allow a generalization of the obtained results. The choice between the mixed and the fixed model is less critical if the effects of the nested factor are zero. In this case, the mixed and the fixed model reach the same conclusions when the null hypothesis is true even if the mixed model is assumed to be a valid statistical model for the study. In particular, the mixed model does not lead to inflated Type I error levels. The fixed-effects analysis, however, can have dramatically greater power when the alternative hypothesis is true. It has to be emphasized, however, that any choice between the mixed and the fixed model should not be guided by statistical power considerations alone.

Consequences of Ignoring Nested Factors

Although the choice between the two different models to analyze nested factor designs may be difficult, ignoring the nested factor is always a wrong decision. If the mixed model is the correct model and there are nested factor effects (i.e., the intraclass correlation is different from zero), then ignoring a nested factor, and thus the dependence of observations of the subjects within the
levels of the nested factor, leads to inflated Type I error rates and an overestimation of population effects. Some authors have suggested that after a preliminary test (with a liberal alpha level) shows that there are no significant nested factor effects, it is safe to remove the nested factor from the analysis. Monte Carlo studies have shown, however, that these preliminary tests are typically not powerful enough (even with a liberal alpha level) to detect meaningful nested-factor effects.

However, if the fixed-effects model correctly describes the data, ignoring the nested factor will lead to an increase in Type II error levels (i.e., a loss in statistical power) and an underestimation of population effects. Both tendencies are positively related to the magnitude of the nested factor effect.

Matthias Siemer

See also Cluster Sampling; Fixed-Effects Models; Hierarchical Linear Modeling; Intraclass Correlation; Mixed- and Random-Effects Models; Multilevel Modeling; Random-Effects Models

Further Readings

Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Siemer, M., & Joormann, J. (2003). Power and measures of effect size in analysis of variance with fixed versus random nested factors. Psychological Methods, 8, 497–517.
Wampold, B. E., & Serlin, R. C. (2000). The consequence of ignoring a nested factor on measures of effect size in analysis of variance. Psychological Methods, 5, 425–433.
Zucker, D. M. (1990). An analysis of variance pitfall: The fixed effects analysis in a nested design. Educational and Psychological Measurement, 50, 731–738.

NETWORK ANALYSIS

Network analysis elicits and models perceptions of the causes of a phenomenon. Typically, respondents are provided with a set of putative causal factors for a focal event and are asked to consider the relationships between these factors. These relationships are illustrated in a diagrammatic network consisting of nodes (i.e., causal factors) and arcs representing the relationships between nodes. The technique captures the complexities of people's cognitive representations of causal attributions for a given phenomenon. This entry discusses the history, techniques, applications, and limitations of network analysis.

History

Network analysis was developed to account for individuals' relatively complex and sophisticated explanations of human behavior. It is underpinned by the notion of a perceived causal structure, which Harold Kelley described as being implicit in the cognitive representation of attributions. The perceived causal structure constitutes a temporally ordered network of interconnected causes and effects. Properties of the structure include the following: direction (past–future), extent (proximal–distal), patterning (simple–complex), components of varying stability–instability, and features ranging from actual to potential. The structure produced might be sparse or dense in nature, depending on the number of causal factors identified. Network analysis comprises a group of techniques developed in sociology and social anthropology, and it provides a method for generating and analyzing perceived causal networks, their structural properties, and the complex chains of relationships between causes and effects.

Network Analysis Techniques

Network analysis can be conducted using semistructured interviews, diagram methods, and matrix methods. Although interviews provide detailed individual networks, difficulties arise in that individual structures cannot be combined, and causal structures of different groups cannot be compared.

The diagram method involves either the spatial arrangement of cards containing putative causes or the participant directly drawing the structure. Participants can both choose from a given set of potential causal factors and incorporate other personally relevant factors into their network. In addition, the strength of causal paths can be rated. Although these methods have the virtue of
ensuring only the most important causal links are elicited, they might potentially oversimplify respondents' belief structures, often revealing only sparse networks.

The matrix technique employs an adjacency grid with the causes of a focal event presented vertically and horizontally along its top and side. Participants rate the causal relationship for every pairwise combination. Early studies used a binary scale to indicate the presence/absence of causal links; however, this method does not reveal the strength of the causal links. Consequently, recent studies have used Likert scales whereby participants rate the strength of each causal relationship. A criterion is applied to these ratings to establish which of the resulting causal links should be regarded as consensually endorsed and, therefore, contributing to the network.

Early studies adopted a minimum systems criterion (MSC), the value at which all causes are included in the system, to determine the network nodes. Accordingly, causal links are added hierarchically to the network, in the order of mean strength, until the MSC is reached. It is generally accompanied by the cause-to-link ratio, which is the ratio of the number of extra links required to

investigation of all possible links, and as it does not rely on participants' recall, it would be expected to produce more reliable results.

Applications

Network analysis has been applied to diverse areas to analyze belief structures. Domains that have been examined include lay understandings of social issues (e.g., loneliness, poverty), politics (e.g., the 2nd Iraq war, September 11th), and more recently illness attributions for health problems (e.g., work-based stress, coronary heart disease, lower back pain, and obesity).

The hypothetical network (Figure 1) illustrates some properties of network analysis. For example, the illness is believed to have three causes: stress, smoking, and family history. Both stress and smoking are proximal causes, whereas family history is a more distal cause. In addition to a direct effect of stress, the network shows a belief that stress also has an indirect effect on illness, as stress causes smoking. Finally, there is a reciprocal relationship (bidirectional arrow) between the illness and stress, such that stress causes the illness, and having the illness causes stress.
include a new cause in the network. Network con-
struction stops if this requirement is too high,
Limitations
reducing overall endorsement of the network. An
alternative criterion is inductive eliminative analysis There are several unresolved issues regarding the
(IEA), wherein every network produced when establishment of networks and the selection of cut-
working toward the MSC is checked for endorse- off points. Comparative network analysis studies
ment. Originally developed to deal with binary are necessary to compare and evaluate the differ-
adjacency matrices, networks were deemed con- ential effectiveness of the individual network anal-
sensual if endorsed by at least 50% of partici- ysis methods. The criteria for selection of cut-off
pants. However, the introduction of Likert scales points for the network, such as the MSC and cause
necessitated a modified form of IEA, whereby an to link, have also been criticized as atheoretical,
item average criterion (IAC) was adopted. The producing extremely large networks that represent
mean strength of a participant’s endorsement of an aggregate rather than a consensual solution.
all items on a network must be above the IAC, Although IEA resolves some of these issues,
which is usually set at 3 or 4 on a 5-point scale,
depending on the overall link strength. In early
Family
research, the diagrammatic networks produced
History
using these methods were topological, not spa-
Smoking
tial. However, recent studies have subjected the
matrices of causal ratings to multidimensional Stress
scaling analysis to determine the spatial structure Illness
of networks. Thus, proximal and distal effects
can be easily represented. The matrix method
has the advantage of ensuring the exhaustive Figure 1 Example of Network Diagram
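The matrix technique and the strength criterion described above can be sketched in a few lines of Python. Everything here is illustrative: the cause labels, the participants' ratings, and the cut-off value are hypothetical, and the criterion is a simplified mean-strength threshold rather than a full implementation of the MSC or IAC procedures.

```python
# Sketch of the matrix technique: each participant rates the causal link for
# every ordered pair of putative causes on a 5-point Likert scale; links whose
# mean rating exceeds a criterion (here 3, a simplified stand-in for the item
# average criterion) enter the consensual network. All data are hypothetical.
causes = ["stress", "smoking", "family history", "illness"]

# One adjacency grid per participant: ratings[i][j] = strength of "cause i -> cause j"
participant_ratings = [
    [[0, 4, 0, 5], [0, 0, 0, 4], [0, 0, 0, 3], [5, 0, 0, 0]],
    [[0, 5, 0, 4], [0, 0, 0, 5], [0, 0, 0, 4], [4, 0, 0, 0]],
    [[0, 3, 0, 5], [0, 0, 0, 4], [0, 0, 0, 2], [3, 0, 0, 0]],
]

criterion = 3  # links with mean strength above this value are retained

n = len(causes)
network = []
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        mean_strength = sum(r[i][j] for r in participant_ratings) / len(participant_ratings)
        if mean_strength > criterion:
            network.append((causes[i], causes[j], mean_strength))
```

With these hypothetical ratings the retained links include the reciprocal pair stress → illness and illness → stress, mirroring the bidirectional arrow discussed for Figure 1, while the weaker family history → illness link falls at the criterion and is excluded.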
Amy Brogan and David Hevey

See also Cause and Effect; Graphical Display of Data; Likert Scaling

Further Readings

Kelley, H. H. (1983). Perceived causal structures. In J. Jaspars, F. D. Fincham, & M. Hewstone (Eds.), Attribution theory and research: Conceptual, developmental and social dimensions (pp. 343–369). London: Academic Press.
Knoke, D., & Kuklinski, J. H. (1982). Network analysis. Beverly Hills, CA: Sage.

NEWMAN–KEULS TEST AND TUKEY TEST

… Tukey test is most commonly used in other disciplines. An advantage of the Tukey test is to keep the level of the Type I error (i.e., finding a difference when none exists) equal to the chosen alpha level (e.g., α = .05 or α = .01). An additional advantage of the Tukey test is to allow the computation of confidence intervals for the differences between the means. Although the Newman–Keuls test has more power than the Tukey test, the exact value of the probability of making a Type I error of the Newman–Keuls test cannot be computed because of the sequential nature of this test. In addition, because the criterion changes for each level of the Newman–Keuls test, confidence intervals cannot be computed around the differences between means. Therefore, selecting whether to use the Tukey or Newman–Keuls test depends on whether additional power is required to detect significant differences between means.
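The Tukey logic just described, one fixed criterion for every pair, can be sketched as follows. The group means, MS_error = 80, 10 observations per group, and the critical value 4.04 are taken from the worked example in this entry; treating them as inputs here is an assumption for illustration.

```python
import math

def tukey_pairwise(means, ms_error, n_per_group, q_critical):
    """Compare every pair of group means against ONE critical value of the
    Studentized range statistic q; using a single criterion for all pairs is
    what keeps the familywise Type I error at the chosen alpha level."""
    se = math.sqrt(ms_error / n_per_group)
    decisions = {}
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            q_obs = abs(means[j] - means[i]) / se
            decisions[(i, j)] = q_obs > q_critical
    return decisions

# Five group means with MS_error = 80 and S = 10, q_critical(5), alpha = .05 is 4.04
result = tukey_pairwise([30, 35, 38, 41, 46], 80, 10, 4.04)
```

Only the largest difference (16 points, q = 5.66) exceeds 4.04, so the Tukey test rejects a single comparison here, which is consistent with the remark that the Newman–Keuls test has more power.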
where N is the total number of participants and K is the number of groups, and on a parameter R, which is the number of means being tested. For example, in a group of K = 5 means ordered from smallest to largest,

    M1 < M2 < M3 < M4 < M5,

R = 5 when comparing M5 with M1; however, R = 3 when comparing M3 with M1.

… difference implies not rejecting the null hypothesis for any other difference. If the null hypothesis is rejected for the largest difference, the two differences with a range of A − 1 are examined. These means will be tested with R = A − 1. When the null hypothesis for a given pair of means cannot be rejected, none of the differences included in that difference will be tested. If the null hypothesis is rejected, then the procedure is reiterated for a range of A − 2 (i.e., R = A − 2). The procedure is reiterated until all means have been tested or have been declared nonsignificant by implication.

It takes some experience to determine which comparisons are implied by other comparisons. Figure 1 describes the structure of implication for a set of 5 means numbered from 1 (the smallest) to 5 (the largest). The pairwise comparisons implied by another comparison are obtained by following the arrows. When the null hypothesis cannot be rejected for one pairwise comparison, then all the comparisons included in it are crossed out so that they are not tested.

F range

Some statistics textbooks refer to a pseudo-F distribution called the "F range" or "F_range," rather than the Studentized q distribution. The F_range can be computed easily from q using the following formula:

    F_range = q^2 / 2.    (2)

Tukey Test

… the table so 40 is used instead). The q_critical(5), α = .05 is equal to 4.04 and the q_critical(5), α = .01 is equal to 4.93. When performing pairwise comparisons, it is customary to report the table of differences between means with an indication of their significance (e.g., one star meaning significant at the .05 level, and two stars meaning significant at the .01 level). This is shown in Table 4.

Newman–Keuls Test

Note that for the Newman–Keuls test, the group means are ordered from the smallest to the largest. The test starts by evaluating the largest difference, which corresponds to the difference between M1 and M5 (i.e., "contact" and "smash"). For α = .05, R = 5 and ν = N − K = 45 degrees of freedom, the critical value of q is 4.04 (using the ν value of 40 in the table). This value is denoted as q_critical(5) = 4.04. The q_observed is computed from Equation 1 (see also Table 3) as

    q_observed = (M5 − M1) / sqrt(MS_error × (1/S)) = 5.66.    (3)

Now we proceed to test the means with a range of 4, namely the differences (M4 − M1) and (M5 − M2). With α = .05, R = 4 and 45 degrees of freedom, q_critical(4) = 3.79. Both differences are declared significant at the .05 level [q_observed(4) = 3.89 in both cases]. We then proceed to test the comparisons with a range of 3. The value of q_critical is now 3.44. The differences (M3 − M1) and (M5 − M3), both with a q_observed of 2.83, are declared nonsignificant. Furthermore, the difference (M4 − M2), with a q_observed of 2.12, is also declared nonsignificant. Hence, the null hypothesis for these differences cannot be rejected, and all comparisons implied by these differences should be crossed out. That is, we do not test any difference with a range of A − 3 [i.e., (M2 − M1), (M3 − M2), (M4 − M3), and (M5 − M4)]. Because the comparisons with a range of 3 have already been tested and found to be nonsignificant, any comparisons with a range of 2 will consequently be declared nonsignificant as they are implied or included in the range of 3 (i.e., the test has been performed implicitly).

As for the Tukey test, the results of the Newman–Keuls tests are often presented with the values of the pairwise differences between the means and with stars indicating the significance level (see Table 5).

Figure 2  Newman–Keuls Test for the Data from a Replication of Loftus & Palmer (1974)
Note: The number below each range is the q_observed for that range.
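The sequential procedure worked through above can be sketched in Python. The critical values for ranges 5, 4, and 3 are the ones given in the text; the range-2 value (2.86) is an assumed table entry (it is never reached in this example), and MS_error = 80 with S = 10 is inferred so that the q values match those reported (e.g., 16 / sqrt(8) = 5.66). The implication rule is implemented by blocking every comparison nested inside a nonsignificant one.

```python
import math

def newman_keuls(means, ms_error, n_per_group, q_crit):
    """Newman-Keuls: test ranges from largest to smallest; a nonsignificant
    comparison implies nonsignificance for every comparison nested inside it.
    q_crit maps a range (number of means spanned) to its critical q value."""
    order = sorted(range(len(means)), key=lambda i: means[i])
    k = len(means)
    se = math.sqrt(ms_error / n_per_group)
    blocked = set()          # positions (i, j) in the ordered sequence not to test
    significant = []
    for r in range(k, 1, -1):                 # r = range of the comparison
        for i in range(k - r + 1):
            j = i + r - 1
            if (i, j) in blocked:
                continue
            q_obs = (means[order[j]] - means[order[i]]) / se
            if q_obs > q_crit[r]:
                significant.append((order[i], order[j]))
            else:
                # cross out all comparisons included in this nonsignificant span
                for a in range(i, j + 1):
                    for b in range(a + 1, j + 1):
                        blocked.add((a, b))
    return significant

# Means for contact, hit, bump, collide, smash; critical values per range
sig = newman_keuls([30, 35, 38, 41, 46], 80, 10,
                   {5: 4.04, 4: 3.79, 3: 3.44, 2: 2.86})
```

Running this reproduces the worked example: the pairs (contact, smash), (contact, collide), and (hit, smash) are declared significant, and every range-2 comparison is crossed out by implication without ever being tested.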
Table 5  Presentation of the Results of the Newman–Keuls Test for the Data from Table 2

Experimental Group
                  M1 = 30   M2 = 35   M3 = 38   M4 = 41   M5 = 46
                  Contact   Hit       Bump      Collide   Smash
M1 = 30 Contact   0         5.00 ns   8.00 ns   11.00*    16.00**
M2 = 35 Hit                 0         3.00 ns   6.00 ns   11.00*
M3 = 38 Bump                          0         3.00 ns   8.00 ns
M4 = 41 Collide                                 0         5.00 ns
M5 = 46 Smash                                             0

Notes: *p < .05. **p < .01.

The comparison of Table 5 and Table 4 confirms that the Newman–Keuls test is more powerful than the Tukey test.

Hervé Abdi and Lynne J. Williams

See also Analysis of Variance (ANOVA); Bonferroni Procedure; Holm's Sequential Bonferroni Procedure; Honestly Significant Difference (HSD) Test; Multiple Comparison Tests; Pairwise Comparisons; Post Hoc Comparisons; Scheffé Test

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Dudoit, S., & van der Laan, M. (2008). Multiple testing procedures with applications to genomics. New York: Springer-Verlag.
Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. New York: Wiley.
Jaccard, J., Becker, M. A., & Wood, G. (1984). Pairwise multiple comparison procedures: A review. Psychological Bulletin, 94, 589–596.

NOMINAL SCALE

A nominal scale is a scale of measurement used to assign events or objects into discrete categories. This form of scale does not require the use of numeric values or categories ranked by class, but simply unique identifiers to label each distinct category. Often regarded as the most basic form of measurement, nominal scales are used to categorize and analyze data in many disciplines. Historically identified through the work of psychophysicist Stanley Stevens, use of this scale has shaped research design and continues to impact current research practice. This entry presents key concepts, Stevens's hierarchy of measurement scales, and an example demonstrating the properties of the nominal scale.

Key Concepts

The nominal scale, which is often referred to as the unordered categorical or discrete scale, is used to assign individual data into categories. Categories in the nominal scale are mutually exclusive and collectively exhaustive. They are mutually exclusive because the same label is not assigned to different categories and different labels are not assigned to events or objects of the same category. Categories in the nominal scale are collectively exhaustive because they encompass the full range of possible observations so that each event or object can be categorized. The nominal scale holds two additional properties. The first property is that all categories are equal. Unlike in other scales, such as ordinal, interval, or ratio scales, categories in the nominal scale are not ranked. Each category has a unique identifier, which might or might not be numeric, which simply acts as a label to distinguish categories. The second property is that the nominal scale is invariant under any transformation or operation that preserves the relationship between individuals and their identifiers.

Some of the most common types of nominal scales used in research include sex (male/female), marital status (married or common-law/widowed/divorced/never-married), town of residence, and questions requiring binary responses (yes/no).

Stevens's Hierarchy

In the mid-1940s, Harvard psychophysicist Stanley Stevens wrote the influential article "On the Theory of Scales of Measurement," published in Science in 1946.
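The invariance property noted under Key Concepts, that any one-to-one relabeling of categories preserves the classification, can be illustrated with a short sketch. The student names and codes below are hypothetical; the relabeling mirrors the attendance example used later in this entry.

```python
# The nominal scale is invariant under any one-to-one relabeling: replacing
# the codes 1/2 with "present"/"absent" preserves which individuals fall in
# the same category. Names and codes are hypothetical.
attendance = {"ann": 1, "bob": 2, "cal": 1}    # 1 = in attendance, 2 = absent

relabel = {1: "present", 2: "absent"}          # one-to-one transformation
relabeled = {name: relabel[code] for name, code in attendance.items()}

def partition(assignments):
    """Group individuals by category; this partition is what the scale encodes."""
    groups = {}
    for name, cat in assignments.items():
        groups.setdefault(cat, set()).add(name)
    return set(frozenset(g) for g in groups.values())

# The induced grouping of individuals is identical before and after relabeling.
same = partition(attendance) == partition(relabeled)
```

Because only the partition of individuals carries information, operations such as averaging the codes (1s and 2s) would not survive this relabeling, which is why Stevens restricted the permissible statistics as described below.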
In this article, Stevens described a hierarchy of measurement scales that includes nominal, ordinal, interval, and ratio scales. Based on basic empirical operations, mathematical group structure, and statistical procedures deemed permissible, this hierarchy has been used in textbooks worldwide and continues to shape the statistical reasoning used to guide the design of statistical software packages today.

Under Stevens's hierarchy, the primary, and arguably only, use for nominal scales is to determine equality, that is, to determine whether the object of interest falls into the category of interest by possessing the properties identified for that category. Stevens argued that no other determinations were permissible, whereas others argued that even though other determinations were permissible, they would, in effect, be meaningless. A less argued property of the nominal scale is that it is invariant under any transformation. When taking attendance in a classroom, for example, those in attendance might be assigned 1, whereas those who are absent might be assigned 2. This nominal scale could be replaced by another nominal scale, where "1" is replaced by the label "present" and "2" is replaced by the label "absent." The transformation is considered invariant because the identity of each individual is preserved. Given the limited determinations deemed permissible, Stevens proposed a restriction on analysis for nominal scales. Only basic statistics are deemed permissible or meaningful for the nominal scale, including frequency, mode as the sole measure of central tendency, and contingency correlation. Despite much criticism during the past 50 years, statistical software developed during the past decade has sustained the use of Stevens's terminology and permissibility in its architecture.

Example: Attendance in the Classroom

Again, attendance in the classroom can serve as an example to demonstrate some properties of the nominal scale. After taking attendance, the information has been recorded in the class list as illustrated in Table 1.

Table 1  Class List for Attendance on May 1

Student ID   Arrives by School Bus   Attendance on May 1
001          Yes                     Absent
002          Yes                     Absent
003          Yes                     Present
004          Yes                     Absent
005          Yes                     Absent
006          Yes                     Absent
007          Yes                     Absent
008          Yes                     Absent
009          Yes                     Absent
010          No                      Present
011          No                      Present
012          No                      Present
013          No                      Present
014          No                      Absent
015          No                      Present

The header row denotes the names of the variables to be categorized, and each row contains an individual student record. Student 001, for example, uses the school bus and is absent on the day in question. An appropriate nominal scale to categorize class attendance would involve two categories: absent or present. Note that these categories are mutually exclusive (a student cannot be both present and absent), collectively exhaustive (the categories cover all possible observations), and each is equal in value.

Permissible statistics for the attendance variable would include frequency, mode, and contingency correlation. Using the previously provided class list, the frequency of those present is 6 and those absent is 9. The mode, or the most common observation, is "absent." Contingency tables could be constructed to answer questions about the population. If, for example, a contingency table was used to classify students using the two variables attendance and arrives by school bus, then Table 2 could be constructed.

Table 2  Contingency Table for Attendance and Arrives by School Bus

                          Attendance
                   Absent   Present   Total
Arrives by   Yes   8        1         9
school bus   No    1        5         6
Total              9        6         15
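The permissible statistics for these data (frequency, mode, and the contingency table) can be computed directly from the class list; the sketch below rebuilds the counts behind Tables 1 and 2.

```python
from collections import Counter

# Class list from Table 1: (student_id, arrives_by_bus, attendance)
records = [
    ("001", "yes", "absent"), ("002", "yes", "absent"), ("003", "yes", "present"),
    ("004", "yes", "absent"), ("005", "yes", "absent"), ("006", "yes", "absent"),
    ("007", "yes", "absent"), ("008", "yes", "absent"), ("009", "yes", "absent"),
    ("010", "no", "present"), ("011", "no", "present"), ("012", "no", "present"),
    ("013", "no", "present"), ("014", "no", "absent"), ("015", "no", "present"),
]

attendance = Counter(att for _, _, att in records)   # frequencies: 9 absent, 6 present
mode = attendance.most_common(1)[0][0]               # most common category

# Contingency table (Table 2): rows = bus yes/no, columns = absent/present
table = {(bus, att): 0 for bus in ("yes", "no") for att in ("absent", "present")}
for _, bus, att in records:
    table[(bus, att)] += 1
```

Note that every statistic here depends only on category membership, never on any ordering or arithmetic on the labels, in keeping with Stevens's restriction.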
The results of the Fisher's exact test for contingency table analysis show that those who arrive by school bus were significantly more likely to be absent than those who arrive by some other means. One might then conclude that the school bus was late.

Deborah J. Carr

See also Chi-Square Test; Frequency Table; Mode; "On the Theory of Scales of Measurement"; Ordinal Scale

Further Readings

Duncan, O. D. (1984). Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100(3), 398–407.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47, 65–72.

NOMOGRAMS

Nomograms are graphical representations of equations that predict medical outcomes. Nomograms use a points-based system whereby a patient accumulates points based on levels of his or her risk factors. The cumulative points total is associated with a prediction, such as the predicted probability of treatment failure in the future. Nomograms can improve research design, and well-designed research is crucial for the creation of accurate nomograms. Nomograms are important to research design because they can help identify the characteristics of high-risk patients while highlighting which interventions are likely to have the greatest treatment effects. Nomograms have demonstrated better accuracy than both risk grouping systems and physician judgment. This improved accuracy should allow researchers to design intervention studies that have greater statistical power by targeting the enrollment of patients with the highest risk of disease. In addition, nomograms rely on well-designed studies to validate the accuracy of their predictions.

Deriving Outcome Probabilities

All medical decisions are based on the predicted probability of different outcomes. Imagine a 35-year-old patient who presents to a physician with a 6-month history of cough. A doctor in Chicago might recommend a test for asthma, which is a common cause of chronic cough. If the same patient presented to a clinic in rural Africa, the physician might be likely to test for tuberculosis. Both physicians might be making sound recommendations based on the predicted probability of disease in their locale. These physicians are making clinical decisions based on the overall probability of disease in the population. These types of decisions are better than arbitrary treatment but treat all patients the same.

A more sophisticated method for medical decision making is risk stratification. Physicians will frequently assign patients to different risk groups when making treatment decisions. Risk group assignment will generally provide better predicted probabilities than estimating risk according to the overall population. In the previous cough example, a variety of other factors might impact the predicted risk of tuberculosis (e.g., fever, exposure to tuberculosis, and history of tuberculosis vaccine) that physicians are trained to explore. Most risk stratification performed in clinical practice is based on rough estimates that simply order patients into levels of risk, such as high risk, medium risk, or low risk. Nomograms provide precise probability estimates that generally make more accurate assessments of risk.

A problem with risk stratification arises when continuous variables are turned into categorical variables. Physicians frequently commit dichotomized cutoffs of continuous laboratory values to memory to guide clinical decision making. For example, blood pressure cutoffs are used to guide treatment decisions for hypertension. Imagine a new blood test called serum marker A. Research shows that tuberculosis patients with serum marker A levels greater than 50 are at an increased risk for dying from tuberculosis.
Figure 1  Hypothetical Nomogram for Predicting Mortality From Tuberculosis. Axes: Points (0–100); Fever (yes/no); Age (10–80); Cough (yes/no); Hemoptysis (yes/no); Marker A (0–100); History of Tb Vaccine (yes/no); Intubated (yes/no); Total Points (0–400); Mortality Probability (0.01–.975).

INSTRUCTIONS: Locate the tic mark associated with the value of each predictor variable. Use a straight edge to find the corresponding points on the top axis for each variable. Calculate the total points by summing the individual points for all of the variables. Draw a vertical line from the value on the total points axis to the bottom axis in order to determine the hypothetical mortality probability from tuberculosis.

In reality, patients with a value of 51 might have similar risks compared with patients with a value of 49. In contrast, a patient with a value of 49 would be considered to have the same low risk as a patient whose serum level of marker A is 1. Nomograms allow for predictor variables to be maintained as continuous values while allowing numerous risk factors to be considered simultaneously. In addition, more complex models can be constructed that account for interactions.

Figure 1 illustrates a hypothetical nomogram designed to predict the mortality probability for patients with tuberculosis. The directions for using the nomogram are contained in the legend. One glance at the nomogram allows the user to determine quickly which predictors have the greatest potential impact on the probability. Fever has a relatively short axis and can contribute less than 25 possible points. In contrast, whether the patient required intubation has a much greater possible impact on the predicted probability of mortality.

Nomograms like the one shown in Figure 1 are created from the coefficients obtained by the statistical model (e.g., logistic regression or Cox proportional hazards regression) and are only as precise as the paper graphics. However, the coefficients used to create the paper-based nomogram can be used to calculate the exact probability. Similarly, the coefficients can be plugged into a Microsoft Excel spreadsheet or built into a computer interface that will automatically calculate the probability based on the user inputs.

Why Are Nomograms Important to Research Design?

Nomograms provide an improved ability to identify the correct patient population for clinical studies.
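The coefficient-based calculation described above, plugging the model's coefficients into a program rather than reading the paper axes, can be sketched with a logistic model. Every coefficient and patient value below is hypothetical and chosen only to mirror the tuberculosis example; none comes from a fitted model.

```python
import math

# Hypothetical logistic-regression coefficients for the tuberculosis example;
# the predictor names and numbers are illustrative assumptions.
coef = {"intercept": -4.0, "fever": 0.8, "age": 0.03, "cough": 0.5,
        "marker_a": 0.02, "tb_vaccine": -0.6, "intubated": 2.2}

def predicted_probability(patient):
    """Linear predictor -> probability via the logistic (inverse-logit) link,
    keeping continuous predictors such as marker A continuous."""
    z = coef["intercept"] + sum(coef[k] * v for k, v in patient.items())
    return 1.0 / (1.0 + math.exp(-z))

# A 35-year-old with cough, fever, and a marker A level of 51
p = predicted_probability({"fever": 1, "age": 35, "cough": 1,
                           "marker_a": 51, "tb_vaccine": 0, "intubated": 0})
```

Because marker A enters as a continuous term, a patient with a level of 49 and one with 51 receive nearly identical probabilities, in contrast to the dichotomized cutoff discussed earlier.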
The statistical power in prospective studies with dichotomous clinical outcomes is derived from the number of events. (Enrolling excessive numbers of patients who do not develop the event of interest is an inefficient use of resources.) For instance, let us suppose that a new controversial medication has been developed for treating patients with tuberculosis. The medication shows promise in animal studies, but it also seems to carry a high risk of serious toxicity and even death in some individuals. Researchers might want to determine whether the medication improves survival in some patients with tuberculosis. The nomogram in Figure 1 could be used to identify which patients are at highest risk of dying, are most likely to benefit from the new drug, and therefore should be enrolled in a drug trial. The nomogram could also be tested in this fashion using a randomized clinical trial design. One arm of the study could be randomized to usual care, whereas the treatment arm is randomized to use the Tb nomogram and then to receive the usual care if the risk of mortality is low or the experimental drug if the risk of mortality is high.

Validation

The estimated probability obtained from nomograms like the one in Figure 1 is generally much more accurate than rough probabilities obtained by risk stratification and should help both patients and physicians make better treatment decisions. However, nomograms are only as good as the data that were used for their creation. But predicted probabilities can be graded (validated) on their ability to discriminate between pairs of patients who have different outcomes (discordant pairs). The grading can be performed using either a validation data set that was created with the same database used to create the prediction model (internal validation) or with external data (external validation). Ideally, a nomogram should be validated in an external database before it is widely used in heterogeneous patient populations.

A validation data set using the original data can be created either with the use of bootstrapping or by dividing the data set into random partitions. In the bootstrap method, a random patient is selected and a copy of the patient's data is added to the validation data set. The patient's record is maintained in the original data set and is available for subsequent random selection. The random selection of patients is continued until a data set that is the same size as the original data has been formed. The model is applied (i.e., fit) to the bootstrap data and the model is graded on its ability to accurately predict the outcome of patients in either the original data (apparent accuracy) or the bootstrap sample (unbiased accuracy). Alternatively, the original data can be partitioned randomly. The model is fit to only a portion of the original data and the outcome is predicted in the remaining subset. The bootstrap method has the added benefit that the sample size used for the model fitting is not reduced.

Evaluating Model Accuracy

As mentioned previously, the models' predictions are evaluated on their ability to discriminate between pairs of discordant patients (patients who had different outcomes). The resultant evaluation is called a concordance index or c-statistic. The concordance index is simply the proportion of the time that the model accurately assigns a higher risk to the patient with the outcome. The c-statistic can vary from 0.50 (equivalent to the flip of a coin) to 1.0 (perfect discrimination). The c-statistic provides an objective method for evaluating model accuracy, but the minimum c-statistic needed to claim that a model has good accuracy depends on the specific condition and is somewhat subjective. However, models are generally not evaluated in isolation. Models can be compared head-to-head either with one another or with physician judgment. In this case, the most accurate model can generally be identified as the one with the highest concordance index.

However, to grade a model fully, it is also necessary to determine a model's calibration. Calibration is a measure of how closely a model's prediction compares with the actual outcome and is frequently displayed by plotting the predicted probability (or value) versus the actual proportion with the outcome (or actual value). The concordance index is simply a "rank" test that orders patients according to risk. A model can theoretically have a great concordance index but poor calibration. For instance, a model might rank patients …
Conclusion

Designing efficient clinical research, especially when designing prospective studies, relies on accurate predictions of the possible outcomes. Nomograms provide an opportunity for researchers to easily identify the target population that will be predicted to have the highest incidence of events and will therefore keep the necessary sample size low. Paper-based nomograms provide an excellent medium for easily displaying risk probabilities and do not require a computer or calculator. The coefficients used to construct the nomogram can be used to create a computer-based prediction tool. However, nomograms are only as good as the data that were used in their creation, and no nomogram can provide a perfect prediction. Ultimately, the best evaluation of a nomogram is made by validating the prediction accuracy of a nomogram on an external data set and comparing the concordance index with another prediction method that was validated using the same data.

The validation of nomograms provides another opportunity for research design. Prospective studies that collect all the predictor variables needed to calculate a specific nomogram are ideal for determining a nomogram's accuracy. In addition, more randomized controlled trials that compare nomogram-derived treatment recommendations versus standard of care are needed to promote the use of nomograms in medicine.

Brian J. Wells and Michael Kattan

See also Decision Rule; Evidence-Based Decision Making; Probability, Laws of

Further Readings

Harrell, F. E., Jr. (1996). Multivariate prognostic models: Issues in developing models, evaluating assumptions and accuracy, and measuring and predicting errors. Statistics in Medicine, 15, 361.
Harrell, F. E., Jr., Califf, R. M., Pryor, D. B., Lee, K. L., & Rosati, R. A. (1982). Evaluating the yield of medical tests. Journal of the American Medical Association, 247, 2543–2546.

NONCLASSICAL EXPERIMENTER EFFECTS

Experimenter effects denote effects in which an outcome seems to be a result of an experimental intervention but is actually caused by conscious or unconscious effects the experimenter has on how data are produced or processed. This could be through inadvertently measuring one group differently from another, treating a group of people or animals that are known to receive or to have received the intervention differently compared with the control group, or biasing the data otherwise. Normally, such processes happen inadvertently because of expectation and because participants sense the desired outcome in some way and hence comply or try to please the experimenter. Control procedures, such as blinding (keeping participants and/or experimenters unaware of a study's critical aspects), are designed to keep such effects at bay. Whenever the channels by which such effects are transmitted are potentially known or knowable, the effect is known as a classical experimenter effect. Such effects normally operate through the known senses and very often by subliminal perception. If an experiment is designed to exclude such classical channels of information transfer, because it is testing some claims of anomalous cognition, and such differential effects of experimenters still happen, then these effects are called nonclassical experimenter effects, because there is no currently accepted model to understand how such effects might have occurred in the first place.

Empirical Evidence

This effect has been known in parapsychological research for a while. Several studies reported that parapsychological effects were found in some studies, whereas in other studies with the same experimental procedure, the effects were not shown. A
908 Nonclassical Experimenter Effects
well-known experiment that has shown such a nonclassical experimenter effect is one where a parapsychological researcher who had previously produced replicable results with a certain experimental setup invited a skeptical colleague into her laboratory to replicate the experiment with her. They ran the same experiment together; half of the subjects were introduced to the experimental procedures by the enthusiastic experimenter and half by the skeptical experimenter. The experimental task was to influence a participant's arousal remotely, measured by electrodermal activity, via intention only, according to a random sequence. The two participants were separated from each other and housed in shielded chambers. Otherwise, all procedures were the same. Although the enthusiastic researcher could replicate the previous results, the skeptical researcher produced null results. This finding occurred even though there was no way of transferring the information in the experiment itself. This result was replicated in another study in the skeptical researcher's laboratory, where again the enthusiastic researcher could replicate the findings but the skeptic could not.

There are also several studies reported in which more than one experimenter interacted with the participants. If these studies are evaluated separately for each experimenter, it can be shown that some experimenters find consistently significant results whereas others do not. These are not merely exploratory findings, because some of these studies could be repeated with the experimenter effects hypothesized in advance.

Another experimental example is the so-called memory-of-water effect, where Jacques Benveniste, a French immunologist, claimed that water mixed with an immunogenic substance and successively diluted in steps to a point where no original molecules were present would still have a measurable effect. Blinded experiments produced some results, sometimes replicable and sometimes not. Later, he claimed that such effects can also be digitized, recorded, and played back via a digital medium. A definitive investigation could show that these effects happened only when one particular experimenter was present who was known to be indebted to Benveniste and wanted the experiments to work. Although a large group of observers with specialists from different disciplines was present, there was no indication how the individual in question could have potentially biased this blinded system, although such tampering, and hence a classical experimenter effect, could not be excluded.

The nonclassical experimenter effect has been shown repeatedly in parapsychological research. The source of this effect is unclear. If the idea behind parapsychology that intention can affect physical systems without direct interaction is at all sensible and worth any consideration, then there is no reason why the intention of an experimenter should be left out of an experimental system in question. Furthermore, one could argue that if the intention of an experimental participant can affect a system without direct interaction, then the intention of the experimenter could do the same. Strictly speaking, any nonclassical experimenter effect defies experimental control and calls into question the concept of experimental control.

Theoretical Considerations

When it comes to understanding such effects, they are probably among the strongest empirical facts pointing to a partially constructivist view of the world, a view that is also embedded in some spiritual worldviews such as Buddhist, Vedanta, or other mystical concepts. Here, our mental constructs, intentions, thoughts, and wishes are not only reflections of the world or idle mental operations that might affect the world indirectly by being responsible for our future actions but could also be viewed as constituents and creators of reality itself. This is difficult to understand within the accepted scientific framework of the world. Hence, such effects and a constructivist concept of reality also point to the limits of the validity of our current worldview. For such effects to be scientifically viable concepts, researchers need to envisage a world in which mental and physical acts can interact with each other directly. Such effects make us aware of the fact that we constantly partition our world into compartments and pieces that are useful for certain purposes, for instance, for the purpose of technical control, but do not necessarily describe reality as such. In this sense, they remind us of the constructivist basis of science and the whole scientific enterprise.

Harald Walach and Stefan Schmidt
See also Experimenter Expectancy Effect; Hawthorne Effect; Rosenthal Effect

Further Readings

Collins, H. M. (1985). Changing order. Beverly Hills, CA: Sage.

Jonas, W. B., Ives, J. A., Rollwagen, F., Denman, D. W., Hintz, K., Hammer, M., et al. (2006). Can specific biological signals be digitized? FASEB Journal, 20, 23–28.

Kennedy, J. E., & Taddonio, J. L. (1976). Experimenter effects in parapsychological research. Journal of Parapsychology, 40, 1–33.

Palmer, J. (1997). The challenge of experimenter psi. European Journal of Parapsychology, 13, 110–125.

Smith, M. D. (2003). The role of the experimenter in parapsychological research. Journal of Consciousness Studies, 10, 69–84.

Walach, H., & Schmidt, S. (1997). Empirical evidence for a non-classical experimenter effect: An experimental, double-blind investigation of unconventional information transfer. Journal of Scientific Exploration, 11, 59–68.

Watt, C. A., & Ramakers, P. (2003). Experimenter effects with a remote facilitation of attention focusing task: A study with multiple believer and disbeliever experimenters. Journal of Parapsychology, 67, 99–116.

Wiseman, R., & Schlitz, M. (1997). Experimenter effects and the remote detection of staring. Journal of Parapsychology, 61, 197–208.

NONDIRECTIONAL HYPOTHESES

A nondirectional hypothesis is a type of alternative hypothesis used in statistical significance testing. For a research question, two rival hypotheses are formed. The null hypothesis states that there is no difference between the variables being compared or that any difference that does exist can be explained by chance. The alternative hypothesis states that an observed difference is likely to be genuine and not likely to have occurred by chance alone. Sometimes called a two-tailed test, a test of a nondirectional alternative hypothesis does not state the direction of the difference; it indicates only that a difference exists. In contrast, a directional alternative hypothesis specifies the direction of the tested relationship, stating that one variable is predicted to be larger or smaller than the null value, but not both. Choosing a nondirectional or directional alternative hypothesis is a basic step in conducting a significance test and should be based on the research question and prior study in the area. The designation of a study's hypotheses should be made prior to analysis of data and should not change once analysis has been implemented.

For example, in a study examining the effectiveness of a learning strategies intervention, a treatment group and a control group of students are compared. The null hypothesis states that there is no difference in mean scores between the two groups. The nondirectional alternative hypothesis states that there is a difference between the mean scores of the two groups but does not specify which group is expected to be larger or smaller. In contrast, a directional alternative hypothesis might state that the mean of the treatment group will be larger than the mean of the control group. The null and the nondirectional alternative hypothesis could be stated as follows:

Null Hypothesis: $H_0: \mu_1 - \mu_2 = 0$.

Nondirectional Alternative Hypothesis: $H_1: \mu_1 - \mu_2 \neq 0$.

A common application of nondirectional hypothesis testing involves conducting a t test and comparing the means of two groups. After calculating the t statistic, one can determine the critical value of t that designates the null hypothesis rejection region for a nondirectional or two-tailed test of significance. This critical value will depend on the degrees of freedom in the sample and the desired probability level, which is usually .05. The rejection region will be represented on both sides of the probability curve because a nondirectional hypothesis is sensitive to a larger or smaller effect.

Figure 1 shows a distribution in which, at the 95% confidence level, the solid regions at the top and bottom of the distribution represent 2.5% accumulated probability in each tail. If the calculated value for t exceeds the critical value at either tail of the distribution, then the null hypothesis can be rejected.
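The two-tailed decision rule just described can be sketched in a few lines of Python. The scores below are invented purely for illustration; the pooled-variance t statistic is compared against the tabulated two-tailed critical value for 14 degrees of freedom (2.145 at the .05 level, with 2.5% in each tail):

```python
import math
import statistics

# Invented example scores for a treatment and a control group (hypothetical data).
treatment = [23, 25, 29, 31, 27, 26, 30, 28]
control = [22, 24, 21, 26, 23, 25, 20, 27]

n1, n2 = len(treatment), len(control)
mean1, mean2 = statistics.fmean(treatment), statistics.fmean(control)

# Pooled variance and the independent-samples t statistic.
sp2 = ((n1 - 1) * statistics.variance(treatment)
       + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Nondirectional (two-tailed) test at alpha = .05 with df = n1 + n2 - 2 = 14:
# the tabulated critical value is 2.145, splitting 2.5% into each tail.
# H0 is rejected if t falls in either rejection region, i.e., if |t| > 2.145.
t_crit = 2.145
reject_h0 = abs(t) > t_crit
print(round(t, 2), reject_h0)  # prints: 3.03 True
```

Because the test is nondirectional, the same rule would reject the null hypothesis for an equally large negative t; a directional test would instead compare t against a single-tail critical value of the chosen sign.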
NONEXPERIMENTAL DESIGNS
a purpose other than the experiment. In quasi-experimental designs, the experimenter can still manipulate the value of the independent variable, even though the groups to be compared are already established. In nonexperimental designs, the groups already exist and the experimenter cannot or does not attempt to manipulate an independent variable. The experimenter is simply comparing the existing groups based on a variable that the researcher did not manipulate. The researcher simply compares what is already established. Because he or she cannot manipulate the independent variable, it is impossible to establish a causal relationship between the variables measured in a nonexperimental design.

A nonexperimental design might be used when an experimenter would like to know about the relationship between two variables, like the frequency of doctor visits for people who are obese compared with those who are of healthy weight or are underweight. Clearly, from both an ethical and logistical standpoint, an experimenter could not simply select three groups of people randomly from a population and make one of the groups obese, one of the groups healthy weight, and one of the groups underweight. The experimenter could, however, find obese, healthy weight, and underweight people and record the number of doctor visits the members of each of these groups have to look at the relationship between the variables of interest. This nonexperimental design might yield important conclusions even though a causal relationship could not clearly be established between the variables.

Types of Nonexperimental Designs

Although the researcher does not assign participants to groups in nonexperimental designs, he or she can usually still determine what is measured and when it will be measured. So despite the lack of control in aspects of the experiment that are generally important to researchers, there are still ways in which the experimenter can control the data collection process to obtain interesting and useful data. Various authors classify nonexperimental designs in a variety of ways. In the subsequent section, six types of frequently used nonexperimental designs are discussed: comparative designs, causal-comparative designs (which are also referred to as differential or ex post facto designs), correlational designs, developmental designs, one-group pretest–posttest designs, and finally posttest-only nonequivalent group designs.

Comparative Designs

In these designs, two or more groups are compared on one or more measures. The experimenter might collect quantitative data and look for statistically significant differences between groups, or the experimenter might collect qualitative data and compare the groups in a more descriptive manner. Of course, the experimenter might also use mixed methods and do both of the previously mentioned strategies. Conclusions can be drawn about whether differences exist between groups, but the reasons for the differences cannot be established conclusively. The study described previously regarding obese, healthy weight, and underweight people's doctor visits is an example of a comparative design.

Causal-Comparative, Differential, or Ex Post Facto Research Designs

Nonexperimental research that is conducted when values of a dependent variable are compared based on a categorical independent variable is often referred to as a causal-comparative or a differential design. In these designs, the groups are determined by their values on some preexisting categorical variable, like gender. This design is also sometimes called ex post facto for that reason; the group membership is determined after the fact. After determining group membership, the groups are compared on the other measured dependent variable. The researcher then tests for statistically significant differences in the dependent variable between groups. Even though this design is referred to as causal comparative, a causal relationship cannot be established using this design.

Correlational Designs

In correlational designs, the experimenter measures two or more nonmanipulated variables for each participant to ascertain whether linear relationships exist between the variables. The researcher might use the correlations to conduct
subsequent regression analyses for predicting the values of one variable from another. No conclusions about causal relationships can be drawn from correlational designs. It is important to note, however, that correlational analyses might also be used to analyze data from experimental or quasi-experimental designs.

Developmental Designs

When a researcher is interested in developmental changes that occur over time, he or she might choose to examine the relationship between age and other dependent variables of interest. Clearly, the researcher cannot manipulate age, so developmental studies are often conducted using nonexperimental designs. The researcher might find groups of people at different developmental stages or ages and compare them on some characteristics. This is essentially a form of a differential or causal-comparative design in that group membership is determined by one's value of a categorical variable. Although age is not inherently a categorical variable, when people are grouped together based on categories of ages, age acts as a categorical variable.

Alternatively, the researcher might investigate one group of people over time in a longitudinal study to examine the relationship between age and the variables of interest. For example, the researcher might be interested in looking at how self-efficacy in mathematics changes as children grow up. He or she might measure the math self-efficacy of a group of students in 1st grade and then measure that same group again in the 3rd, 5th, 7th, 9th, and 11th grades. In this case, children were not randomly assigned to groups and the independent variable (age) was not manipulated by the experimenter. These two characteristics of the study qualify it as nonexperimental research.

One-Group Pretest–Posttest Design

In this within-subjects design, each individual in a group is measured once before and once after a treatment. In this design, the researcher is not examining differences between groups but examining differences across time in one group. The researcher does not control for extraneous variables possibly causing change over time. As with all nonexperimental designs, the researcher does not control the independent variable. However, this design is generally used when a researcher knows that an intervention of some kind will be taking place in the future. Thus, although the researcher is not manipulating an independent variable, someone else is. When the researcher knows this will occur before it happens, he or she can collect pretest data, which are simply data collected before the intervention. An example of this design would be if a professor wants to study the impact of a new campus-wide recycling program that will be implemented soon. The professor might want to collect data on the amount of recycling that occurs on campus before the program and on attitudes about recycling before the implementation of the program. Then, perhaps 6 months after the implementation, the professor might want to collect the same kind of data again. Although the professor did not manipulate the independent variable of the recycling program and did not randomly assign students to be exposed to the program, conclusions about changes that occurred after the program can still be drawn. Given the lack of manipulation of the independent variable and the lack of random assignment of participants, the study is nonexperimental research.

Posttest-Only Nonequivalent Control Group Design

In this type of between-subjects design, two nonequivalent groups of participants are compared. In nonexperimental research, the groups are almost always nonequivalent because the participants are not randomly assigned to groups. Because the researcher also does not control the intervention, this design is used when a researcher wants to study the impact of an intervention that already occurred. Given that the researcher cannot collect pretest data, he or she collects posttest data. However, to draw any conclusions about the posttest data, the researcher collects data from two groups, one that received the treatment or intervention, and one that did not. For example, if one is interested in knowing how participating in extracurricular sports during high school affects students' attitudes about the importance of physical fitness in adulthood, an experimenter might
survey students during the final semester of their senior year. The researcher could survey a group that participated in sports and a group that did not. Clearly, he or she could not randomly assign students to participate or not participate. In this case, he or she also could not compare the attitudes prior to participating with those after participating. Obviously, with no pretest data and with groups that are nonequivalent, the conclusions drawn from these studies might be lacking in internal validity.

Threats to Internal Validity

Internal validity is important in experimental research designs. It allows one to draw unambiguous conclusions about the relationship between two variables. When there is more than one possible explanation for the relationship between variables, the internal validity of the study is threatened. Because the experimenter has little control over potential confounding variables in nonexperimental research, the internal validity can be threatened in numerous ways.

Self-Selection

The most predominant threat with nonexperimental designs is caused by the self-selection that often occurs among the participants. Participants in nonexperimental designs often join the groups to be compared because of an interest in the group or because of life circumstances that place them in those groups. For example, if a researcher wanted to compare the job satisfaction levels of people in three different kinds of careers like business, academia, and food service, he or she would have to use three groups of people that either intentionally chose those careers or ended up in their careers because of life circumstances. Either way, the employees in those careers are likely to be in those different careers because they are different in other ways, like educational background, skills, and interests. Thus, if the researcher finds differences in job satisfaction levels, they might be because the participants are in different careers, or they might be because people who are more satisfied with themselves overall choose careers in business, whereas those who do not consider their satisfaction in life choose careers in academia, and those that are between careers of their choosing opt for jobs in food service. Another possibility is that people with more education are more satisfied with their jobs, and people in academia tend to be the most educated, followed by those in business and then those in food service. Thus, if the researcher found that academics are the most satisfied, it might be because of their jobs, or it might be because of their education. These proposed rationales are purely speculative; however, they demonstrate how internal validity might be threatened by self-selection. In both cases, a third variable exists that contributes to the differences between groups. Third variables can threaten internal validity in numerous ways in nonexperimental research.

Assignment Bias

Like self-selection, the assignment of participants to groups by a nonrandom method can create a threat to internal validity. Although participants do not always self-select into groups used in nonexperimental designs, when they do not self-select, they are generally assigned to a group for a particular reason by someone other than the researcher. For example, if a researcher wanted to compare the vocabulary acquisition of students exposed to bilingual teachers in elementary schools, he or she might compare students taught by bilingual teachers with students taught by monolingual teachers in one school. Students might have been assigned to their classes for reasons related to their skill level in vocabulary-related tasks, like reading. Thus, any relationship the researcher finds might be caused not by the exposure to a bilingual teacher but by a third variable like reading level.

History and Maturation

In nonexperimental designs, an experimenter might simply look for changes across time in a group. Because the experimenter does not control the manipulation of the independent variable or group assignment, both history and maturation can affect the measures collected from the participants. Some uncontrolled event (history) might occur that might confuse the conclusions drawn by the experimenter. For example, in the job satisfaction study above, if the researcher was
looking at changes in job satisfaction over time and during the course of the study the stock market crashed, then many of those with careers in business might have become more dissatisfied with their jobs because of that event. However, a stock market crash might not have affected academics and food service workers to the same extent that it affected business workers. Thus, the conclusions that might be formed about the dissatisfaction of business employees would not have internal validity.

Similarly, in the vocabulary achievement example above, one would expect elementary students' vocabularies to improve simply because of maturation over the course of a school year. Thus, if the experimenter only examined differences in vocabulary levels for students with bilingual teachers over the course of the school year, then he or she might draw erroneous conclusions about the relationship between vocabulary performance and exposure to a bilingual teacher when in fact no relationship exists. This maturation of the students would be a threat to the internal validity of that study.

design to acquire as much information about the program's effectiveness as possible rather than simply to not attempt to study the effectiveness of the program.

Even though nonexperimental designs give the experimenter little control over the experimental process, the experimenter can improve the reliability of the findings by replicating the study. Additionally, one important feature of nonexperimental designs is the possibility of stronger ecological validity than one might obtain with a controlled, experimental design. Given that nonexperimental designs are often conducted with preexisting interventions with ''real people'' in the ''real world,'' rather than participants in a laboratory, the findings are often more likely to be true to other real-world situations.

Jill H. Lohmeier

See also Experimental Design; Internal Validity; Quasi-Experimental Design; Random Assignment; Research Design Principles; Threats to Validity; Validity of Research Conclusions
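The third-variable logic behind the self-selection threat discussed in this entry can be made concrete with a small simulation. All numbers here are invented: education drives both career group and satisfaction, and career itself has no direct effect, yet a naive comparison of group means still shows "career differences":

```python
import random
import statistics

random.seed(42)

# Invented toy model: years of education differ by career group, and
# satisfaction depends ONLY on education (plus noise), not on the career.
EDUCATION = {"food service": 13, "business": 16, "academia": 20}

def satisfaction(career):
    # No direct career effect: satisfaction is 2 points per year of
    # education plus random noise.
    return 2.0 * EDUCATION[career] + random.gauss(0, 4)

groups = {career: [satisfaction(career) for _ in range(200)]
          for career in EDUCATION}
means = {career: round(statistics.fmean(scores), 1)
         for career, scores in groups.items()}

# Academics come out "most satisfied" even though career has no direct
# effect in this model; the entire group difference is the education
# confound, which is exactly what the self-selection threat describes.
print(means)
```

A researcher who saw only the group means would be tempted to attribute the difference to career, illustrating why group comparisons without random assignment lack internal validity.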
the sums of the ranks corresponding to the positive and negative differences. If the alternative hypothesis specifies that the median is greater than, less than, or unequal to the null value, then the test statistic will be $W_+$, $W_-$, or $\min(W_-, W_+)$, respectively. For small samples with $N$ less than 30, critical values are tabulated, whereas when $N$ is large, a normal approximation can be used to assess the significance. Note that the Wilcoxon signed-rank test can be used directly on one sample to test whether the population median is zero or not.

For comparisons of multiple populations, the nonparametric counterpart of the analysis of variance (ANOVA) test is the Kruskal–Wallis k-sample test proposed by William H. Kruskal and W. Allen Wallis in 1952. Given independent random samples of sizes $N_1, N_2, \ldots, N_k$ drawn from $k$ populations, the null hypothesis is that all the $k$ populations are identical and have the same median; the alternative hypothesis is that at least one of the populations has a median different from the others. Let $N$ denote the total number of measurements in the $k$ samples, $N = \sum_{i=1}^{k} N_i$. Let $R_i$ denote the sum of the ranks associated with the $i$th sample. It can be shown that the grand mean rank is $\frac{N+1}{2}$, whereas the sample mean rank for the $i$th sample is $\frac{R_i}{N_i}$. The test statistic takes the form of

$$\frac{12}{N(N+1)} \sum_{i=1}^{k} N_i \left( \frac{R_i}{N_i} - \frac{N+1}{2} \right)^2 .$$

The term $\left( \frac{R_i}{N_i} - \frac{N+1}{2} \right)$ measures the deviation of the $i$th sample rank mean away from the grand rank mean. The term $\frac{12}{N(N+1)}$ is the inverse of the variance of the total summation of ranks, and therefore it serves as a standardization factor. When $N$ is fairly large, the asymptotic distribution of the Kruskal–Wallis statistic can be approximated by the chi-squared distribution with $k - 1$ degrees of freedom.

Tests for Factorial Design

Test for Main Effects

In practice, data are often generated from experiments with several factors. To assess the individual factor effects in a nonparametric way, researchers could employ a rank transform method to test for the treatment effect of interest. Assume $X_{ijn} = \theta + \alpha_i + \beta_j + e_{ijn}$, where $i = 1, \ldots, I$ indexes the blocks, $j = 1, \ldots, J$ indexes the treatment levels, and $n = 1, \ldots, N$ indexes the replicates. The null hypothesis to be tested is $H_0: \beta_j = 0$, $j = 1, \ldots, J$, versus the alternative hypothesis $H_1$: at least one $\beta_j \neq 0$. The noises $e_{ijn}$ are assumed to be independent and identically distributed with a certain distribution $F$. The rank transform method proposed by W. J. Conover and Ronald L. Iman consists of replacing the observations by their ranks in the overall sample and then performing one of the standard ANOVA procedures on these ranks. Let $R_{ijn}$ be the rank corresponding to the observation $X_{ijn}$, and $\bar{R}_{ij \cdot} = \frac{1}{N} \sum_n R_{ijn}$, $\bar{R}_{\cdot j \cdot} = \frac{1}{NI} \sum_i \sum_n R_{ijn}$.

The Hora–Conover statistic proposed by Stephen C. Hora and Conover takes the form of

$$F = \frac{NI \sum_j (\bar{R}_{\cdot j \cdot} - \bar{R}_{\cdot \cdot \cdot})^2 / (J-1)}{\sum_i \sum_j \sum_k (R_{ijk} - \bar{R}_{ij \cdot})^2 / [IJ(N-1)]} .$$

When the sample size is large, either the number of replicates per cell $N \to \infty$ or the number of blocks $I \to \infty$, the $F$ statistic has a limiting $\chi^2_{J-1}$ distribution. The $F$ statistic resembles the analysis of variance statistic in which the actual observations $X_{ijk}$ are replaced by the $R_{ijk}$. Such an analysis is easy to perform, as most software implements ANOVA procedures. This method has wide applicability in the analysis of experimental data because of its robustness and simplicity in use.

However, the Hora–Conover statistic $F$ cannot handle unbalanced designs that often arise in practice. Consider the following unbalanced design:

$$X_{ijn} = \theta + \alpha_i + \beta_j + \varepsilon_{ijn} ,$$

where $i = 1, \ldots, I$ and $j = 1, \ldots, J$ index levels for factors A and B, respectively, and $n = 1, \ldots, n_{ij}$, with $N = \sum_{ij} n_{ij}$. We wish to test the hypothesis $H_0: \beta_j = 0 \; \forall j$ versus $H_1: \beta_j \neq 0$ for some $j$. To address the problem of unbalance in designs, let us examine the composition of a traditional rank.
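As a numerical check of the Kruskal–Wallis statistic given earlier, the formula can be applied directly to pooled ranks. The three samples below are invented and contain no ties; standard routines such as scipy.stats.kruskal produce the same value:

```python
from itertools import chain

# Illustrative computation of the Kruskal-Wallis statistic from its
# formula (invented data: three small samples, no tied values).
samples = [
    [27, 31, 29, 35],
    [22, 25, 30, 24],
    [18, 21, 26, 20],
]
k = len(samples)
pooled = sorted(chain.from_iterable(samples))
N = len(pooled)

# Rank every observation in the pooled sample (ranks 1..N; no ties here).
rank = {x: r for r, x in enumerate(pooled, start=1)}

H = 0.0
for sample in samples:
    Ni = len(sample)
    Ri = sum(rank[x] for x in sample)        # rank sum for this sample
    # Squared deviation of the sample mean rank from the grand mean rank.
    H += Ni * (Ri / Ni - (N + 1) / 2) ** 2
H *= 12 / (N * (N + 1))                      # standardization factor

# Under H0, H is approximately chi-squared with k - 1 = 2 degrees of
# freedom; the tabulated .05 critical value is 5.99.
print(round(H, 3), H > 5.99)  # prints: 7.038 True
```

Because H exceeds the critical value, the null hypothesis of identical populations would be rejected at the .05 level for these invented data.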
Define the function $u(x) = 1$ if $x \geq 0$ and $u(x) = 0$ if $x < 0$, and note that

$$R_{ijn} = \sum_{i'} \sum_{j'} \sum_{n'=1}^{n_{i'j'}} u(X_{ijn} - X_{i'j'n'}) .$$

Thus, the overall rankings do not adjust for different sample sizes in unbalanced designs. To address this problem, we define the notion of a weighted rank.

Definition

Let $\{X_{ijn}, \; i = 1, \ldots, I, \; j = 1, \ldots, J, \; n = 1, \ldots, n_{ij}\}$ be a collection of random variables. The weighted rank of $X_{ijn}$ within this set is

$$R^*_{ijn} = \frac{N}{IJ} \sum_{i'j'} \frac{1}{n_{i'j'}} \left[ \sum_{n'} u(X_{ijn} - X_{i'j'n'}) \right] ,$$

where $N = \sum n_{ij}$.

Define $S_N^* = [S_N^*(j), \; j = 1, \ldots, J]$ to be a vector of weighted linear rank statistics with components $S_N^*(j) = \frac{1}{IJ} \sum_i \frac{1}{n_{ij}} \sum_{n=1}^{n_{ij}} R^*_{ijn}$. Let $\bar{S}_N^* = \frac{1}{J} \sum_j S_N^*(j)$. Denote the covariance matrix of $S_N^*$ as $\Sigma = (\sigma_{b,b'})$, with $b, b' = 1, \ldots, J$. To estimate $\Sigma$, we construct a variable $C^b_{ijn}$:

$$C^b_{ijn} = \begin{cases} -\dfrac{1}{IJ^2 \rho_{ij}} \left( R^*_{ijn}/N \right), & j \neq b \\[1ex] \dfrac{J-1}{IJ^2 \rho_{ib}} \left( R^*_{ibn}/N \right), & j = b \end{cases} \quad (1)$$

Let $\hat{\sigma}_N(b,b) = \sum_i \sum_j \sum_n (C^b_{ijn} - \bar{C}^b_{ij \cdot})^2$, and $\hat{\sigma}_N(b,b') = \sum_i \sum_j \sum_n (C^b_{ijn} - \bar{C}^b_{ij \cdot})(C^{b'}_{ijn} - \bar{C}^{b'}_{ij \cdot})$. Let $\hat{\Sigma}_N$ be the $J \times J$ matrix of $[\hat{\sigma}_N(b,b')]$, $b, b' = 1, \ldots, J$. The fact that $R^*_{ijn}/N$ converges to $H(X_{ijn})$ almost surely for all $j$ leads to the fact that under $H_0$, $\frac{1}{N}(\hat{\Sigma}_N - \Sigma_N) \to 0$ a.s. elementwise.

Construct a contrast matrix $A = I_J - \frac{1}{J} J_J$. The generalized Hora–Conover statistic proposed by Xin Gao and Mayer Alvo for the main effects in unbalanced designs takes the form

$$T_M = (A S_N^*)' (A \hat{\Sigma}_N A')^{-} (A S_N^*) ,$$

in which the general inverse of the covariance matrix is employed. The statistic $T_M$ is invariant with respect to choices of the general inverses. When the design is balanced, the test statistic is equivalent to the Hora–Conover statistic. This statistic $T_M$ converges to a central $\chi^2_{J-1}$ as $N \to \infty$.

Test for Nested Effects

Often, practitioners might speculate that the different factors might not act separately on the response and that interactions might exist between the factors. In light of such a consideration, one could study an unbalanced two-way layout with an interaction effect. Let $X_{ijn}$ ($i = 1, \ldots, I$; $j = 1, \ldots, J$; $n = 1, \ldots, n_{ij}$) be a set of $N = \sum_i \sum_j n_{ij}$ independent random variables with the model

$$X_{ijn} = \theta + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijn} , \quad (2)$$

where $i$ and $j$ index levels for factors A and B, respectively. Assume $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$, and the $\varepsilon_{ijn}$ are independent and identically distributed (i.i.d.) random variables with absolutely continuous cumulative distribution function (cdf) $F$. Let $\delta_{ij} = \beta_j + \gamma_{ij}$. To test for the nested effect, we consider testing $H_0: \delta_{ij} = 0 \; \forall i$ and $j$, versus $H_1: \delta_{ij} \neq 0$ for some $i$ and $j$. The nested effect can be viewed as the combined overall effect of the treatment either through its own main effect or through its interaction with the block factor.

The same technique of using the weighted rank can be applied in testing for nested effects. Define $S_N^*(i,j) = \frac{1}{n_{ij}} \sum_n R^*_{ijn}$ and let $S_N^*$ be the $IJ$ vector of $[S_N^*(i,j), \; 1 \leq i \leq I, \; 1 \leq j \leq J]$. Construct a contrast matrix $B = I_I \otimes (I_J - \frac{1}{J} J_J)$, such that the $ij$ element of $B S_N^*$ is $S_N^*(i,j) - \frac{1}{J} \sum_{b=1}^{J} S_N^*(i,b)$. Let $\Gamma$ denote the covariance matrix of $S_N^*$. To facilitate the estimation of $\Gamma$, we define the following variables:

$$C^{(i,j)}(X_{abn}) = \begin{cases} \dfrac{N}{IJ n_{ab} n_{ij}} \displaystyle\sum_{k=1}^{n_{ij}} u(X_{abn} - X_{ijk}), & (a,b) \neq (i,j) \\[1ex] \dfrac{N}{IJ n_{ij}} \displaystyle\sum_{(i',j') \neq (i,j)} \frac{1}{n_{i'j'}} \sum_{n'=1}^{n_{i'j'}} u(X_{ijn} - X_{i'j'n'}), & (a,b) = (i,j) \end{cases} \quad (3)$$
918 Nonparametric Statistics
Let Γ̂_N be the IJ × IJ matrix with elements

    γ̂_N[(i, j), (i′, j′)] = Σ_{a,b,n} [C_(i,j)(X_abn) − C̄_(i,j)(X_ab·)] · [C_(i′,j′)(X_abn) − C̄_(i′,j′)(X_ab·)],

where C̄_(i,j)(X_ab·) = (1/n_ab) Σ_n C_(i,j)(X_abn). It can be proved that (1/√N)(γ̂_N − γ_N) → 0 almost surely elementwise. The proposed test statistic T_N for nested effects takes the form

    T_N = (B S*_N)′ (B Γ̂_N B′)⁻ (B S*_N).

Under H_0: δ_ij = 0, ∀i, j, the proposed statistic T_N converges to a central χ²_{I(J−1)} distribution as N → ∞.

Tests for Pure Nonparametric Models

The previous discussion has focused on the linear model with the error distribution unspecified. To reduce the model assumptions, Michael G. Akritas, Steven F. Arnold, and Edgar Brunner have proposed a nonparametric framework in which the structures of the designs are no longer restricted to linear location models. The nonparametric hypotheses are formulated in terms of linear contrasts of normalized distribution functions. One advantage of the nonparametric hypotheses is that the parametric hypotheses in linear models are implied by the nonparametric hypotheses. Furthermore, the nonparametric hypotheses are not restricted to continuous distribution functions, and therefore models with discrete observations might also be included in this setup.

Under this nonparametric setup, the response variables in a two-way unbalanced layout with I treatments and J blocks can be described by the following model:

    X_ijn ~ F_ij(x),  i = 1, …, I;  j = 1, …, J;  n = 1, …, n_ij,   (2)

where F_ij(x) = ½[F⁺_ij(x) + F⁻_ij(x)] denotes the normalized version of the distribution function, F⁺_ij(x) = P(X_ijn ≤ x) denotes the right-continuous version, and F⁻_ij(x) = P(X_ijn < x) denotes the left-continuous version. The normalized version of the distribution function accommodates both ties and ordinal data. Compared with the classic ANOVA models, this nonparametric framework is different in two aspects: First, the normality assumption is relaxed; second, it not only includes the commonly used location models but also encompasses other arbitrary models with different cells having different distributions. Under this nonparametric setting, the hypotheses can be formulated in terms of linear contrasts of the distribution functions. According to Akritas and Arnold's method, F_ij can be decomposed as follows:

    F_ij(y) = M(y) + A_i(y) + B_j(y) + C_ij(y),

where Σ_i A_i = Σ_j B_j = 0, Σ_i C_ij = 0 for all j, and Σ_j C_ij = 0 for all i. It follows that M = F̄··, A_i = F̄i· − M, B_j = F̄·j − M, and C_ij = F_ij − F̄i· − F̄·j + M, where the subscript "·" denotes averaging over all values of the index. Denote the treatment factor as factor A and the block factor as factor B. The overall nonparametric hypotheses of no treatment main effects and of no treatment simple factor effects are specified as follows:

    H_0(A): F̄i· − F̄·· = 0, ∀i = 1, …, I;
    H_0(A|B): F_ij − F̄·j = 0, ∀i = 1, …, I, ∀j = 1, …, J.

The hypothesis H_0(A|B) implies that the treatment has no effect on the response, either through the main effects or through the interaction effects.

This framework especially accommodates the analysis of interaction effects in a unified manner. In the literature, testing interactions using ranking methods has been a controversial issue for a long time. The problem pertaining to the analysis of interaction effects is that interaction effects based on cell means can be artificially removed or introduced by certain nonlinear transformations. As rank statistics are invariant to nonlinear monotone transformations, they cannot be used to test hypotheses that are not invariant to monotone transformations. To address this problem, Akritas and Arnold proposed to define nonparametric interaction effects in terms of linear contrasts of the distribution functions. Such a nonparametric formulation of interaction effects is invariant to monotone transformations. The nonparametric hypothesis of
scales, the parametric approach of comparing two treatments based on means is not applicable on ordinal scales. However, the points of an ordinal scale can be ordered by size, which can be used to form the estimate of the nonparametric relative effects. Let π = (π_11, …, π_IJ)′ = ∫ H dF denote the vector of the relative effects. The relative effects π_ij can be estimated by replacing the distribution functions F_ij(x) by their empirical counterparts …

… converges to H*(X_ijk) almost surely leads to the result that (V̂ − V) → 0 a.s. elementwise. Therefore, we consider the test statistic

    T = N π̂′ C′ (C V̂ C′)⁻ C π̂,

where (C V̂ C′)⁻ denotes the generalized inverse of (C V̂ C′). According to Slutsky's theorem, because V̂ is consistent, we have T →d χ²_f, where the degrees of freedom f = rank(C).
See also Distribution; Normal Distribution; Nonparametric Statistics for the Behavioral Sciences; Null Hypothesis; Research Hypothesis

Further Readings

Akritas, M. G., & Arnold, S. F. (1994). Fully nonparametric hypotheses for factorial designs I: Multivariate repeated measures designs. Journal of the American Statistical Association, 89, 336–343.
Akritas, M., Arnold, S., & Brunner, E. (1997). Nonparametric hypotheses and rank statistics for unbalanced factorial designs. Journal of the American Statistical Association, 92, 258–265.
Akritas, M. G., & Brunner, E. (1997). A unified approach to rank tests in mixed models. Journal of Statistical Planning and Inference, 61, 249–277.
Brunner, E., & Munzel, U. (2002). Nichtparametrische Datenanalyse [Nonparametric data analysis]. Heidelberg, Germany: Springer-Verlag.
Brunner, E., Puri, M. L., & Sun, S. (1995). Nonparametric methods for stratified two-sample designs with application to multi-clinic trials. Journal of the American Statistical Association, 90, 1004–1014.
Conover, W. J., & Iman, R. L. (1976). On some alternative procedures using ranks for the analysis of experimental designs. Communications in Statistics, A5, 1349–1368.
Domhof, S. (2001). Nichtparametrische relative Effekte [Nonparametric relative effects] (Unpublished dissertation). University of Göttingen, Göttingen, Germany.
Gao, X., & Alvo, M. (2005). A unified nonparametric approach for unbalanced factorial designs. Journal of the American Statistical Association, 100, 926–941.
Hora, S. C., & Conover, W. J. (1984). The F statistic in the two-way layout with rank-score transformed data. Journal of the American Statistical Association, 79, 668–673.

NONPARAMETRIC STATISTICS FOR THE BEHAVIORAL SCIENCES

Sidney Siegel (January 4, 1916–November 29, 1961) was a psychologist trained at Stanford University. He spent nearly his entire career as a professor at Pennsylvania State University. He is known for his contributions to nonparametric statistics, including the development with John Tukey of the Siegel–Tukey test, a test for differences in scale between groups. Arguably, he is best known for his book, Nonparametric Statistics for the Behavioral Sciences, the first edition of which was published by McGraw-Hill in 1956. After Siegel's death, a second edition was published (1988), adding N. John Castellan, Jr., as coauthor. Nonparametric Statistics for the Behavioral Sciences is the first text to provide a practitioner's introduction to nonparametric statistics. By its copious use of examples and its straightforward "how to" approach to the most frequently used nonparametric tests, this text was the first accessible introduction to nonparametric statistics for the nonmathematician. In that sense, it represents an important step forward in the analysis and presentation of non-normal data, particularly in the field of psychology.

The organization of the book is designed to assist the researcher in choosing the correct nonparametric test. After the introduction, the second chapter introduces the basic principles of hypothesis testing, including the definitions of the null and alternative hypotheses, the size of the test, Type I and Type II errors, power, sampling distributions, and the decision rule. Chapter 3 describes the factors that influence the choice of the correct test. After explaining some common parametric assumptions and the circumstances under which nonparametric
tests should be used, the text gives a basic outline of how the proper statistical test should be chosen. Tests are distinguished from one another in two important ways: First, tests are distinguished by their capability of analyzing data of varying levels of measurement. For example, the χ² goodness-of-fit test can be applied to nominal data, whereas the Kolmogorov–Smirnov test requires at least the ordinal level of measurement. Second, tests are distinguished in terms of the type of samples to be analyzed. For example, two-sample paired tests are distinguished from tests applicable to k independent samples, which are distinguished from tests of correlation, and so on. Tests included in the text include the following: the binomial test, the sign test, the signed-rank test, tests for data displayed in two-way tables, the Mann–Whitney U test, the Kruskal–Wallis test, and others. Also included are extensive tables of critical values for the various tests discussed in the text.

Because nonparametric tests make fewer assumptions than parametric tests, they are generally less powerful than the parametric alternatives. The text compares the various tests presented with their parametric analogues in terms of power efficiency. Power efficiency is defined as the percent decrease in sample size required for the parametric test to achieve the same power as that of the nonparametric test when the test is performed on data that do, in fact, satisfy the assumptions of the parametric test.

This work is important because it seeks to present nonparametric statistics in a way that is "completely intelligible to the reader whose mathematical training is limited to elementary algebra" (Siegel, 1956, p. 4). It is replete with examples to demonstrate the application of these tests in contexts that are familiar to psychologists and other social scientists. The text is organized so that the user, knowing the specific level of measurement and type(s) of samples being analyzed, can immediately identify several nonparametric tests that might be applied to his or her data.

Included in each test is a description of its function (under what circumstances this particular test should be used), rationale, and method (a heuristic description of why the test works and how the test statistic is calculated) including any modifications that exist and the procedure for dealing with ties, both large and small sample examples, a numbered list of steps for performing the test, and other references for a more in-depth description of the test.

Gregory Michaelson and Michael Hardin

See also Distribution; Nonparametric Statistics; Normal Distribution; Null Hypothesis; Research Hypothesis

Further Readings

Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.

NONPROBABILITY SAMPLING

The two kinds of sampling techniques are probability and nonprobability sampling. Probability sampling is based on the notion that the people or events chosen are selected because they are representative of the entire population. Nonprobability sampling refers to procedures in which researchers select their sample elements not based on a predetermined probability. This entry examines the application, limitations, and utility of nonprobability sampling procedures. Conceptual and empirical strategies to use nonprobability sampling techniques more effectively are also discussed.

Sampling Procedures

Probability Sampling

There are many different types of probability sampling procedures. More common ones include simple, systematic, stratified, multistage, and cluster sampling. Probability sampling allows one to have confidence that the results are accurate and unbiased, and it allows one to estimate how precise the data are likely to be. The data from a properly drawn sample are superior to data drawn from individuals who just show up at a meeting or perhaps speak the loudest and convey their personal thoughts and sentiments. The critical issues in sampling include whether to use a probability sample, the sampling frame (the set of people that have a chance of being selected and how well it corresponds to the population studied), the size of the sample, the sample design (particularly the
strategy used to sample people, schools, households, etc.), and the response rate. The details of the sample design, including size and selection procedures, influence the precision of sample estimates regarding how likely the sample is to approximate population characteristics. The use of standardized measurement tools and procedures also helps to assure comparable responses.

Nonprobability Sampling

Nonprobability sampling is conducted without knowledge of whether those chosen in the sample are representative of the entire population. In some instances, the researcher does not have sufficient information about the population to undertake probability sampling. The researcher might not even know who or how many people or events make up the population. In other instances, nonprobability sampling is based on a specific research purpose, the availability of subjects, or a variety of other nonstatistical criteria. Applied social and behavioral researchers often face challenges and dilemmas in using a random sample, because such samples in real-world research are "hard to reach" or not readily available. Even if researchers have contact with hard-to-reach samples, they might be unable to obtain a complete sampling frame because of peculiarities of the study phenomenon. This is especially true when studying vulnerable or stigmatized populations, such as children exposed to domestic violence, emancipated foster care youth, or runaway teenagers. Consider, for instance, the challenges of surveying adults with a diagnosis of paranoid personality disorder. This is not a subgroup that is likely to agree to sign a researcher's informed consent form, let alone complete a lengthy battery of psychological instruments asking a series of personal questions.

Applied researchers often encounter other practical dilemmas when choosing a sampling method. For instance, there might be limited research resources. Because of limitations in funding, time, and other resources necessary for conducting large-scale research, researchers often find it difficult to use large samples. Researchers employed by a single site or agency might be unable to access subjects served by other agencies located in other sites. It is not a coincidence that recruiting study subjects from a single site or agency is one of the most popular methods among studies using nonprobability procedures. The barriers preventing a large-scale multisite collaboration among researchers can be formidable and difficult to overcome.

Statistical Theories About Sampling Procedures

Because a significant number of studies employ nonprobability samples and at the same time apply inferential statistics, it is important to understand the consequences. In scientific research, there are many reasons to observe elements of a sample rather than a population. Advantages of using sample data include reduced cost, greater speed, greater scope, and greater accuracy. However, there is no reason to use a biased sample that does not represent a target population. Scientists have given this topic a rigorous treatment and developed several statistical theories about sampling procedures. The central limit theorem, which is usually given in introductory statistics courses, forms the foundation of probability-sampling techniques. At the core of this theorem are the proven relationships between the mean of a sampling distribution and the mean of a population, between the standard deviation of a sampling distribution (known as the standard error) and the standard deviation of a population, and between the normal sampling distribution and the possibly non-normal population distribution.

Statisticians have developed various formulas for estimating how closely the sample statistics are clustered around the population's true values under various types of sampling designs, including simple random sampling, systematic sampling, stratified sampling, clustered sampling, and multistage clustered and stratified sampling. These formulas become the yardstick for determining adequate sample size. Calculating the adequacy of a probabilistic sample's size is generally straightforward and can be estimated mathematically based on preselected parameters and objectives (i.e., x statistical power with y confidence intervals). In practice, however, sampling error, the key component required by the formulas for figuring out a needed sample size, is often unknown to researchers. In such instances, which often involve quasi-experimental designs, Jacob Cohen's framework of
statistical power analysis is employed instead. This framework concerns the balance among four elements of a study: sample size, effect size or difference between comparison groups, probability of making a Type I error, and probability of denying a false hypothesis, or power. Studies using small nonprobability samples, for example, are likely to have inadequate power (significantly below the .85 convention indicating adequate power). As a consequence, studies employing sophisticated analytical models might not meet the required statistical criteria. The ordinary least-squares regression model, for example, makes five statistical assumptions about data, and most of the assumptions require a randomized process for data gathering. Violating statistical assumptions in a regression analysis refers to the presence of one or more detrimental problems such as heteroscedasticity, autocorrelation, non-normality, multicollinearity, and others. Multicollinearity problems are particularly likely to occur in nonprobability studies in which data were gathered through a sampling procedure with hidden selection bias and/or with small sample sizes. Violating statistical assumptions might increase the risk of producing biased and inefficient estimates of regression coefficients and an exaggerated R².

Guidelines and Recommendations

This brief review demonstrates the importance of using probability sampling; however, probability sampling cannot be used in all instances. Therefore, the following questions must be addressed: Given the sampling dilemmas, what should researchers do? How can researchers using nonprobability sampling exercise caution in reporting findings or undertake remedial measures? Does nonprobability sampling necessarily produce adverse consequences? It is difficult to offer precise remedial measures to correct the most commonly encountered problems associated with the use of nonprobability samples because such measures vary by the nature of the research questions and the type of data researchers employ in their studies. Instead of offering specific measures, the following strategies are offered to address the conceptual and empirical dilemmas in using nonprobability samples.

George Judge et al. caution researchers to be aware of assumptions embedded in the statistical models they employ, to be sensitive to departures of data from the assumptions, and to be willing to take remedial measures. John Neter et al. recommend that researchers always perform diagnostic tests to investigate departures of data from the statistical assumptions and take corrective measures if detrimental problems are present. In theory, all research should use probabilistic sampling methodology, but in practice this is difficult, especially for hard-to-reach, hidden, or stigmatized populations. Much of social science research can hardly be performed in a laboratory. It is important to stress that the results of a study are meaningful if they are interpreted appropriately and used in conjunction with statistical theories. Theory, design, analysis, and interpretation are all closely connected.

Researchers are also advised to study compelling populations and compelling questions. This most often involves purposive samples in which the research population has some special significance. Most commonly used samples, particularly in applied research, are purposive. Purposive sampling is more applicable in exploratory studies and studies contributing new knowledge. Therefore, it is imperative for researchers to conduct a thorough literature review to understand the "edge of the field" and whether the study population or question is a new or significant contribution. How does this study contribute uniquely to the existing research knowledge? Purposive samples are selected based on predetermined criteria related to the research. Research that is field oriented and not concerned with statistical generalizability often uses nonprobabilistic samples. This is especially true in qualitative research studies. Adequate sample size typically relies on the notion of "saturation," or the point at which no new information or themes are obtained from the data. In qualitative research practice, this can be a challenging determination.

Researchers should also address subject recruitment issues to reduce selection bias. If possible, researchers should use consecutive admissions, including all cases during a representative time frame. They should describe the population in greater detail to allow for cross-study comparisons. Other researchers will benefit from additional data
and descriptors that provide a more comprehensive picture of the characteristics of the study population. It is critical in reporting results (for both probability and nonprobability sampling) to tell the reader who was or was not given a chance to be selected. Then, to the extent that is known, researchers should tell the reader how those omitted from the study are the same as or different from those included. Conducting diagnostics comparing omitted or lost cases with the known study subjects can help in this regard. Ultimately, it is important for researchers to indicate clearly and discuss the limits of generalizability and external validity.

In addition, researchers are advised to make efforts to assure that a study sample provides adequate statistical power for hypothesis testing. It has been shown that, other things being equal, a large sample always produces more efficient and unbiased estimates of population true parameters than a small sample. When the use of a nonprobability sample is inevitable, researchers should carefully weigh the pros and cons associated with different study designs and choose a sample size that is as large as possible.

Another strategy is for researchers to engage in multiagency research collaborations that generate samples across agencies and/or across sites. In one study, because of limited resources, Brent Benda and Robert Flynn Corwyn found it unfeasible to draw a nationally representative sample to test the mediating versus moderating effects of religion on crime. To deal with the challenge, they used a comparison between two carefully chosen sites: random samples selected from two public high schools involving 360 adolescents in the inner city of a large east coast metropolitan area, and simple random samples involving 477 adolescents from three rural public high schools in an impoverished southern state. The resultant data undoubtedly had greater external validity than studies based on either site alone.

If possible, researchers should use national samples to run secondary data analyses. These databases were created by probability sampling and are deemed to have a high degree of representativeness and other desirable properties. The drawback is that these databases are likely to be useful for only a minority of research questions.

Finally, does nonprobability sampling necessarily produce adverse consequences? Shenyang Guo and David L. Hussey have shown that a homogeneous sample produced by nonprobability sampling is better in prediction than a less homogeneous sample produced by probability sampling. Remember, regression is a leading method used by applied researchers employing inferential statistics. Regression-type models also include simple linear regression, multiple regression, logistic regression, structural equation modeling, analysis of variance (ANOVA), multivariate analysis of variance (MANOVA), and analysis of covariance (ANCOVA). In a regression analysis, a residual is defined as the difference between the observed value and the model-predicted value of the dependent variable. Researchers are concerned about this measure because it is the model with the smallest sample residual that gives the most accurate predictions about sample subjects. Statistics such as Theil's U, a modified version of the root-mean-square error measuring the magnitude of the overall sample residual, can gauge the scope of sample residuals. The statistic ranges from zero to one, with a value closer to zero indicating a smaller overall residual. In this regard, nonprobability samples can be more homogeneous than a random sample. Using regression coefficients (including an intercept) to represent study subjects, it is much easier to obtain an accurate estimate for a homogeneous sample than for a heterogeneous sample. The consequence, therefore, is that small homogeneous samples generated by a nonprobability sampling procedure might produce more accurate predictions about sample subjects. Therefore, if the task is not to infer statistics from sample to population, using a nonprobability sample is a better strategy than using a probability sample.

With the explosive growth of the World Wide Web and other new electronic technologies such as search monkeys, nonprobability sampling remains an easy way to obtain feedback and collect information. It is convenient, verifiable, and low cost, particularly when compared with face-to-face paper-and-pencil questionnaires. Along with the benefits of new technologies, however, the previous cautions apply and might be even more important given the ease with which larger samples might be obtained.

David L. Hussey
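Theil's U mentioned above can be illustrated in a few lines. The version sketched here is the common U1 form (root-mean-square residual divided by the sum of the root-mean-square magnitudes of the observed and predicted values, which bounds it between 0 and 1); the entry does not give a formula, so treat this as one plausible reading, with data invented purely for illustration.

```python
import numpy as np

def theils_u1(observed, predicted):
    """Theil's U1 inequality coefficient: RMSE of the residuals divided
    by the sum of the root-mean-square magnitudes of the observed and
    predicted values. 0 indicates a perfect fit; 1 is the worst case."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    denom = np.sqrt(np.mean(observed ** 2)) + np.sqrt(np.mean(predicted ** 2))
    return rmse / denom

# Invented regression output: a close fit gives U near 0,
# diametrically wrong predictions give U = 1.
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.1, 3.9, 6.2, 7.8])
print(theils_u1(y, y_hat))
print(theils_u1(y, -y))
```

Because both numerator and denominator are on the scale of the data, the coefficient is unit free, which is what makes the zero-to-one interpretation described in the entry possible.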
See also Naturalistic Inquiry; Probability Sampling; Sampling; Selection

Further Readings

Benda, B. B., & Corwyn, R. F. (2001). Are the effects of religion on crime mediated, moderated, and misrepresented by inappropriate measures? Journal of Social Service Research, 27, 57–86.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Guest, G., Bunce, A., & Johnson, L. (2006). How many interviews are enough? An experiment with data saturation and variability. Field Methods, 18, 59–82.
Guo, S., & Hussey, D. (2004). Nonprobability sampling in social work research. Journal of Social Service Research, 30, 1–18.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lütkepohl, H., & Lee, T. C. (1985). The theory and practice of econometrics (2nd ed.). New York: Wiley.
Kennedy, P. (1985). A guide to econometrics (2nd ed.). Cambridge: MIT Press.
Kish, L. (1965). Survey sampling. New York: Wiley.
Mech, E. V., & Che-Man Fung, C. (1999). Placement restrictiveness and educational achievement among emancipated foster youth. Research on Social Work Practice, 9, 213–228.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear regression models (3rd ed.). Chicago: Irwin.
Nugent, W. R., Bruley, C., & Allen, P. (1998). The effects of aggression replacement training on antisocial behavior in a runaway shelter. Research on Social Work Practice, 8, 637–656.
Peled, E., & Edleson, J. L. (1998). Predicting children's domestic violence service participation and completion. Research on Social Work Practice, 8, 698–712.
Rubin, A., & Babbie, E. (1997). Research methods for social work (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Theil, H. (1966). Applied economic forecasting. Amsterdam, the Netherlands: Elsevier.

NONSIGNIFICANCE

This entry defines nonsignificance within the context of null hypothesis significance testing (NHST), the dominant scientific statistical method for making inferences about populations based on sample data. Emphasis is placed on the three routes to nonsignificance: a real lack of effect in the population, failure to detect a real effect because of an insufficiently large sample, or failure to detect a real effect because of a methodological flaw. Of greatest importance is the recognition that nonsignificance is not affirmative evidence of the absence of an effect in the population.

Nonsignificance is the determination in NHST that no statistically significant effect (e.g., a correlation, a difference between means, or a dependence of proportions) can be inferred for a population. NHST typically involves a statistical test (e.g., a t test) performed on a sample to infer whether two or more variables are related in a population. Studies often have a high probability of failing to reject a false null hypothesis (i.e., committing a Type II, or false negative, error), thereby returning a nonsignificant result even when an effect is present in the population.

In its most common form, a null hypothesis posits that the means on some measurable variable for two groups are equal to each other. A statistically significant difference would indicate that the probability that a true null hypothesis is erroneously rejected (a Type I, or false positive, error) is below some desired threshold (α), which is typically .05. As a result, statistical significance refers to a conclusion that there likely is a difference in the means of the two population groups. In contrast, nonsignificance refers to the finding that the two means do not significantly differ from each other (a failure to reject the null hypothesis). Importantly, nonsignificance does not indicate that the null hypothesis is true; it only indicates that one cannot rule out chance and random variation to explain observed differences. In this sense, NHST is analogous to an American criminal trial, in which there is a presumption of innocence (equality), the burden of proof is on demonstrating guilt (difference), and a failure to convict (reject the null hypothesis) results only in a verdict of "not guilty" (not significant), which does not confer innocence (equality).

Nonsignificant findings might accurately reflect the absence of an effect or might be caused by a research design flaw leading to low statistical power and a Type II error. Statistical power is defined as the probability of detecting an existing effect (rejecting a false null hypothesis) and might
be calculated ex ante given the population effect size (or an estimate thereof), the desired significance level (e.g., .05), and the sample size.

Type II errors resulting from insufficient statistical power can result from several factors. First, small samples yield lower power because they are simply less likely than large samples to be representative of the population, and they lead to larger estimates of the standard error. The standard error is estimated as the sample standard deviation divided by the square root of the sample size. Therefore, the smaller the sample, the bigger the estimate of the standard error will be. Because the standard error is the denominator in significance test equations, the bigger it is, the less likely the test statistic will be large enough to reject the null hypothesis. Small samples also contribute to nonsignificance because sample size (specifically, the degrees of freedom derived from it) is an explicit factor in calculations of significance levels (p values). Low power can also result from imprecise measurement, which might result in excessive variance. This too will cause the denominator in the test statistic calculation to be large, thereby underestimating the magnitude of the effect.

Type II error can also result from flawed methodology, wherein variables are operationalized inappropriately. If variables are not manipulated or measured well, the real relationship between the intended variables will be more difficult to discern from the data. This issue is often referred to as construct validity. A nonrepresentative sample, even if it is large, or a misspecified model might also prevent the detection of an existing effect.

To reduce the likelihood of nonsignificance resulting from Type II errors, an a priori power analysis can determine the necessary sample size to provide the desired likelihood of rejecting the null hypothesis if it is false. The suggested convention for statistical power is .8. Such a level would allow a researcher to say with 80% confidence that no Type II error had been committed and, in the event of nonsignificant findings, that no effect exists.

Some have critiqued the practice of reporting significance tests alone, given that with a .05 criterion, determining that a result of .049 is statistically significant whereas one of .051 is not artificially dichotomizes the determination of significance. An overreliance on statistical significance … studies with significant findings, contributing to a documented upward bias in effect sizes in published studies. Directional hypotheses might also be an issue. For accurate determination of significance, directionality should be specified in the design phase, as directional hypotheses (e.g., predicting that one particular mean will be higher than the other) have twice the statistical power of nondirectional hypotheses, provided the results are in the hypothesized direction.

Many of these problems can be avoided with a careful research design that incorporates a sufficiently large sample based on an a priori power analysis. Other suggestions include reporting effect sizes (e.g., Cohen's d) and confidence intervals to convey more information about the magnitude of effects relative to variance and how close the results are to being determined significant. Additional options include reporting the power of the tests performed or the sample size needed to determine significance for a given effect size. Additionally, meta-analysis (combining effects across multiple studies, even those that are nonsignificant) can provide more powerful and reliable assessments of relations among variables.

Christopher Finn and Jack Glaser

See also Null Hypothesis; Power; Power Analysis; Significance, Statistical; Type II Error

Further Readings

Berry, E. M., Coustere-Yakir, C., & Grover, N. B. (1998). The significance of non-significance. QJM, 91, 647–653.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638–641.

NORMAL DISTRIBUTION

The normal distribution, which is also called a Gaussian distribution, bell curve, or normal
also results in a bias among published research for curve, is commonly known for its bell shape (see
Normal Distribution 927
[Figure: three normal curves with μ = 80, σ = 20; μ = 100, σ = 20; and μ = 100, σ = 40; and a comparison of a normal curve with a curve actually fitted to the data.]
[...] normal distribution, it is possible to determine percentages (proportions) of cases that have scored less than 120. First, the calculation of the z score is required:

z = (X − μ)/σ = (120 − 100)/10 = +2.

A z score of +2 means that the score of 120 is 2 standard deviations above the mean. To determine the percentile for this z score, this value has to be looked up in a table of standardized values of the normal distribution (i.e., a z table). The value of 2 is looked up in a standard normal-curve areas table (not presented here, but one can be found in most statistics textbooks) and the corresponding value of .4772 is found (this is the area between the mean and a z value of 2). This means that the probability of observing a score between 100 (the mean in this example) and 120 (the score of interest in this example) is .4772, or 47.72%. The standard normal distribution is symmetrical around its mean, so 50% of all scores fall at or below 100. To determine what proportion of individuals score below 120, the value below 100 has to be added to the value between 100 and 120. Therefore, 50% is added to 47.72%, resulting in 97.72%. Thus, 97.72% of individuals are expected to score at or below 120. Conversely, if interested in determining the proportion of individuals who would score better than 120, .4772 would be subtracted from .5; this would equal .0228, which means that 2.28% of individuals would be expected to score at or above 120.

Should a person wish to know the proportion of people who obtained an IQ between 85 and 95, a series of the previous calculations would have to be conducted. In this example, the z scores for 85 and for 95 would have to be calculated first (assuming the same mean and standard deviation as the previous example):

For 85: z = (X − μ)/σ = (85 − 100)/10 = −1.5.
For 95: z = (X − μ)/σ = (95 − 100)/10 = −0.5.

The areas under the negative z scores are the same as the areas under the identical positive z scores because the standard normal distribution is symmetrical about its mean (not all tables present both the positive and negative z scores). In looking up the table for z = 1.5 and z = 0.5, .4332 and .1915, respectively, are obtained. To determine the proportion of IQ scores between 85 and 95, .1915 has to be subtracted from .4332; .4332 − .1915 = .2417, so 24.17% of IQ scores are found between 85 and 95.

For both examples presented here, the proportions are estimates based on mathematical calculations and do not represent actual observations. Therefore, in the previous example, it is estimated that 97.72% of IQ scores are expected to be less than or equal to 120, and it is estimated that 24.17% are expected to be found between 85 and 95, but what would actually be observed might be different.

The normal distribution is also important because of its numerous mathematical properties. Assuming that the data of interest are normally distributed allows researchers to apply different calculations that can only be applied to data that share the characteristics of a normal curve. For instance, many scores such as percentiles, t scores (scores that have been converted to standard scores and subsequently modified such that their mean is 50 and standard deviation is 10), and stanines (scores that have been changed to a value from 1 to 9 depending on their location in the distribution; e.g., a score found in the top 4% of the distribution is given a value of 9, and a score found in the middle 20% of the distribution is given a value of 5) are calculated based on the normal distribution. Many statistics rely on the normal distribution as they are based on the assumption that directly observable scores are normally distributed or have a distribution that approximates normality. Some statistics that assume the variables under study are normally distributed include t, F, and χ². Furthermore, the normal distribution can be used as an approximation for some other distributions.

To determine whether a given set of data follows a normal distribution, examination of skewness and kurtosis, the Probability-Probability (P-P) plot, or results of normality tests such as the Kolmogorov–Smirnov test, Lilliefors test, and the Shapiro–Wilk test can be conducted. If the data do not reflect a normal distribution, then the researcher has to determine whether a few outliers are influencing the distribution of the data, whether data transformation will be necessary, or
whether nonparametric statistics will be used to analyze the data, for instance.

Many measurements (latent variables) and phenomena are assumed to be normally distributed (and thus can be approximated by the normal distribution). For instance, intelligence, weight, height, abilities, and personality traits can each be said to follow a normal distribution. However, realistically, researchers deal with data that come from populations that do not perfectly follow a normal distribution, or their distributions are not actually known. The Central Limit Theorem (also known as the second fundamental theorem of probability) partly takes care of this problem. One important element of the Central Limit Theorem states that when the sample size is large, the sampling distribution of the sample means will approach the normal curve even if the population distribution is not normal. This allows researchers to be less concerned about whether the population distributions follow a normal distribution or not.

These descriptions and applications apply to the univariate normal distribution (i.e., the normal distribution of a single variable). When two (bivariate normal distribution) or more variables are considered, the multivariate normal distribution is important for examining the relation of those variables and for using multivariate statistics.

Whether many variables are actually normally distributed is a point of debate for many social researchers. For instance, the view that certain personality traits are normally distributed can never be observed, as the constructs are not actually measured. Many variables are measured using discrete rather than continuous scales. Furthermore, large sample sizes are not always obtained; thus, the normal curve might not actually fit those data well.

History of the Normal Distribution

The first known documentation of the normal distribution was written by Galileo in the 17th century in his description of random errors found in measurements by astronomers. Abraham de Moivre is credited with its first appearance in his publication of an article in 1733. Pierre Simon de Laplace developed the first general Central Limit Theorem in the early 1800s (an important element in the application of the normal distribution) and described the normal distribution. Carl Friedrich Gauss independently discovered the normal curve and its properties at around the same time as de Laplace and was interested primarily in its application to errors of observation in astronomy. It was consequently extensively used for describing errors. Adolphe Quetelet extended the use of the normal curve beyond errors, believing it could be used to describe phenomena in the social sciences, not just physics. Sir Francis Galton in the late 19th century extended Quetelet's work and applied the normal curve to other psychological measurements.

Adelheid A. M. Nicol

See also Central Limit Theorem; Data Cleaning; Multivariate Normal Distribution; Nonparametric Statistics; Normality Assumption; Normalizing Data; Parametric Statistics; Percentile Rank; Sampling Distributions

Further Readings

Hays, W. L. (1994). Statistics. Orlando, FL: Harcourt.
Hopkins, K. D., & Glass, G. V. (1978). Basic statistics for the behavioral sciences. Englewood Cliffs, NJ: Prentice Hall.
King, B. M., & Minium, E. M. (2006). Statistical reasoning in psychology and education. Hoboken, NJ: Wiley.
Levin, J., & Fox, J. A. (2006). Elementary statistics in social research. Boston: Allyn & Bacon.
Lewis, D. G. (1957). The normal distribution of intelligence: A critique. British Journal of Psychology, 48, 98–104.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
Patel, J. K., & Read, C. B. (1996). Handbook of the normal distribution. New York: Marcel Dekker.
Snyder, D. M. (1986). On the theoretical derivation of the normal distribution for psychological phenomena. Psychological Reports, 59, 399–404.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Belknap Press of Harvard University Press.
Thode, H. (2002). Testing for normality. New York: Marcel Dekker.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. Journal of Experimental Education, 67, 55–68.
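The z-table areas used in this entry's worked IQ examples can also be computed directly, because the standard normal cumulative distribution function Φ can be written in terms of the error function, Φ(z) = (1 + erf(z/√2))/2. A minimal Python sketch, using the same μ = 100 and σ = 10 as the examples:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, 10.0  # mean and standard deviation from the IQ examples

# Proportion expected to score at or below 120 (z = +2):
print(round(phi((120 - mu) / sigma), 4))  # 0.9772

# Proportion expected between 85 (z = -1.5) and 95 (z = -0.5):
print(round(phi((95 - mu) / sigma) - phi((85 - mu) / sigma), 4))  # 0.2417
```

The two printed values match the table-based results of 97.72% and 24.17% obtained in the entry.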
NORMALITY ASSUMPTION
[...] values is not determined exclusively by random variability, and it also might be a result of unidentified systematic influences (or unmeasured predictors of the outcome).

The statistical tests assume that the data follow a normal distribution to preserve the tests' validity. When undertaking regression models, the normality assumption applies to the error term of the model (often called the residuals) and not the original data and, hence, it is often misunderstood in this context. It should be noted that the normality assumption is sufficient, but not necessary, for the validity of many hypothesis tests. The remainder of this entry focuses on the assessment of normality and the transformation of data that are not normally distributed.

Assessing Normality

A researcher can assess the normality of variables in several ways. To say a variable is normally distributed indicates that the distribution of observations for that variable follows the normal distribution. So in essence, if you examined the distribution graphically, it would look similar to the typical bell-shaped normal curve. A histogram, a box-and-whisker plot, or a normal quartile plot (often called a Q-Q plot) can be created to inspect the normality of the data visually. With many data analysis packages, a histogram can be requested with the normal distribution superimposed to aid in this assessment. Other standard measures of distribution exist and include skewness and kurtosis. Skewness refers to the symmetry of the distribution, in which right-skewed distributions have a long tail pointing to the right and left-skewed distributions have a long tail pointing to the left. Kurtosis refers to the peakedness of the distribution.

A box-and-whisker plot is created with five numeric summaries of the variable: the minimum value, the lower quartile, the median, the upper quartile, and the maximum value. The box is formed by the lower and upper quartiles bisected by the median. Whiskers are formed on the box plot by drawing a line from the lowest edge of the box (lower quartile) to the minimum value and from the highest edge of the box (upper quartile) to the maximum value. If the variable has a normal distribution, the box will be bisected in the middle by the median, and both whiskers will be of equal length.

A normal quartile plot compares the spacing of the data with that of the normal distribution. If the data being examined are approximately normal, then more observations should be clustered around the mean and only a few observations should exist in each of the tails. The vertical axis of the plot displays the actual data whereas the horizontal axis displays the quartiles from the normal distribution (expected z scores). If the data are normally distributed, the resulting plot will form a straight line with a slope of 1. If the line demonstrates an upward bending curve, the data are right skewed, whereas if the line demonstrates a downward bending curve, the data are left skewed. If the line has an S-shape, it indicates that the data are kurtotic.

Several common statistical tests were designed to assess normality. These include, but are not limited to, the Kolmogorov–Smirnov, the Shapiro–Wilk, the Anderson–Darling, and the Lilliefors tests. In each case, the test calculates a test statistic under the null hypothesis that the sample is drawn from a normal distribution. If the associated p value for the test statistic is greater than the selected alpha level, then one does not reject the null hypothesis that the data were drawn from a normal distribution. Some tests can be modified to test samples against other statistical distributions. The Shapiro–Wilk and the Anderson–Darling tests have been noted to perform better with small sample sizes. With all the tests, small deviations from normality can lead to a rejection of the null hypothesis, and therefore they should be used and interpreted with caution.

When a violation of the normality assumption is observed, it might be a sign that a better statistical model can be found. So, exploring why the assumption is violated might be fruitful. Non-normality of the error term might indicate that the resulting error is greater than expected under the assumption of true random variability and (especially when the distribution of the data is asymmetrical) might suggest that the observations come from more than one "true" underlying population. Additional variables could be added to the model (or the study) to predict systematically observed values not yet in the model, thereby moving more information to the linear predictor. Similarly, non-normality might reflect that variables in the model
are incorrectly specified (such as assuming there is a linear association between a continuous predictor variable and the outcome).

Research is often concerned with more than one variable, and with regression analysis or statistical modeling, the assumption is that the combination of variables under study follows a multivariate normal distribution. There are no direct tests for multivariate normality, and as such, each variable under consideration is considered individually for normality. If all the variables under study are normally distributed, then another assumption is made that the variables combined are multivariate normal. Although this assumption is made, it is not always the case that variables are normal individually and collectively. Note that when assessing normality in a regression modeling situation, the assessment of normality should be undertaken with the error term (residuals).

Note of Caution: Small Sample Sizes

When assessing normality with small sample sizes (samples with fewer than approximately 50 observations), caution should be exercised. Both the visual aids (histogram, box-and-whisker plot, and normal quartile plot) and the statistical tests (Kolmogorov–Smirnov, Shapiro–Wilk, Anderson–Darling, and Lilliefors tests) can provide misleading results. Departures from normality are difficult to detect with small sample sizes, largely because of the power of the test. The power of the statistical tests decreases as the significance level is decreased (as the statistical test is made more stringent) and increases as the sample size increases. So, with small sample sizes, the statistical tests will nearly always indicate acceptance of the null hypothesis even though departures from normality could be large. Likewise, with large sample sizes, the statistical tests become powerful, and often minor, inconsequential departures from normality would lead the researcher to reject the null hypothesis.

What to Do If Data Are Not Normally Distributed

If, after assessment, the data are not normally distributed, a transformation of the non-normal variables might improve normality. If after transformation the variable meets the normality assumption, the transformed variable can be substituted in the analysis. Interpretation of a transformed variable in an analysis needs to be undertaken with caution, as the scale of the variable will be related to the transformation and not the original units.

Based on Frederick Mosteller and John Tukey's Ladders of Power, if a researcher needs to remove right skewness from the data, then he or she moves "down" the ladder of power by applying a transformation smaller than 1, such as the square root, cube root, logarithm, or reciprocal. If the researcher needs to remove left skewness from the data, then he or she moves "up" the ladder of power by applying a transformation larger than 1, such as squaring or cubing.

Many analysts have tried other means to avoid non-normality of the error term, including categorizing the variable, truncating or eliminating extreme values from the distribution of the original variable, or restricting the study or experiment to observations within a narrower range of the original measure (where the residuals observed might form a "normal" pattern). None of these are ideal, as they might affect the measurement properties of the original variable and create problems with bias of estimates of interest and/or loss of statistical power in the analysis.

Jason D. Pole and Susan J. Bondy

See also Central Limit Theorem; Homogeneity of Variance; Law of Large Numbers; Normal Distribution; Type I Error; Type II Error; Variance

Further Readings

Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318–335.
Holgersson, H. E. T. (2006). A graphical method for assessing multivariate normality. Computational Statistics, 21, 141–149.
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169.
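Moving "down" the ladder of power can be illustrated with a short sketch: applying a log transformation to a right-skewed sample reduces its moment-based skewness. The data values below are invented for illustration.

```python
import math

def skewness(xs):
    """Moment-based sample skewness; approximately 0 for symmetric data."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# A right-skewed sample: most values are small, with a long right tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

# The logarithm (a transformation "smaller than 1" on the ladder)
# pulls in the long right tail.
logged = [math.log(x) for x in raw]

print(round(skewness(raw), 2))     # strongly positive (right skewed)
print(round(skewness(logged), 2))  # much closer to 0
```

The transformed variable would then be substituted in the analysis, with the interpretive caveats noted above.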
NORMALIZING DATA

Researchers often want to compare scores or sets of scores obtained on different scales. For example, how do we compare a score of 85 in a cooking contest with a score of 100 on an IQ test? To do so, we need to "eliminate" the unit of measurement; this operation means to normalize the data. There are two main types of normalization. The first type of normalization originates from linear algebra and treats the data as a vector in a multidimensional space. In this context, to normalize the data is to transform the data vector into a new vector whose norm (i.e., length) is equal to one. The second type of normalization originates from statistics and eliminates the unit of measurement by transforming the data into new scores with a mean of 0 and a standard deviation of 1. These transformed scores are known as z scores.

Normalization to a Norm of One

The Norm of a Vector

In linear algebra, the norm of a vector measures its length, which is equal to the Euclidean distance of the endpoint of this vector to the origin of the vector space. This quantity is computed (from the Pythagorean theorem) as the square root of the sum of the squared elements of the vector. For example, consider the following data vector, denoted y:

y = [35, 36, 46, 68, 70]ᵀ.   (1)

The norm of vector y is denoted ||y|| and is computed as

||y|| = √(35² + 36² + 46² + 68² + 70²) = √14,161 = 119.   (2)

Normalizing With the Norm

To normalize y, we divide each element by ||y|| = 119. The normalized vector, denoted ỹ, is equal to

ỹ = [35/119, 36/119, 46/119, 68/119, 70/119]ᵀ = [0.2941, 0.3025, 0.3866, 0.5714, 0.5882]ᵀ.   (3)

The norm of vector ỹ is now equal to one:

||ỹ|| = √(0.2941² + 0.3025² + 0.3866² + 0.5714² + 0.5882²) = √1 = 1.   (4)

Normalization Using Centering and Standard Deviation: z Scores

The Standard Deviation of a Set of Scores

Recall that the standard deviation of a set of scores expresses the dispersion of the scores around their mean. A set of N scores, each denoted Yn, whose mean is equal to M, has a standard deviation denoted Ŝ, which is computed as

Ŝ = √( Σ(Yn − M)² / (N − 1) ).   (5)

For example, the scores from vector y (see Equation 1) have a mean of 51 and a standard deviation of

Ŝ = √( [(35 − 51)² + (36 − 51)² + (46 − 51)² + (68 − 51)² + (70 − 51)²] / (5 − 1) )
  = √( [(−16)² + (−15)² + (−5)² + 17² + 19²] / 4 )
  = 17.   (6)
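Both types of normalization can be sketched in a few lines of Python. The vector y and the values checked below (norm 119, mean 51, standard deviation 17) come from Equations 1 through 6; the z-score step itself (center each score by the mean M and divide by Ŝ) follows the definition given at the start of this entry, namely new scores with a mean of 0 and a standard deviation of 1.

```python
import math

def norm(v):
    """Euclidean norm: the square root of the sum of squared elements."""
    return math.sqrt(sum(x * x for x in v))

def normalize_to_unit_norm(v):
    """First type: divide each element by the norm, giving a vector of norm 1."""
    n = norm(v)
    return [x / n for x in v]

def z_scores(v):
    """Second type: center by the mean and divide by the (N - 1) standard deviation."""
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))
    return [(x - m) / s for x in v]

y = [35, 36, 46, 68, 70]
print(norm(y))  # 119.0, as in Equation 2
print([round(x, 4) for x in normalize_to_unit_norm(y)])
# [0.2941, 0.3025, 0.3866, 0.5714, 0.5882], as in Equation 3

m = sum(y) / len(y)
s = math.sqrt(sum((x - m) ** 2 for x in y) / (len(y) - 1))
print(m, s)  # 51.0 17.0, as in Equation 6
z = z_scores(y)
print(sum(z) / len(z))  # mean of z is 0, up to floating-point rounding
```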
[...] The mean of vector z is now equal to zero, and its standard deviation is equal to one.

See also Variance; z Score

Further Readings

[...]

NUISANCE VARIABLE

[Figure: paired frequency histograms of math accuracy scores (horizontal axes 0 to 10) for the control and treatment groups.]

[...] For the first distribution, participants' math accuracy scores are relatively similar and cluster around the mean of the distribution. There are fewer very high scores and fewer very low scores [...] variable.

In an experimental study, a nuisance variable affects within-group differences for both the treatment group and the control group. When a nuisance variable is present, the spread of scores for each [...]
participants' math accuracy scores. In exercising statistical control, a researcher might employ regression techniques to control for any variation caused by a nuisance variable. However, in this case, it becomes important to specify and measure potential nuisance variables before and during the experiment. In the example used previously, participants could be given a measure of anxiety along with the measure of math accuracy. Multiple regression models might then be used to statistically control for the influence of anxiety on math accuracy scores alongside any experimental treatment that might be administered.

The term nuisance variable is often used alongside the terms extraneous and confounding variable. Whereas an extraneous variable influences differences observed between groups, a nuisance variable influences differences observed within groups. By eliminating the effects of nuisance variables, the tests of the null hypothesis become more powerful in uncovering group differences.

Cynthia R. Davis

See also Confounding; Control Variables; Statistical Control

Further Readings

Breaugh, J. A. (2006). Rethinking the control of nuisance variables in theory testing. Journal of Business and Psychology, 20, 429–443.
Meehl, P. (1970). Nuisance variables and the ex post facto design. In M. Radner & S. Winokur (Eds.), Minnesota studies in the philosophy of science: Vol. IV. Analyses of theories and methods of physics and psychology (pp. 373–402). Minneapolis: University of Minnesota Press.

NULL HYPOTHESIS

In many sciences, including ecology, medicine, and psychology, null hypothesis significance testing (NHST) is the primary means by which the numbers comprising the data from some experiment are translated into conclusions about the question(s) that the experiment was designed to address. This entry first provides a brief description of NHST and, within the context of NHST, defines the most common incarnation of a null hypothesis. Second, this entry sketches other less common forms of a null hypothesis. Third, this entry articulates several problems with using null hypothesis-based data analysis procedures.

Null Hypothesis Significance Testing and the Null Hypothesis

Most experiments entail measuring the effect(s) of some number of independent variables on some dependent variable.

An Example Experiment

In the simplest sort of experimental design, one measures the effect of a single independent variable, such as the amount of information held in short-term memory, on a single dependent variable, such as the reaction time to scan through this information. To pick a somewhat arbitrary example from cognitive psychology, consider what is known as a Sternberg experiment, in which a short sequence of memory digits (e.g., "34291") is read to an observer who must then decide whether a single, subsequently presented test digit was part of the sequence. Thus, for instance, given the memory digits above, the correct answer would be "yes" for a test digit of "2" but "no" for a test digit of "8." The independent variable of "amount of information held in short-term memory" can be implemented by varying set size, which is the number of memory digits presented: In different conditions, the set size might be, say, 1, 3, 5 (as in the example), or 8 presented memory digits. The number of different set sizes (here 4) is more generally referred to as the number of levels of the independent variable. The dependent variable is the reaction time measured from the appearance of the test digit to the observer's response. Of interest in general is the degree to which the magnitude of the dependent variable (here, reaction time) depends on the level of the independent variable (here, set size).

Sample and Population Means

Typically, the principal dependent variable takes the form of a mean. In this example, the mean reaction time for a given set size could be computed across observers. Such a computed mean is called a sample mean, referring to its having been computed across an observed sample of numbers.
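Computing the sample means Mj amounts to averaging the dependent variable within each level of the independent variable. In the sketch below, the reaction times (in milliseconds) for three observers at each set size are invented for illustration:

```python
# Hypothetical reaction times (ms) for three observers at each
# set-size level of the Sternberg task described above.
rts = {
    1: [410, 450, 430],
    3: [500, 520, 510],
    5: [580, 600, 590],
    8: [690, 710, 700],
}

# The sample mean M_j for level j estimates the population mean mu_j.
sample_means = {j: sum(xs) / len(xs) for j, xs in rts.items()}
print(sample_means)  # {1: 430.0, 3: 510.0, 5: 590.0, 8: 700.0}
```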
A sample mean is construed as an estimate of beyond the scope of this entry, but two remarks
a corresponding population mean, which is what about the process are appropriate here.
the mean value of the dependent variable would
be if all observers in the relevant population 1. A major ingredient in the decision is the vari-
were to participate in a given condition of the ability of the Mj s. To the degree that the Mj s are
experiment. Generally, conclusions from experi- close to one another, evidence ensues for possible
ments are meant to apply to population means. equality of the μj s and, ipso facto, validity of the
Therefore, the measured sample means are only null hypothesis. Conversely, to the degree that the
interesting insofar as they are estimates of the Mj s differ from one another, evidence ensues for
corresponding population means. associated differences among the μj s and, ipso
Notationally, the sample means are referred to facto, validity of the alternative hypothesis.
as the Mj s, whereas the population means are 2. The asymmetry between the null hypothesis
referred to as the μj s. For both sample and popula- (which is exact) and the alternative hypothesis
tion means, the subscript ‘‘j’’ indexes the level of (which is inexact) sketched previously implies an
the independent variable; thus, in our example, associated asymmetry in conclusions about their
M2 would refer to the observed mean reaction time of the second set-size level (i.e., set size = 3) and likewise, μ2 would refer to the corresponding, unobservable population mean reaction time corresponding to set size = 3.

Two Competing Hypotheses

NHST entails establishing and evaluating two mutually exclusive and exhaustive hypotheses about the relation between the independent variable and the dependent variable. Usually, and in its simplest form, the null hypothesis (abbreviated H0) is that the independent variable has no effect on the dependent variable, whereas the alternative hypothesis (abbreviated H1) is that the independent variable has some effect on the dependent variable. Note an important asymmetry between a null hypothesis and an alternative hypothesis: A null hypothesis is an exact hypothesis, whereas an alternative hypothesis is an inexact hypothesis. By this it is meant that a null hypothesis can be correct in only one way, viz., the μj's are all equal to one another, whereas there are an infinite number of ways in which the μj's can be different from one another (i.e., an infinite number of ways in which an alternative hypothesis can be true).

Decisions Based on Data

Having established a null and an alternative hypothesis that are mutually exclusive and exhaustive, the experimental data are used to—roughly speaking—decide between them. The technical manner by which one makes such a decision is […] validity. If the Mj's differ sufficiently, one "rejects the null hypothesis" in favor of accepting the alternative hypothesis. However, if the Mj's do not differ sufficiently, one does not "accept the null hypothesis" but rather one "fails to reject the null hypothesis." The reason for the awkward, but logically necessary, wording of the latter conclusion is that, because the alternative hypothesis is inexact, one cannot generally distinguish a genuinely true null hypothesis on the one hand from an alternative hypothesis entailing small differences among the μj's on the other hand.

Multifactor Designs: Multiple Null Hypothesis–Alternative Hypothesis Pairings

So far, this entry has described a simple design in which the effect of a single independent variable on a single dependent variable is examined. Many, if not most, experiments use multiple independent variables and are known as multifactor designs ("factor" and "independent variable" are synonymous). Continuing with the example experiment, imagine that in addition to measuring the effects of set size on reaction time in a Sternberg task, one also wanted to measure simultaneously the effects on reaction time of the test digit's visual contrast (informally, the degree to which the test digit stands out against the background). One might then factorially combine the four levels of set size (now called "factor 1") with, say, two levels, "high contrast" and "low contrast," of test-digit contrast (now called "factor 2"). Combining the four set-size levels with the two test-digit contrast levels would yield 4 × 2 = 8 separate conditions.
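Before turning to how such multifactor designs are analyzed, the reject/fail-to-reject logic just described can be made concrete. The sketch below is one common instantiation of NHST for the single-factor Sternberg example, a one-way analysis of variance run on simulated data; the 35 ms/item effect, the noise level, and the sample sizes are all invented for illustration (Python with NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
set_sizes = [1, 3, 5, 7]

# Simulate 15 reaction times (ms) per set-size condition, with an
# assumed true effect of 35 ms per memorized item plus Gaussian noise.
groups = [400 + 35 * s + rng.normal(0, 30, size=15) for s in set_sizes]

# One-way ANOVA: H0 says the population means (the mu_j's) are all equal.
f_stat, p_value = stats.f_oneway(*groups)

alpha = 0.05
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"  # never "accept H0"

print(f"F = {f_stat:.1f}, p = {p_value:.2g}; decision: {decision}")
```

Note the asymmetry in the two branches: a significant result licenses the strong conclusion, whereas a nonsignificant result licenses only the weak "fail to reject" conclusion discussed above.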
Typically, three independent NHST procedures would then be carried out, entailing three null hypothesis–alternative hypothesis pairings. They are as follows:

1. For the set-size main effect:

H0: Averaged over the two test-digit contrasts, there is no set-size effect

H1: Averaged over the two test-digit contrasts, there is a set-size effect

2. For the test-digit contrast main effect:

H0: Averaged over the four set sizes, there is no test-digit contrast effect

H1: Averaged over the four set sizes, there is a test-digit contrast effect

3. For set-size by test-digit contrast interaction:

Two independent variables are said to interact if the effect of one independent variable depends on the level of the other independent variable. As with the main effects, interaction effects are immediately identifiable with respect to the Mj's; however, again as with main effects, the goal is to decide whether interaction effects exist with respect to the corresponding μj's. As with the main effects, NHST involves pitting a null hypothesis against an associated alternative hypothesis.

H0: With respect to the μj's, set size and test-digit contrast do not interact.

H1: With respect to the μj's, set size and test-digit contrast do interact.

Each of the null hypotheses described to this point specifies that there is no effect of an independent variable or no interaction between two independent variables. This kind of no-effect null hypothesis is by far the most common null hypothesis to be found in the literature. Technically, however, a null hypothesis can be any exact hypothesis; that is, the null hypothesis of "all μj's are equal to one another" is but one special case of what a null hypothesis can be.

To illustrate another form, let us continue with the first, simpler Sternberg-task example (set size is the only independent variable), but imagine that prior research justifies the assumption that the relation between set size and reaction time is linear. Suppose also that research with digits has yielded the conclusion that reaction time increases by 35 ms for every additional digit held in short-term memory; that is, if reaction time were plotted against set size, the resulting function would be linear with a slope of 35 ms.

Now, let us imagine that the Sternberg experiment is done with words rather than digits. One could establish the null hypothesis that "short-term memory processing proceeds at the same rate with words as it does with digits" (i.e., that the slope of the reaction time versus set-size function would be 35 ms for words just as it is known to be with digits). The alternative hypothesis would then be "for words, the function's slope is anything other than 35 ms." Again, the fundamental distinction between a null and an alternative hypothesis is that the null hypothesis is exact (35 ms/digit), whereas the alternative hypothesis is inexact (anything else). This distinction would again drive the asymmetry between conclusions, which was articulated previously: A particular pattern of empirical results could logically allow "rejection of the null hypothesis," that is, "acceptance of the alternative hypothesis," but not "acceptance of the null hypothesis."
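This exact but non-zero null hypothesis can be tested directly: fit a straight line to the word data and compare the fitted slope to the 35 ms/digit benchmark with a t test. The following sketch uses simulated data whose true slope (60 ms/item) is an invented value for illustration; Python with NumPy and SciPy assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated word-condition data: 10 reaction times (ms) at each set
# size, with an assumed true slope of 60 ms/item (invented value).
set_size = np.repeat([1, 3, 5, 7], 10)
rt = 450 + 60 * set_size + rng.normal(0, 20, size=set_size.size)

res = stats.linregress(set_size, rt)  # least-squares slope and its SE

# Exact, non-zero null hypothesis: the slope is 35 ms/digit.
t = (res.slope - 35) / res.stderr
df = set_size.size - 2
p = 2 * stats.t.sf(abs(t), df)

print(f"slope = {res.slope:.1f} ms/item, t({df}) = {t:.1f}, p = {p:.2g}")
```

A nonsignificant t here would, once again, permit only a failure to reject the 35 ms null hypothesis, not its acceptance.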
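For the 4 × 2 multifactor design, the three null hypothesis–alternative hypothesis pairings (two main effects and the interaction) are conventionally evaluated with a two-way analysis of variance. The sketch below computes the three F tests from their sums of squares for a balanced design; the simulated effects (35 ms/item, a 30 ms cost for low contrast, no built-in interaction) are assumptions made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
set_sizes = [1, 3, 5, 7]      # factor 1: a = 4 levels
contrasts = ["high", "low"]   # factor 2: b = 2 levels
n = 20                        # observations per cell

# Simulated reaction times: 35 ms/item set-size effect, a 30 ms cost
# for low contrast, and no built-in interaction (all invented values).
data = {(s, c): 400 + 35 * s + (30 if c == "low" else 0)
               + rng.normal(0, 25, size=n)
        for s in set_sizes for c in contrasts}

a, b = len(set_sizes), len(contrasts)
grand = np.mean([y for cell in data.values() for y in cell])
mean_s = {s: np.mean([data[(s, c)] for c in contrasts]) for s in set_sizes}
mean_c = {c: np.mean([data[(s, c)] for s in set_sizes]) for c in contrasts}

# Sums of squares for the two main effects, the interaction, and error.
ss_s = n * b * sum((mean_s[s] - grand) ** 2 for s in set_sizes)
ss_c = n * a * sum((mean_c[c] - grand) ** 2 for c in contrasts)
ss_sc = n * sum((data[(s, c)].mean() - mean_s[s] - mean_c[c] + grand) ** 2
                for s in set_sizes for c in contrasts)
ss_err = sum(((data[(s, c)] - data[(s, c)].mean()) ** 2).sum()
             for s in set_sizes for c in contrasts)

df_s, df_c, df_sc, df_err = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
ms_err = ss_err / df_err

# One F test per null hypothesis-alternative hypothesis pairing.
for label, ss, df in [("set-size main effect", ss_s, df_s),
                      ("contrast main effect", ss_c, df_c),
                      ("interaction", ss_sc, df_sc)]:
    f = (ss / df) / ms_err
    p = stats.f.sf(f, df, df_err)
    print(f"{label}: F({df}, {df_err}) = {f:.1f}, p = {p:.2g}")
```

Each of the three F ratios corresponds to one of the pairings listed earlier, and each yields its own reject or fail-to-reject decision.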
A Null Hypothesis Cannot Be Literally True

In most sciences, it is almost a self-evident truth that any independent variable must have some effect, even if small, on any dependent variable. This is certainly true in psychology. In the Sternberg task, to illustrate, it is simply implausible that set size would have literally zero effect on reaction time (i.e., that the μj's corresponding to the different set sizes would be identical to an infinite number of decimal places). Therefore, rejecting a null hypothesis—which, as noted, is the only strong conclusion that is possible within the context of NHST—tells the investigator nothing that the investigator should have been able to realize was true beforehand. Most investigators do not recognize this, but that does not prevent it from being so.

Human Nature Makes Acceptance of a Null Hypothesis Almost Irresistible

Earlier, this entry detailed why it is logically forbidden to accept a null hypothesis. However, human nature dictates that people do not like to make weak yet complicated conclusions such as "We fail to reject the null hypothesis." Scientific investigators, generally being humans, are not exceptions. Instead, a "fail to reject" decision, which is dutifully made in an article's results section, often morphs into "the null hypothesis is true" in the article's discussion and conclusions sections. This kind of sloppiness, although understandable, has led to no end of confusion and general scientific mischief within numerous disciplines.

Null Hypothesis Significance Testing Emphasizes Barren, Dichotomous Conclusions

Earlier, this entry described that the pattern of population means—the relations among the unobservable μj's—is of primary interest in most scientific experiments and that the observable Mj's are estimates of the μj's. Accordingly, it should be of great interest to assess how good the Mj's are as estimates of the μj's. If, to use an extreme example, the Mj's were perfect estimates of the μj's, there would be no need for statistical analysis: The answers to any question about the μj's would be immediately available from the data. To the degree that the estimates are less good, one must exercise concomitant caution in using the Mj's to make inferences about the μj's.

None of this is relevant within the process of NHST, which does not in any way emphasize the degree to which the Mj's are good estimates of the μj's. In its typical form, NHST allows only a limited assessment of the nature of the μj's: Are they all equal or not? Typically, the "no" or "not necessarily no" conclusion that emerges from this process is insufficient to evaluate the totality of what the data might potentially reveal about the nature of the μj's.

An alternative that is gradually emerging within several NHST-heavy sciences—an alternative that is common in the natural sciences—is the use of confidence intervals that assess directly how good an Mj is as an estimate of the corresponding μj. Briefly, a confidence interval is an interval constructed around a sample mean that, with some pre-specified probability (typically 95%), includes the corresponding population mean. A glance at a set of plotted Mj's with associated plotted confidence intervals provides immediate and intuitive information about (a) the most likely pattern of the μj's and (b) the reliability of the pattern of Mj's as an estimate of the pattern of μj's. This in turn provides immediate and intuitive information both about the relatively uninteresting question of whether some null hypothesis is true and about the much more interesting questions of what the pattern of μj's actually is and how much belief can be placed in it based on the data at hand.

Geoffrey R. Loftus

See also Confidence Intervals; Hypothesis; Research Hypothesis; Research Question

Further Readings

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.

Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.

Fidler, F., Burgman, M., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539–1544.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7, 1–20.

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

Lecoutre, M. P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of null hypothesis significance tests. International Journal of Psychology, 38, 37–45.

Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.

Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33–38.

NUREMBERG CODE

The term Nuremberg Code refers to the set of standards for conducting research with human subjects that was developed in 1947 at the end of World War II, in the trial of 23 Nazi doctors and scientists in Nuremberg, Germany, for war crimes that included medical experiments on persons designated as non-German nationals. The trial of individual Nazi leaders by the Nuremberg War Crimes Tribunal, the supranational institution charged with determining justice in the transition to democracy, set a vital precedent for international jurisprudence.

The Nuremberg Code was designed to protect the autonomy and rights of human subjects in medical research, as compared with the Hippocratic Oath applied in the therapeutic, paternalistic patient-physician relationship. It is recognized as initiating the modern international human rights movement during social construction of ethical codes, with the Universal Declaration of Human Rights in 1948. Human subjects abuse by Nazi physicians occurred despite German guidelines for protection in experimentation, as noted by Michael Branigan and Judith Boss. Because the international use of prisoners in research had grown during World War II, the Code required that children, prisoners, and patients in mental institutions were not to be used as subjects in experiments. However, it was reinterpreted to expand medical research in the Declaration of Helsinki by the World Medical Association in 1964.

The legal judgment in the Nuremberg trial by a panel of American and European physicians and scientists contained 10 moral, ethical, and legal requirements to guide researchers in experiments with human subjects. These requirements are as follows: (1) voluntary informed consent based on legal capacity and without coercion is essential; (2) research should be designed to produce results for the good of society that are not obtainable by other means; (3) human subjects research should be based on prior animal research; (4) physical and mental suffering must be avoided; (5) no research should be conducted for which death or disabling injury is anticipated; (6) risks should be justified by anticipated humanitarian benefits; (7) precautions and facilities should be provided to protect research subjects against potential injury, disability, or death; (8) research should only be conducted by qualified scientists; (9) the subject should be able to end the study during the research; (10) the scientist should be able to end the research at any stage if potential for injury, disability, or death of the subject is recognized.

Impact on Human Subjects Research

At the time the Nuremberg Code was formulated, many viewed it as created in response to Nazi medical experimentation and without legal authority in the United States and Europe. Some American scientists considered the guidelines implicit in their human subjects research, applying to nontherapeutic research in wartime. The informed consent requirement was later incorporated into biomedical research, and physicians continued to be guided by the Hippocratic Oath for clinical research.

Reinterpretation of the Nuremberg Code in the Declaration of Helsinki for medical research modified requirements for informed consent and subject recruitment, particularly in pediatrics, psychiatry, and research with prisoners. Therapeutic research was distinguished from nontherapeutic research, and therapeutic privilege was legitimated in the patient-physician relationship.

However, social and biomedical change, as well as the roles of scientists and ethicists in research
and technology, led to the creation of an explicitly subjects-centered approach to human rights. This has been incorporated into research in medicine and public health, and in behavioral and social sciences.

Biomedical, Behavioral, and Community Research

With the enactment of federal civil and patient rights legislation, the erosion of public trust, and the extensive criticism of ethical violations and discrimination in the Tuskegee syphilis experiments by the United States Public Health Service (1932–1971), biomedical and behavioral research with human subjects became regulated by academic and hospital-based institutional review boards. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research was established in the United States in 1974, with requirements for review boards in institutions supported by the Department of Health, Education and Welfare.

In 1979, the Belmont Report set four ethical principles for human subjects research: (1) beneficence and nonmalfeasance, to maximize benefits and minimize risk of harm; (2) respect for autonomy in decision-making and protection of those with limited autonomy; (3) justice, for fair treatment; and (4) equitable distribution of benefits and risks. The Council for International Organizations of Medical Sciences and the World Health Organization formulated International Ethical Guidelines for Biomedical Research Involving Human Subjects for research ethics committees in 1982. Yet in 1987, the United States Supreme Court refused to endorse the Nuremberg Code as binding on all research, and it was not until 1997 that national security research had to be based on informed consent.

Institutional review boards (IRBs) were established to monitor informed consent and avoid risk and exploitation of vulnerable populations. Research organizations and Veterans Administration facilities might be accredited by the Association for the Accreditation of Human Research Protection Programs. Although IRBs originated for biomedical and clinical research, their purview extends to behavioral and social sciences. Nancy Shore and colleagues note that public health emphasis on community-based participatory research is shifting focus from protection of individual subjects to ethical relationships with community members and organizations as partners.

Yet global clinical drug trials by pharmaceutical companies and researchers' efforts to limit restrictions on placebo-controlled trials have contributed to the substitution of "Good Clinical Practice Rules" for the Declaration of Helsinki by the U.S. Food and Drug Administration in 2004. The development of rules by regulators and drug industry trade groups and the approval by untrained local ethics committees in developing countries could diminish voluntary consent and benefits for research subjects.

Subsequent change in application of the Nuremberg Code has occurred with the use of prisoners in clinical drug trials, particularly for HIV drugs, according to Branigan and Boss. This practice is illegal for researchers who receive federal support but is legal in some states. The inclusion of prisoners and recruitment of subjects from ethnic or racial minority groups for clinical trials might offer them potential benefits, although it must be balanced against risk and need for justice.

Sue Gena Lurie

See also Declaration of Helsinki

Further Readings

Beauchamp, D., & Steinbock, B. (1999). New ethics for the public's health. Oxford, UK: Oxford University Press.

Branigan, M., & Boss, J. (2001). Human and animal experimentation. In Healthcare ethics in a diverse society. Mountain View, CA: Mayfield.

Cohen, J., Bankert, E., & Cooper, J. (2006). History and ethics. CITI course in the protection of human research subjects. Retrieved December 6, 2006, from http://www.citiprogram.org/members/courseandexam/moduletext

Elster, J. (2004). Closing the books: Transitional justice in historical perspective. New York: Cambridge University Press.

Farmer, P. (2005). Rethinking health and human rights. In Pathologies of power. Berkeley: University of California Press.

Levine, R. (1981). The Nuremberg Code. In Ethics and regulation of clinical research. Baltimore: Urban & Schwarzenberg.
Levine, R. (1996). The Institutional Review Board. In S. S. Coughlin & T. L. Beauchamp (Eds.), Ethics and epidemiology. New York: Oxford University Press.

Lifton, R. (2000). The Nazi doctors. New York: Basic Books.

Morrison, E. (2008). Health care ethics: Critical issues for the 21st century. Sudbury, MA: Jones & Bartlett.

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Washington, DC: U.S. Department of Health, Education and Welfare.

Shah, S. (2006). The body hunters: Testing new drugs on the world's poorest patients. New York: New Press.

Shore, N., Wong, K., Seifer, S., Grignon, J., & Gamble, V. (2008). Introduction: Advancing the ethics of community-based participatory research. Journal of Empirical Research on Human Research Ethics, 3, 1–4.

NVIVO

NVivo provides software tools to assist a researcher from the time of conceptualization of a project through to its completion. Although NVivo is software that is designed primarily for researchers undertaking analysis of qualitative (text and multimedia) data, its usefulness extends to researchers engaged in any kind of research. The tools provided by NVivo assist in the following:

• Tracking and management of data sources and information about these sources
• Tracking and linking ideas associated with or derived from data sources
• Searching for terms or concepts
• Indexing or coding text or multimedia information for easy retrieval
• Organizing codes to provide a conceptual framework for a study
• Querying relationships between concepts, themes, or categories
• Building and drawing visual models with links to data

Particular and unique strengths of NVivo lie in its ability to facilitate work involving complex data sources in a variety of formats, in the range of the query tools it offers, and in its ability to link quantitative with qualitative data.

The consideration of NVivo is relevant to the practical task of research design in two senses: Its tools are useful when designing or preparing for a research project, and if it is to be used also for the analysis of qualitative or mixed methods data, then consideration needs to be given to designing and planning for its use.

Designing With NVivo

NVivo can assist in the research design process in (at least) three ways, regardless of the methodological approach to be adopted in the research: keeping a research journal, working with literature, and building conceptual models.

Keeping a Journal

Keeping a record of decisions made when planning and conducting a research project, tracking events occurring during the project (foreseen and unforeseen), or even recording random thoughts about the project will assist an investigator to prepare an accurate record of the methods adopted for the project and the rationale for those methods. Journaling serves also to stimulate thinking, to prevent loss of ideas that might be worthy of follow-up, and to provide an audit trail of development in thinking toward final conclusions.

NVivo can work with text that has been recorded using Microsoft Word, or a journal can be recorded progressively within NVivo. Some researchers keep all their notes in a single document, whereas others prefer to make several documents, perhaps to separate methodological from substantive issues. The critical contribution that NVivo can make here is to assist the researcher in keeping track of their ideas through coding the content of the journal. A coding system in NVivo works rather like an index but with a bonus. The text in the document is highlighted and tagged with a code (a label that the researcher devises). Codes might be theoretically based and designed a priori, or they can be created or modified (renamed, rearranged, split, combined, or content recoded) in an emergent way as the project proceeds. The bonus of using codes in NVivo is that all materials that have been coded in a particular way can be retrieved together, and if needed for clarification, any coded segment can be shown
within its original context. A research journal is a notoriously messy document, comprising random thoughts, notes of conversations, ideas from reading, and perhaps even carefully considered strategies that entered in no particular sequence. By coding their journal, researchers can find instantly any thoughts or information they have on any particular aspect of their project, regardless of how messy the original document was. This different view of what has been written not only brings order out of chaos and retrieves long-forgotten thoughts but also prompts deeper thinking and perhaps reconceptualization of that topic after visualizing all the material on one topic together.

NVivo's coding system can be used also to index, retrieve, and synthesize what is learned from reading across the substantive, theoretical, or methodological literature during the design phase of a project. In the same way that a journal can be coded, either notes derived from reading or the text of published articles can be coded for retrieval and reconsideration according to either an a priori or emergent system (or a combination thereof). Thus, the researcher can locate and bring together all their material from any of their references on, for example, the concept of equity, debates about the use of R², or the role of antioxidants in preventing cancer. With appropriate setting up, the author(s) and year of publication for any segment of coded text can be retrieved alongside the text, which facilitates the preparation of a written review of the literature. The database created in NVivo becomes available for this and many more projects, and it serves as an ongoing database for designing, conducting, and writing up future projects.

A preliminary review of the literature can be extended into a more thorough analysis by drawing on NVivo's tools for recording and using information about sources (referred to as attributes) in comparative analyses or by examining the relationship between, say, perspectives on one topic and what is said about another. Recorded information about each reference could be used, for example, to review changes over time in perspectives on a particular concept or perhaps to compare the key concerns of North American versus European writers. The associations between codes, or between attributes and codes, could be used to review the relationship between an author's theoretical perspective and his or her understanding of the likely impact of a planned intervention.

Searching text in NVivo provides a useful supplement to coding. One could search the reports of a group of related studies, for example, for the alternative words "extraneous OR incidental OR unintended" to find anything written about the potential impact of extraneous variables on the kind of experiment being planned.

Researchers often find it useful at the start of a project to "map" their ideas about their experimental design or about what they are expecting to find from their data gathering. Doing so can help to identify all the factors that will possibly impinge on the research process and to clarify the pathways by which different elements will impact on the process and its outcomes. As a conceptual or process model is drawn, fresh awareness of sampling or validity issues might be prompted and solutions sought. NVivo provides a modeling tool in which items and their links can be shown. A variety of shapes can be used in designing the model. Project items such as codes, cases, or attributes can be added to the model, and where coding is present, these codes provide a direct link back to the data they represent. Labels can be added to links (which might be associative, unidirectional, or bidirectional), and styles (e.g., color, fill, and font) can be used to emphasize the significance of different items or links. The items can be grouped so that they can be turned on or off in the display. The models can be archived, allowing the researcher to continue to modify their model as their understanding grows while keeping a historical record of their developing ideas.

Designing for Analysis With NVivo

Where the intention is to use NVivo for analysis of qualitative or mixed-methods data, there are
[…] that arise as a consequence of asking other questions (so that results from one question are fed into another). Queries can be saved so that they can be run again, with more data or with a different subset of data. Relevant text is retrieved for review and drawing inferences; patterns of coding reflected in numbers of sources, cases, or words are available in numeric form or as charts. All coding information, including results from queries, can be exported in numeric form for subsequent statistical analysis if appropriate—but always with the supporting text readily available in the NVivo database to give substance to the numbers.

Pat Bazeley

See also Demographics; Focus Group; Interviewing; Literature Review; Mixed Methods Design; Mixed Model Design; Observations; Planning Research; Qualitative Research

Further Readings

Bazeley, P. (2006). The contribution of computer software to integrating qualitative and quantitative data and analyses. Research in the Schools, 13, 64–73.

Bazeley, P. (2007). Qualitative data analysis with NVivo. Thousand Oaks, CA: Sage.

Richards, L. (2005). Handling qualitative data. Thousand Oaks, CA: Sage.

Websites

QSR International: http://www.qsrinternational.com

Research Support Pty. Limited: http://www.researchsupport.com.au
O
OBSERVATIONAL RESEARCH

The observation of human and animal behavior has been referred to as the sine qua non of science, and indeed, any research concerning behavior ultimately is based on observation. A more specific term, naturalistic observation, traditionally has referred to a set of research methods wherein the emphasis is on capturing the dynamic or temporal nature of behavior in the environment where it naturally occurs, rather than in a laboratory where it is experimentally induced or manipulated. What is unique about the more general notion of observational research, however, and what has made it so valuable to science is the fact that the process of direct systematic observation (that is, the what, when, where, and how of observation) can be controlled to varying degrees, as necessary, while still permitting behavior to occur naturally and over time. Indeed, the control of what Roger Barker referred to as "the stream of behavior," in his 1962 book by that title, may range from a simple specification of certain aspects of the context for comparative purposes (e.g., diurnal vs. nocturnal behaviors) to a full experimental design involving the random assignment of participants to strictly specified conditions.

Even the most casual observations have been included among these research methods, but they typically involve, at a minimum, a systematic process of specifying, selecting, and sampling behaviors for observation. The behaviors considered might be maximally inclusive, such as in the case of the ethogram, which attempts to provide a comprehensive description of all of the characteristic behavior patterns of a species, or they might be restricted to a much smaller set of behaviors, such as the social behaviors of jackdaws, as studied by the Nobel Prize-winning ethologist Konrad Lorenz, or the facial expressions of emotion in humans, as studied by the psychologist Paul Ekman. Thus, the versatile set of measurement methods referred to as observational research emphasizes temporally dynamic behaviors as they naturally occur, although the conditions of observation and the breadth of behaviors observed will vary with the research question(s) at hand.

Because of the nature of observational research, it is often better suited to hypothesis generation than to hypothesis testing. When hypothesis testing does occur, it is limited to the study of the relationship(s) between/among behaviors, rather than to the causal links between them, as is the focus of experimental methods with single or limited behavioral observations and fully randomized designs. This entry discusses several aspects of observational research: its origins, the approaches, special considerations, and the future of observational research.

Origins

Historically, observational research has its roots in the naturalistic observational methods of Charles Darwin and other naturalists studying nonhuman
animals. The work of these 19th-century scientists spawned the field of ethology, which is defined as the study of the behavior of animals in their natural habitats. Observational methods are the primary research tools of the ethologist. In the study of human behaviors, a comparable approach is that of ethnography, which combines several research techniques (observations, interviews, and archival and/or physical trace measures) in a long-term investigation of a group or culture. This technique also involves immersion and even participation in the group being studied in a method commonly referred to as participant observation.

The use of observational research methods of various kinds can be found in all of the social sciences—including, but not limited to, anthropology, sociology, psychology, communication, political science, and economics—and in fields that range from business to biology, and from education to entomology. These methods have been applied in innumerable settings, from church services to prisons to psychiatric wards to college classrooms, to name a few.

Distinctions Among Methods

Whether studying humans or other animals, one of the important distinctions among observational research methods is whether the observer's presence is overt or obtrusive to the participants or covert or unobtrusive. In the former case, researchers must be wary of the problem of reactivity of measurement; that is, of measurement procedures where the act of measuring may, in all likelihood, change the behavior being measured. Reactivity can operate in a number of ways. For example, the physical space occupied by an observer under a particular tree or in the corner of a room may militate against the occurrence of the behaviors that would naturally occur in that particular location. More likely, at least in the case of the study of human behavior, participants may attempt to […] naturally in the absence of an observer. For example, anthropological linguists have observed the hypercorrection of speech pronunciation "errors" in lower- and working-class women when reading a list of words to an experimenter compared to when speaking casually. Presumably, compared to upper-middle-class speakers, they felt a greater need to "speak properly" when it was obvious that their pronunciation was the focus of attention. Although various techniques exist for limiting the effects of evaluation apprehension, obtrusive observational techniques can never fully guarantee the nonreactivity of their measurements.

In the case of unobtrusive observation, participants in the research are not made aware that they are being observed (at least not at the time of observation). This can effectively eliminate the problem of measurement reactivity, but it presents another issue to consider when the research participants are humans; namely, the ethics of making such observations. In practice, ethical considerations have resulted in limits to the kinds of behaviors that can be observed unobtrusively, as well as to the techniques (for example, the use of recording devices) that can be employed. If the behavior occurs in a public place where the person being observed cannot reasonably expect complete privacy, the observations may be considered acceptable. Another guideline involves the notion of minimal risk. Generally speaking, procedures that involve no greater risk to participants than they might encounter in everyday life are considered acceptable. Before making unobtrusive observations, researchers should take steps to solicit the opinions of colleagues and others who might be familiar with issues of privacy, confidentiality, and minimal risk in the kinds of situations involved in the research. Research conducted at institutions that receive federal funding will have an institutional review board composed of researchers and community members who review research protocols involving human participants and who will
control their behaviors in order to project a certain assist researchers in determining appropriate ethi-
image. One notable example in this regard has cal procedures in these and other circumstances.
been termed evaluation apprehension. Specifically,
human participants who know that they are being
Special Considerations
observed might feel apprehensive about being
judged or evaluated and might attempt to behave Observational research approaches generally include
in ways that they believe put them in the most pos- many more observations or data points than typi-
itive light, as opposed to behaving as they would cal experimental approaches, but they, too, are
reductionistic in nature; that is, although relatively more behaviors are observed and assessed, not all behaviors that occur during data collection may be studied. This fact raises some special considerations.

How Will the Behaviors Being Studied Be Segmented?

Aristotle claimed that "natural" categories are those that "carve at the joint." Some behaviors do seem to segment relatively easily via their observable features, such as speaking turns in conversation, or the beginning and end of an eye blink. For many other behaviors, beginnings and endings may not be so clear. Moreover, research has shown that observers asked to segment behaviors into the smallest units they found to be natural and meaningful formed different impressions than observers asked to segment behaviors into the largest units they found natural and meaningful, despite observing the same videotaped series of behaviors. The small-unit observers also were more confident of their impressions. Consumers of observational research findings should keep in mind that different strategies for segmenting behavior may result in different kinds of observations and inferences.

How Will Behavior Be Classified or Coded?

The central component of all observational systems is sometimes called a behavior code, which is a detailed description of the behaviors and/or events to be observed and recorded. Often, this code is referred to as a taxonomy of behavior. The best taxonomies consist of a set of categories with the features of being mutually exclusive (that is, every instance of an observed behavior fits into one and only one category of the taxonomy) and exhaustive (that is, every instance of an observed behavior fits into one of the available categories of the taxonomy).

Are the Classifications of Observed Behaviors Reliable Ones?

The coding of behaviors according to the categories of a taxonomy has, as a necessary condition, that the coding judgments are reliable ones. In the case of intrarater reliability, this means that an observer should make the same judgments regarding behavior codes if the behaviors are observed and classified again at another time. In the case of interrater reliability, two (or more) judges independently viewing the behaviors should make the same classifications or judgments. Although in practice reliability estimates seldom involve perfect agreement between judgments made at different times or by different coders, there are standards of disagreement accepted by researchers based upon the computations of certain descriptive and inferential statistics. The appropriate statistic(s) to use to make a determination of reliability depends upon the nature of the codes/variables being used. Correlations often are computed for continuous variables or codes (that is, for classifications that vary along some continuum; for example, degrees of displayed aggression), and Cohen's kappa coefficients often are computed for discrete or categorical variables or codes; for example, types of hand gestures.

What Behaviors Will Be Sampled?

The key to sampling is that there is a sufficient amount and appropriate kind of sampling performed such that one represents the desired population of behaviors (and contexts and types of participants) to which one would want to generalize. Various sampling procedures exist, as do statistics to help one ascertain the number of observations necessary to test the reliability of the measurement scheme employed and/or test hypotheses about the observations (for example, power analyses and tests of effect size).

Problems Associated With Observational Research

Despite all of the advantages inherent in making observations of ongoing behavior, a number of problems are typical of this type of research. Prominent among them is the fact that the development and implementation of reliable codes can be time-consuming and expensive, often requiring huge data sets to achieve representative samples and the use of recording equipment to facilitate reliable measurement. Special methods may be needed to prevent, or at least test for, what has been called observer drift. This term refers to the fact that, with prolonged observations, observers may be more likely to forget coding details, become fatigued, experience decreased motivation and attention, and/or learn confounding habits.
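The agreement statistics described above can be computed directly from two coders' records. The following sketch is purely illustrative (the Python function and the gesture codes are invented for this example, not taken from any study discussed here); it computes Cohen's kappa, the chance-corrected agreement index used for categorical codes, and recomputing it at intervals over a long study is one simple check for observer drift:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters' categorical codes:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e the agreement expected by
    chance from each rater's marginal category frequencies."""
    n = len(coder_a)
    # Observed agreement: proportion of episodes coded identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from the two raters' marginal distributions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical hand-gesture codes assigned independently by two observers.
coder_1 = ["point", "wave", "point", "shrug", "wave", "point", "shrug", "wave"]
coder_2 = ["point", "wave", "point", "wave", "wave", "point", "shrug", "shrug"]
print(round(cohens_kappa(coder_1, coder_2), 3))  # prints 0.619
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; conventions for acceptable values vary across fields, but values above roughly .60 are commonly treated as substantial.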
Finally, observational methods cannot be applied to hypotheses concerning phenomena not susceptible to direct observation, such as cognitive or affective variables. Indeed, care must be taken by researchers to be sure that actual observations (e.g., he smiled or the corners of his mouth were upturned or the zygomaticus major muscle was contracted) and not inferences (e.g., he was happy) are recorded as data.

Future Outlook

With the increasing availability and sophistication of computer technology, researchers employing observational research methods have been able to search for more complicated patterns of behavior, not just within an individual's behavior over time, but among interactants in dyads and groups as well. Whether the topic is family interaction patterns, courtship behaviors in Drosophila, or patterns of nonverbal behavior in doctor-patient interactions, a collection of multivariate statistical tools, including factor analyses, time-series analyses, and t-pattern analyses, has become available to the researcher to assist him or her in detecting the hidden yet powerful patterns of behavior that are available for observation.

Carol Toris

See also Cause and Effect; Cohen's Kappa; Correlation; Effect Size, Measures of; Experimental Design; Hypothesis; Laboratory Experiments; Multivariate Analysis of Variance (MANOVA); Naturalistic Observation; Power Analysis; Reactive Arrangements; Reliability; Sample Size Planning; Unit of Analysis

Further Readings

Barker, R. G. (Ed.). (1963). The stream of behavior: Explorations of its structure and content. New York: Appleton-Century-Crofts.

Campbell, D. T., & Stanley, J. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Ekman, P. (1982). Methods for measuring facial action. In K. R. Scherer & P. Ekman (Eds.), Handbook of methods in nonverbal behavior research. Cambridge, UK: Cambridge University Press.

Hawkins, R. P. (1982). Developing a behavior code. In D. P. Hartmann (Ed.), Using observers to study behavior. San Francisco: Jossey-Bass.

Jones, R. (1985). Research methods in the social and behavioral sciences. Sunderland, MA: Sinauer Associates.

Longabaugh, R. (1980). The systematic observation of behavior in naturalistic settings. In H. Triandis (Ed.), The handbook of cross-cultural psychology: II, Methodology. Boston: Allyn & Bacon.

Magnusson, M. S. (2005). Understanding social interaction: Discovering hidden structure with model and algorithms. In L. Anolli, S. Duncan, Jr., M. S. Magnusson, & G. Riva (Eds.), The hidden structure of interaction. Amsterdam: IOS Press.

Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Mahwah, NJ: Lawrence Erlbaum.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals. American Psychologist, 54, 594–604.

OBSERVATIONS

Observations refer to watching and recording the occurrence of specific behaviors during an episode of interest. The observational method can be employed in the laboratory as well as a wide variety of other settings to obtain a detailed picture of how behavior unfolds. This entry discusses types of observational design, methods for collecting observations, and potential pitfalls that may be encountered.

Types of Observational Designs

There are two types of observational design: naturalistic and laboratory observations. Naturalistic observations entail watching and recording behaviors in everyday environments such as animal colonies, playgrounds, classrooms, and retail settings. The main advantage of naturalistic observation is that it affords researchers the opportunity to study the behavior of animals and people in their natural settings. Disadvantages associated with naturalistic observations are lack of control over the setting; thus, confounding factors may come into play. Also, the behavior of interest may be extremely
including utilization of adaptation periods during which observers immerse themselves in the environment prior to data collection so that the subjects of their observation become accustomed to their presence.

Another potential danger is observer bias, in which observers' knowledge of the study hypotheses influences their recording of behavior. Observers may notice and note more behavior that is congruent with the study hypotheses than actually occurs. At the same time, they may not notice and note behavior that is incongruent with the study hypotheses. One means of lessening observer bias is to limit the information given to observers regarding the study hypotheses.

Lisa H. Rosen and Marion K. Underwood

See also Hawthorne Effect; Interrater Reliability; Naturalistic Observation; Observational Research

Further Readings

Dallos, R. (2006). Observational methods. In G. Breakwell, S. Hammond, C. Fife-Schaw, & J. Smith (Eds.), Research methods in psychology (pp. 124–145). Thousand Oaks, CA: Sage.

Margolin, G., Oliver, P. H., Gordis, E. B., O'Hearn, H. G., Medina, A. M., Ghosh, C. M., & Morland, L. (1998). The nuts and bolts of behavioral observation of marital and family interaction. Clinical Child and Family Psychology Review, 1, 195–213.

Pope, C., & Mays, N. (2006). Observational methods. In C. Pope & N. Mays (Eds.), Qualitative research in health care (pp. 32–42). Malden, MA: Blackwell.

OCCAM'S RAZOR

Occam's Razor (also spelled Ockham) is known as the principle of parsimony or the economy of hypotheses. It is a philosophical principle dictating that, all things being equal, simplicity is preferred over complexity. Traditionally, the Razor has been used as a philosophical heuristic for choosing between competing theories, but the principle is also useful for defining methods for empirical inquiry, selecting scientific hypotheses, and refining statistical models. According to Occam's Razor, a tool with fewer working parts ought to be selected over one with many, provided they are equally functional. Likewise, a straightforward explanation ought to be believed over one that requires many separate contingencies.

For instance, there are a number of possible reasons why a light bulb does not turn on when a switch is flipped: Aliens could have abducted the light bulb, the power could be out, or the filament within the bulb has burned out. The explanation requiring aliens is exceedingly complex, as it necessitates the existence of an unknown life form, a planet from which they have come, a motive for taking light bulbs, and so on. A power outage is not as complicated, but still requires an intricate chain of events, such as a storm, accident, or engineering problem. The simplest of these theories is that the light bulb has simply burned out. All theories provide explanations, but vary in complexity. Until proof corroborating one account surfaces, Occam's Razor requires that the simplest explanation be preferred above the others. Thus, the logical—and most likely correct—hypothesis is that the light bulb has burned out.

This entry begins with a brief history of Occam's Razor. It then discusses the implications for research. The entry concludes with some caveats related to the use of Occam's Razor.

History

Occam's Razor is named for the 14th-century English theologian, philosopher, and friar William of Occam. William, who was presumably from the city of Occam, famously suggested that "entities should not be multiplied beyond necessity." To do so, he explained, implied vanity and needlessly increased the chances of error. This principle had been formalized since the time of Aristotle, but Occam's unabashed and consistent use of the Razor helped Occam become one of the foremost critics of Thomas Aquinas.

Implications for Scientific Research

The reasons for emphasizing simplicity when conceptualizing and conducting research may seem obvious. Simple designs reduce the chance of experimenter error, increase the clarity of the results, obviate needlessly complex statistical analyses, conserve valuable resources, and curtail potential confounds.
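When the Razor is applied to refining statistical models, the trade-off between fit and complexity is often made explicit with a penalized index such as the Akaike information criterion (AIC), which charges each additional parameter against its improvement in fit. The sketch below is a hypothetical illustration, not part of this entry: it fits, by ordinary least squares, a regression with one informative predictor and a rival model that adds an irrelevant predictor, then compares their AICs.

```python
import math
import random

def ols_rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit,
    solving the normal equations (X'X)b = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for col in range(k):  # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return sum((yi - sum(bi * xi for bi, xi in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def aic(rss, n, k):
    # Least-squares form of the Akaike information criterion (constants dropped).
    return n * math.log(rss / n) + 2 * k

random.seed(0)
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]       # irrelevant predictor
y = [2.0 + 1.5 * a + random.gauss(0, 0.5) for a in x1]

simple = [[1.0, a] for a in x1]                   # intercept + x1
extended = [[1.0, a, c] for a, c in zip(x1, x2)]  # adds the useless x2

print("AIC, one predictor:", round(aic(ols_rss(simple, y), n, 2), 1))
print("AIC, irrelevant predictor added:", round(aic(ols_rss(extended, y), n, 3), 1))
```

Because the second predictor carries no information about y, its small reduction in residual error is typically outweighed by the penalty of two points per parameter, so the simpler model tends to show the lower (better) AIC; this is the sense in which an uninformative variable can be cut away.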
As such, the Razor can be a helpful guide when attempting to produce an optimal research design. Although often implemented intuitively, it may be helpful to review and refine proposed research methods with the Razor in mind.

Just as any number of tools can be used to accomplish a particular job, there are many potential methodological designs for each research question. Occam's Razor suggests that a tool with fewer working parts is preferable to one that is needlessly complicated. A correlation design that necessitates only examining government records may be more appropriate than an experimental design that necessitates recruiting, assigning, manipu-

with two independent variables. Because the third variable does not contribute information to the model, Occam's Razor can be used to cut it away.

Caveats

In practice, a strict adherence to Occam's Razor is usually impossible or ill-advised, as it is rare to find any two models or theories that are equivalent in all ways except complexity. Often, when some portion of a method, hypothesis, or theory is cut away, some explanatory or logical value must be sacrificed. In the previously mentioned case of the regression analysis, the addition of a third indepen-